
DATA ANALYTICS & MACHINE LEARNING IN FINANCE

LECTURE 1 – INTRODUCTION TO MACHINE LEARNING AND DATA ANALYTICS

Definitions: Big data is not equal to “a lot of data”. The moniker “big” stands for the three prominent characteristics of the
data we are going to analyse, also referred to as three Vs:
o Volume: refers to the quantity of data. The size of data collected and stored through records, transactions, tables,
files, etc. is very large; with the subjective lower bound for being called “big” being revised upward continuously;
o Variety: pertains to the array of available data sources. Data is often received in a variety of formats, be it
structured, semi-structured or unstructured, from within and outside the enterprise;
o Velocity: is the speed at which the data are created. The speed with which data is sent or received often marks it
as big data. Analysing such “data-in-motion” poses challenges since relevant patterns and insights might be
moving targets relative to situations of “data-at-rest”.
When using big data for inference or prediction, a “fourth V” becomes important: veracity specifically refers to the
credibility and reliability of different data sources.
Example of Big Data commonly used in Finance: Trades and Quotes (TAQ) data
Daily TAQ data (millisecond) provides access to all trades and quotes for all issues traded on NYSE, Nasdaq and the
regional exchanges for the previous trading day. For example, a single day of TAQ quotes can contain more than 100
million rows, so a database that includes current and historical TAQ can reach terabytes. These data can be used, for
example, to identify liquidity and to estimate price impact and execution costs.

Machine Learning is about extracting knowledge from data, and its basic goal is to help humans learn.
Machine learning is a research field at the intersection of statistics, artificial intelligence and computer science,
also commonly referred to as predictive analytics or statistical learning. As this suggests, machine learning is not the
same thing as artificial intelligence:
o Artificial intelligence is a program that can sense, reason, act and adapt. It can be more specifically described as
“the theory and development of computer systems able to perform tasks normally requiring human intelligence,
such as visual perception, speech recognition, decision-making, and translation between languages”.
o Machine Learning is made up of algorithms whose performance improves as they are exposed to more data over
time. More specifically, a computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Deep
Learning, instead, describes a subset of machine learning in which multi-layered neural networks learn from vast
amounts of data.
Algorithms describe a systematic set of operations to perform on a given data set; essentially, they describe a procedure.

[Figure: traditional programming approach vs. machine learning approach]

Applications of ML in Finance: Examples include
- fraud prevention;
- risk management;
- wealth management;
- investment predictions;
- customer service;
- digital assistants;
- marketing;
- network security;
- loan underwriting;
- algorithmic trading;
- process automation;
- document interpretation;
- content creation;
- trade settlement;
- money-laundering prevention;
- custom machine learning solutions.
Types of Machine Learning: we commonly distinguish among four types of Machine Learning: supervised, unsupervised,
semi-supervised and reinforcement learning. Their basic distinction lies in the type of data they use: supervised ML uses
labelled data, whereas unsupervised ML makes use of unlabelled data.

[Figure: labelled data vs. unlabelled data]

Supervised learning involves ML algorithms that infer patterns between a set of inputs (the X’s) and the desired output (Y).
The inferred pattern is then used to map a given input set into a predicted output. Supervised learning requires a labelled
dataset, one that contains matched sets of observed inputs and the associated outputs. In the context of finance, supervised
learning models represent the most-used class of machine learning models. The most common supervised learning tasks
are regression (predicting values) and classification (predicting classes). Some of the most important supervised learning
algorithms that we will cover include: linear regression, penalized regression, logistic regression, support vector machines
(SVMs), k-nearest neighbours, naïve Bayes classifiers, decision trees and random forests, and neural networks.
Unsupervised learning is machine learning that does not make use of labelled data. There are two types of unsupervised
learning: dimensionality reduction and clustering.
- Dimensionality reduction is the process of reducing the number of features, or variables, in a dataset while
preserving information and overall model performance. It is a common and powerful way to deal with datasets
that have a large number of dimensions;
- Clustering allows us to discover hidden structures in data. The goal of clustering is to find a natural grouping in
data so that items in the same cluster are more similar to each other than those from different clusters.

Reinforcement Learning (RL) is an approach toward training a machine (an agent) to find the best course of action
through optimal policies that maximize rewards and minimize punishments. The basic concepts in reinforcement learning
include the following:
- agent: the entity that performs actions;
- action: what an agent can do in each state;
- environment: the world in which the agent resides;
- state: the current situation of the agent;
- reward: the immediate return sent by the environment to evaluate the last action by the agent. A reward can be
positive (reward) or negative (punishment).

Applications of RL in Finance: algorithmic trading, derivatives hedging, portfolio allocation.


Using APIs: An application programming interface (API) is a computing interface that defines interactions between
multiple software intermediaries; it is thus a software-to-software interface. It defines the kinds of calls or requests that
can be made, how to make them, the data formats that should be used and the conventions to follow. Some examples
include: quandl (Python API for financial and economic datasets), fredapi (Python API for the FRED data provided by the
Federal Reserve Bank of St. Louis), or world_bank_data (Python API for the World Bank data).
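For illustration, a minimal sketch of pulling a series through one of these APIs (this assumes the fredapi package is installed and that you have obtained a FRED API key; the key string below is a placeholder):

import pandas as pd
from fredapi import Fred

# hypothetical placeholder key: replace with your own FRED API key
fred = Fred(api_key="YOUR_API_KEY")
gdp = fred.get_series("GDP")      # US nominal GDP as a pandas Series indexed by date
print(gdp.tail())                 # most recent observations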

LAB 1 – UNDERSTANDING ALGORITHMS

Learning algorithms usually consist of three key parts:


- a loss function
- an optimization criterion based on the loss function (a cost function, for example); and,
- an optimization routine leveraging training data to find a solution to the optimization criterion.
The loss function computes the error for a single training example, while the cost function is the average of the loss
functions of the entire training set. In a nutshell, we can say that the loss function is a part of the cost function.
Gradient descent is one of the most frequently used optimization algorithms used in cases where the optimization
criterion is differentiable. It can be used to find optimal parameters for linear and logistic regression, SVM and also neural
networks. For many models the optimization criterion is convex: convex functions have only one minimum, which is
global. Optimization criteria for neural networks are not convex, but in practice even finding a local minimum suffices.
More specifically, gradient descent is an iterative optimization algorithm for finding the minimum of a function. It
iteratively moves in the direction of steepest descent, as defined by the negative of the gradient. In machine learning,
we use gradient descent to update the parameters of our model. In a nutshell, as the name suggests, gradient means
“slope” and descent implies a downward movement. Gradient descent, then, follows the slope of the curve and descends to
the combination of parameters of the function that gives the minimum value.

How Gradient Descent Works: A Linear Regression Example

o univariate linear regression: h(x; θ) = θ0 + θ1·x
o GOAL: select the “best” parameter values θ0, θ1, i.e. determine the parameter vector θ = (θ0, θ1)

There are two methods to determine the parameters of a linear regression:

1. Analytical solution based on the closed formula: θ* = (X^T X)^(-1) X^T y;
2. Numerical solution (e.g. gradient descent)

How does it work?

Fit the model h(x; θ) = θ0 + θ1·x by selecting the right parameter values θ0, θ1:
1. Hypothesis: h(x; θ) = θ0 + θ1·x
2. Parameters: θ0, θ1 (each θi is in R)
3. Cost function: J(θ0, θ1) = (1/2m) · Σ_{i=1..m} (h(x^(i); θ) − y^(i))² → minimization of the sum of squared residuals
4. Goal: minimize J(θ0, θ1) over θ0, θ1
Let’s assume that we have a dataset with 3 observations. The (x, y) coordinates of the 3 observations are: (1,1),
(2,2), (3,3). In the following, our hypothesis h(x; θ) = θ0 + θ1·x simplifies to h(x; θ) = θ1·x, i.e. we assume the
intercept θ0 is zero. In this simple example, we are looking for the optimal slope able to fit the observed data points:

J(θ1) = (1/2m) · Σ_{i=1..m} (h(x^(i); θ) − y^(i))² = (1/2m) · [(θ1·x^(1) − y^(1))² + (θ1·x^(2) − y^(2))² + (θ1·x^(3) − y^(3))²]

Attempt 1: θ1 = 1
J(1) = (1/6) · [(1 − 1)² + (2 − 2)² + (3 − 3)²] = 0

Attempt 2: θ1 = 0.75
J(0.75) = (1/6) · [(0.75 − 1)² + (1.5 − 2)² + (2.25 − 3)²] ≈ 0.15

Attempt 3: θ1 = 0.5
J(0.5) = (1/6) · [(0.5 − 1)² + (1 − 2)² + (1.5 − 3)²] ≈ 0.58

GOAL: minimize J(θ1)
ANSWER: θ1 = 1

The gradient descent algorithm works in the following way: repeat until convergence

θj := θj − α · ∂J(θ0, θ1)/∂θj   (simultaneously for j = 0 and j = 1)

o why minus? The aim is to pick the values of θ0, θ1 that minimize the cost function J, so at each step we move in the
direction of the negative gradient for each value of θ;
o what is α? The learning rate.

The importance of α (the Learning Rate): alpha represents the “learning rate” of the algorithm. It is the scalar by which
the gradient is multiplied, also known as the step size, and it determines the next point: for instance, if the gradient
magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away
from the previous point.
It is important to choose an appropriate value of the learning rate α (see the sketch after this list):
o if the learning rate α is too high, we may end up bouncing between two points and may never reach the minimum;
o if the value of α is too low, we will for sure reach the minimum but we will proceed towards it very slowly, and
thus the algorithm may take a very long time to converge.
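A minimal NumPy sketch of the update rule on the three-observation example above (θ0 fixed at zero; the step count and learning rate are arbitrary choices for illustration):

import numpy as np

# toy dataset from the worked example: (1,1), (2,2), (3,3)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

theta1 = 0.0      # initial slope
alpha = 0.1       # learning rate (try 1.0 to see divergence, 0.001 to see very slow convergence)

for step in range(100):
    predictions = theta1 * x
    gradient = (1.0 / m) * np.sum((predictions - y) * x)   # dJ/dtheta1 for the squared-error cost
    theta1 = theta1 - alpha * gradient                      # move against the gradient

print(theta1)   # converges towards the optimal slope 1.0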

How to implement gradient descent in Python: use scikit-learn, as sketched below.
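One option in scikit-learn is SGDRegressor, which fits a linear model by (stochastic) gradient descent; the exact code from the slides is not reproduced here, so the settings below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.array([[1.0], [2.0], [3.0]])   # single feature, as in the worked example
y = np.array([1.0, 2.0, 3.0])

# loss="squared_error" in recent scikit-learn versions (older versions call it "squared_loss")
sgd = SGDRegressor(loss="squared_error", learning_rate="constant", eta0=0.01,
                   fit_intercept=False, max_iter=1000, random_state=0)
sgd.fit(X, y)
print(sgd.coef_)   # estimated slope, close to 1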

Variable Types: Big Data usually includes a huge variety of variables to be analysed. They can be:
o categorical: nominal (e.g. hair colour, eye colour, …) or ordinal (e.g. education level, credit rating, …);
o measurement (numerical) variables: discrete (e.g. goals scored, number of visits to a museum, …) or continuous
(e.g. time until dinner is ready, …).
Nonetheless, we must take into account the fact that most machine learning algorithms work almost exclusively with
numerical data. Therefore, we need to encode categorical features into numeric features. Popular encoding approaches
include:
o label encoding: we replace each categorical value with a numeric value between 0 and (number of classes − 1). This
method nonetheless has a major drawback: since we transform every category into a number, it can create an
appearance of a relationship between categories that does not exist (the algorithm might simply think that the
category encoded as 1 is “less” than the category encoded as 2, and so on).
o one-hot encoding: for each category of a feature, we create a new column (also called a dummy variable) with a
binary encoding to denote whether a particular row belongs to this category. For instance, columns will look like
these:
- income statement = [0,0,1]
- balance sheet = [1,0,0]
- cashflow statement = [0,1,0]
This method also has a significant drawback: since we create a new column for each category of each feature we
want to encode, we can easily face a dimensionality problem (the “curse of dimensionality”). We can use a sparse
matrix representation to deal with this problem; a sketch of both encodings follows below.
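A minimal sketch of both approaches with pandas and scikit-learn (the financial-statement categories are taken from the example above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({"statement": ["balance sheet", "cashflow statement", "income statement"]})

# label encoding: one integer per category (may suggest a spurious ordering)
labels = LabelEncoder().fit_transform(df["statement"])

# one-hot encoding: one binary column per category; the output is a sparse matrix by default
onehot = OneHotEncoder().fit_transform(df[["statement"]])

# pandas equivalent of one-hot encoding (dense dummy columns)
dummies = pd.get_dummies(df["statement"])
print(labels, dummies, sep="\n")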
Data Transformation: Data transformation, or scaling, is a process of adjusting the range of a feature by shifting and
changing the scale of the data. Variables such as age and income can have very different ranges, resulting in a
heterogeneous training dataset. Many machine learning algorithms benefit from rescaling all the attributes to the same
scale, which can be accomplished through data transformation. There are different data transformation techniques:
o min-max scaling: to min-max rescale a variable, the minimum value X_min is subtracted from each observation X_i,
and this difference is then divided by the range of X (the difference between the maximum and the minimum values):

X_i (min-max scaled) = (X_i − X_min) / (X_max − X_min)

The function MinMaxScaler rescales the data set such that all feature values are in the range [0,1].

o standard scaling: standardization (standard scaling, or z-score normalization) is the process of both centering and
scaling the variables. Centering involves subtracting the mean of the variable from each observation X_i so the new
mean is 0; scaling, instead, adjusts the range of the data by dividing the centered values by the standard deviation
of feature X. The resulting standardized variable will have a mean of 0 and a standard deviation of 1. This
technique is most suitable when we assume that the input variables follow a normal distribution.

X_i (standard scaled) = (X_i − μ) / σ

StandardScaler removes the mean and scales the data to unit variance.

o robust scaling: this process works similarly to standard scaling in that it ensures statistical properties for each
feature that guarantee they are on the same scale. However, robust scaling uses the median and quartiles instead
of the mean and variance. This makes the robust scaler ignore data points that are very different from the rest
(such as measurement errors). These odd data points are called outliers and can cause trouble for other scaling
techniques. Robust scaling, then, removes the median and scales the data according to a quantile range (by
default the IQR, the interquartile range between the 1st and the 3rd quartile):

X_i (robust scaled) = (X_i − median) / (p75 − p25)

o sample-wise L2 normalizing: in comparison to the previous three scalers, the Normalizer does a very different kind
of rescaling. Sample-wise L2 normalizing refers to rescaling each observation (row) to have unit norm, i.e. it scales
each data point such that its feature vector has a Euclidean length of 1. This method of scaling is often used when
only the direction (or angle) of the data matters, not the length of the feature vector.

||X||_2 = sqrt(X_1² + X_2² + X_3² + …)

Normalizer rescales the vector for each sample to have unit norm, independently of the distribution of the samples.
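A sketch comparing the four scalers on a toy feature matrix (the age/income values are invented for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, Normalizer

X = np.array([[25.0, 30000.0],
              [40.0, 55000.0],
              [60.0, 250000.0]])            # e.g. age and income on very different scales

print(MinMaxScaler().fit_transform(X))      # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))    # each column: mean 0, unit variance
print(RobustScaler().fit_transform(X))      # median/IQR based, less sensitive to outliers
print(Normalizer().fit_transform(X))        # each ROW rescaled to unit Euclidean length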

o binning: one way to make linear models more powerful on continuous data is to use binning (which basically
consists of splitting a variable’s values into intervals), also known as discretization, to split the feature up into
multiple features (one per interval), as sketched below.
Recall that: Y = β0 + β1·X1 + β2·X2 + β3·X1·X2 + ε, where X1 and X2 are the values of the two features and X1·X2
represents the interaction between the two. It can be useful to use scikit-learn’s PolynomialFeatures to create
interaction terms for all combinations of features.
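A sketch of binning and interaction/polynomial terms with scikit-learn (the number of bins and the degree are arbitrary choices):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

X = np.random.RandomState(0).uniform(-3, 3, size=(100, 2))

# binning: split each continuous feature into 5 one-hot encoded intervals
binned = KBinsDiscretizer(n_bins=5, encode="onehot-dense").fit_transform(X)

# polynomial/interaction terms up to degree 2: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(binned.shape, poly.shape)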

o polynomials: polynomial regression is a form of regression analysis in which the relationship between the
independent variable x and the dependent variable y is modelled as an nth-degree polynomial in x. Using polynomial
features together with a linear regression model yields the classical model of polynomial regression. For a given
feature x, we might want to consider x², x³, x⁴ and so on.

o univariate nonlinear transformation: while adding squared or cubed features can help linear models for
regression, there are other transformations that often prove useful for certain features: in particular, applying
mathematical functions like log, exp or sin. In finance, it is common to apply the log transformation. For example,
firm age is usually modelled as log(firm_age + 1). The “plus one” term appears because firms right after the IPO are
younger than one year and the logarithm is not defined at 0. Firm size is also commonly measured as
log(book_value_assets). Such transformations can improve the inference, but they need to be applied using expert
judgment.
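For example, a short sketch of these transformations with NumPy and pandas (column names and values are hypothetical):

import numpy as np
import pandas as pd

firms = pd.DataFrame({"firm_age": [0, 3, 12], "book_value_assets": [1e6, 5e7, 2e9]})
firms["log_age"] = np.log1p(firms["firm_age"])            # log(firm_age + 1)
firms["log_size"] = np.log(firms["book_value_assets"])    # log of the book value of assets
print(firms)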

LECTURE 2 – METHODS FOR SUPERVISED ML

The primary aim of Machine Learning is to apply an algorithm to a specific data set in order to infer the pattern
between the inputs and outputs. This specific algorithm is known as the training algorithm.
Once the algorithm has been trained, the inferred pattern can be used to predict output values based on new inputs (i.e.
ones not included in the training data set). As a consequence, we always split our working data set into a training set
(the part of the data on which we apply the training algorithm) and a test set (the part of the data on which we test our
model’s performance). Why do we need to split the data? In order to avoid overfitting: the latter refers to learning a
function that perfectly explains the training data the model learned from but does not generalize well to unseen data.
Data splitting, then, is a very important step in the analysis: if done improperly, data leakage can occur, through which
one can introduce biases into the data. One example of bad practice is to replace missing values with the feature’s
average prior to splitting the data.

Do you see any issues that could arise in the context of finance if the basic split is used? In such a case, we should think
about time-series or imbalanced data. More specifically, when it comes to time-series data, the order of observations
matters. In such a case, then, it is extremely important to pass shuffle = False to the splitting function (otherwise the
split will be completely random and could mess up the temporal order of observations).

Do you see any issues that could arise if we have imbalanced data? In such a case, think of credit card theft data, where
99% of the data may be normal and only 1% contains fraudulent transactions. Here we should make use of the stratify
option whenever we want to ensure that both the training and the test sets have approximately identical distributions of
the specified variable. This is very important when using non-balanced datasets. We can, for instance, use a dataset from
the UCI machine learning repository and run code like the sketch below:
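The code from the slides is not reproduced here; a minimal sketch of a stratified split on a synthetic imbalanced target looks like this:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.RandomState(0).normal(size=(1000, 5))
y = np.array([0] * 990 + [1] * 10)                  # 99% "normal", 1% "fraud"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())                # both close to the overall 1% rate
# for time-series data, use shuffle=False instead of stratify to preserve the temporal order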

In some cases, the data are split into 3 parts rather than 2. The part that we have not mentioned is known as the
validation set. If you have a model with no hyperparameters, or with parameters that cannot easily be tuned, you most
likely won’t need a validation set. The validation set is a sample of data held back from training your model that is used
to give an estimate of model skill while tuning the model’s hyperparameters. It is sometimes also called the development
set or the “dev set”.

Classification vs. Regression: Classification is the task of predicting a discrete class label, whereas regression is the task of
predicting a continuous quantity. However, both share the same concept of utilizing known (labelled) variables to make
predictions, and there is significant overlap between the two. Some models can be used for both classification
and regression with small modifications. Examples include: k-nearest neighbours, decision trees, support vector machines
(SVMs), ensemble methods, and ANNs (including deep neural networks). However, some models, such as linear regression
and logistic regression, cannot (or cannot easily) be used for both problem types.
The most prominent supervised ML algorithms include:
o Linear Regression;
o Regularized Regression;
o Logistic Regression;
o K-nearest neighbour;
o Support Vector Machines (SVMs)
o Naïve Bayes Classifiers;
o CART;
o ANN-Based models.

Why are there so many models? Each model is a simplification of reality and is naturally based on assumptions. These
assumptions can fail in certain situations. Consequently, in ML no model works best for all possible situations (this is also
known as the no free lunch theorem). The most prominent uses of supervised ML algorithms include, for instance:
o Credit default predictions;
o Derivative pricing;
o Robo-advisory;
o Stock price prediction;
o Asset allocation.

Linear Regression: Linear regression, or ordinary least squares (OLS), is a linear model, i.e. a model that assumes a linear
relationship between the input variables (X) and the single output variable (Y). Our model will be a function that predicts
Y given X_1, X_2, …: Y = β0 + β1·X1 + … + βi·Xi, where β0 is called the intercept and β1, …, βi are the coefficients of the
regression.
In a data set with a single feature, the previous equation simplifies to: Y = β0 + β1·X1.

One proceeds in 2 steps:


1. Define a loss function: the loss function measures how inaccurate the model’s predictions are. The residual sum of
squares (RSS), which measures the squared sum of the differences between the actual and predicted values, is the
loss function for linear regression:

RSS = Σ_{i=1..m} (ε^(i))² = Σ_{i=1..m} (y^(i) − ŷ^(i))²

2. Find the parameters that minimize loss: mathematically, we look at the difference between each real data point
(y) and our model’s prediction (ŷ), square these differences to avoid negative numbers and penalize larger
differences, and then add them up and take the average. This is a measure of how well our data fits the line.
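A minimal scikit-learn sketch on synthetic data (the coefficients and noise level are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ols = LinearRegression().fit(X_train, y_train)
print(ols.intercept_, ols.coef_)      # estimated beta_0 and beta_1 ... beta_3
print(ols.score(X_test, y_test))      # R^2 on the test set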

Main strengths:
o easy to understand and interpret;
o linear regression has no hyperparameters to set, which is a benefit, but it also means there is no way to control
model complexity.
Main weaknesses:
o prone to overfitting;
o sensitive to multicollinearity among the predictors;
o does not work well when there is a non-linear relationship between the predicted and predictor variables;
o no way to control model complexity.

Main parameters: none

Penalized Regression: Linear models are frequently favoured due to their interpretability and often good predictive
performance. Yet, the ordinary least squares estimator faces challenges:
o interpretability: OLS cannot distinguish variables with little or no influence. These variables distract from the
relevant regressors;
o overfitting: OLS works well when the number of observations m is much bigger than the number of predictors p. If m
is very close to p, overfitting results in low accuracy on unseen observations; whereas if m < p, the variance of the
estimates is infinite and OLS fails. As a remedy, one can identify only the relevant variables by feature selection.
The main idea of regularized regression is to fit a linear model with least squares but impose constraints on the coefficients.
Why? Regularization means explicitly restricting a model to avoid overfitting. Simply put, it is a penalty mechanism that
applies shrinkage to model parameters. The common ways (also known as shrinkage methods) to regularize a linear
regression model include:
o L2 regularization or Ridge regression: ridge regression can shrink parameters close to zero;
o L1 regularization or LASSO regression: LASSO models can shrink some parameters exactly to zero;
o Elastic net: combines the LASSO and ridge regression penalties.
Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same as the one
used for OLS. Ridge regression performs L2 regularization by adding a penalty factor to the cost function used in linear
regression. Mathematically, ridge penalizes the L2 norm of the coefficients or the Euclidean length of the coefficient vector.
The goal basically is to minimize:

RSS + λ · Σ_{j=1..p} βj²

where the sum of squared coefficients, together with lambda, makes up the shrinkage penalty.

The tuning or regularization parameter λ > 0 controls the relative impact of the penalty. The penalty term (λ) regularizes the
coefficients such that if the coefficients take large values, the optimization function is penalized. Ridge regression shrinks
the coefficients and helps to reduce the model’s complexity. Shrinking the coefficients leads to a lower variance and a
lower error value. Therefore, ridge regression decreases the complexity of a model but does not reduce the number of
variables; it just shrinks their effect. Ridge regression makes a trade-off between the simplicity of the model (near-zero
coefficients) and its performance on the training set.

Note: when fitting a ridge regression in practice, you must set the value of λ yourself (in scikit-learn's Ridge it is passed as the alpha argument); a sketch follows below.
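A minimal sketch on synthetic data (the alpha value of 1.0 is an arbitrary illustration):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + rng.normal(size=100)    # only the first feature really matters

ridge = Ridge(alpha=1.0).fit(X, y)          # alpha plays the role of lambda
print(ridge.coef_)                          # coefficients shrunk towards, but not exactly to, zero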

Main strengths:
o ridge regression can reduce the variance (at the cost of an increased bias): it works best in situations where the OLS
estimates have high variance;
o can improve predictive performance;
o works even in situations where the number of predictors p is larger than the number of observations m;
o involves mathematically simple computations.

Main weaknesses:
o ridge regression is not able to shrink coefficients to exactly zero and, consequently, cannot perform a variable
selection;
o with enough training data, regularization becomes less important and linear regression catches up with ridge in
the end.

Main parameter: λ
o if λ is equal to zero, the cost function becomes identical to the linear regression cost function;
o increasing λ forces coefficients to move more towards zero, which decreases the training set performance but
might help generalization.
L1 Regularization or LASSO Regression: LASSO stands for Least Absolute Shrinkage and Selection Operator and performs
regularization by adding a factor of the sum of the absolute values of the coefficients to the objective function (RSS) of the
linear regression. Mathematically, LASSO penalizes the L1 norm of the coefficients. The goal is to minimize:

RSS + λ · Σ_{j=1..p} |βj|

The consequence of L1 regularization is that when using LASSO some coefficients are exactly zero. This can be seen as a
form of automatic feature selection: having some coefficients be exactly zero often makes a model very easy to interpret,
and can reveal the most important features of your model. The larger the value of λ, the more features are shrunk to zero.
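A minimal sketch mirroring the ridge example (the alpha value is again an arbitrary choice):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + rng.normal(size=100)     # only the first feature really matters

lasso = Lasso(alpha=0.5).fit(X, y)           # alpha plays the role of lambda
print(lasso.coef_)                           # most coefficients are exactly zero
print((lasso.coef_ != 0).sum())              # number of features the model kept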

Main strengths:
o LASSO usually results in sparse models (most coefficients are reduced exactly to zero), which are easier to interpret;

Main weaknesses:
o LASSO cannot do group selection. If there is a group of variables among which the pairwise correlations are very
high, then the LASSO tends to arbitrarily select only one variable from the group;

Main parameter: λ
o if λ is too low, however, we remove the effect of regularization and end up overfitting, with a result similar to a
linear regression.

Elastic Nets: Elastic nets also add a regularization term to the model: the penalty term is a combination of both L1 and L2
regularization. Because the elastic net combines two penalization methods, it involves tuning two parameters, usually
labelled λ and α.

Main strengths:
o Elastic Net reduces the impact of different features while not eliminating all of the features;
o Elastic Net produces a sparse model with good prediction accuracy, while allowing for a grouping effect. It has the
ability to perform grouped selection;
o appropriate for the p >> m problem

Main weaknesses:
o selection of two parameters necessary;

Main parameters: λ and α

Logistic Regression Despite its name, the logistic regression is a classification algorithm and not a regression algorithm,
and it should not be confused with linear regression. The two most common linear classification algorithms are logistic
regression implemented through the command linear_model.LogisticRegression, and linear support vector machines (linear
SVMs), implemented through the command svm.LinearSVC (SVC stands for support vector classifier).
The logistic regression model arises from the desire to model the probabilities of the output classes with a function that is
linear in x, while at the same time ensuring that the output probabilities sum up to one and remain between zero and one,
as we would expect from probabilities.
Similar to linear regression, logistic regression can use regularization, which can be L1, L2 or elastic net. The corresponding
values of the penalty argument in the sklearn library are ['l1', 'l2', 'elasticnet'].
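A minimal sketch with an L2 penalty (synthetic, roughly linearly separable data; C and the solver defaults are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # binary target driven by a linear rule

clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)   # C is the inverse regularization strength
print(clf.predict_proba(X[:5]))               # class probabilities, useful for ranking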

Main strengths:
o Easy to implement, has good interpretability, and performs very well on linearly separable classes;
o The output of the model is a probability, which provides more insights and can be used for ranking;

Main weaknesses:
o The model may overfit when provided with large numbers of features;
o Logistic regression can only learn linear functions and is less suitable to complex relationships between features
and the target variable;
o Also, it may not handle irrelevant features well, especially if the features are strongly correlated.

K-nearest neighbours (KNN): The k-nearest neighbours (KNN) algorithm is arguably the simplest machine learning algorithm.
Building the model consists only of storing (memorizing) the training data set. To make a prediction for a new data point,
the algorithm finds the closest data points in the training dataset, its “nearest neighbours”. The key ingredient is the
distance measure.

KNN is considered a lazy learner, as there is no learning required in the model: KNN does not learn any function from the
training data but memorizes the training dataset. Lazy learning is, indeed, defined as an algorithm that simply stores the
training data and waits until it is given new data to assess; an eager learning algorithm is one that, given a set of
training data, constructs a model before receiving new data to assess. To determine which of the K instances in the training
dataset are most similar to a new input, a distance measure is used. Two common choices of distance measure include:
o Euclidean distance: a good distance measure to use if the input variables are similar in type;
o Manhattan distance: a good measure to use if the input variables are not similar in type.

The KNN method can be used both for classification and for prediction (regression). For prediction, the algorithm returns
the average of the target values of the k nearest neighbours. For classification, instead, KNN takes the majority vote of the
class labels among the k nearest neighbours; if ties are present, they are broken arbitrarily.
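A minimal sketch with scikit-learn (synthetic data with a non-linear class boundary; five neighbours is an arbitrary but common choice):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # non-linear decision boundary

X_scaled = StandardScaler().fit_transform(X)         # scaling matters for distance-based methods
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_scaled, y)
print(knn.predict(X_scaled[:5]))                     # majority vote among the 5 nearest neighbours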

Main strengths:
o No training is involved and hence there is no learning phase;
o New data can be added seamlessly without impacting the accuracy of the algorithm;
o Intuitive and easy to understand;

Main weaknesses:
o When your training set is very large (either in number of features or in number of samples) prediction can be
slow;
o Feature scaling is required before applying the KNN algorithm to any dataset;
o KNN does particularly badly with datasets where most features are 0 most of the time (so-called sparse data sets)

Main parameters:
o Number of neighbours (n_neighbors): in practice, using a small number of neighbours like three or five often works
well, but you should certainly adjust this parameter;
o Distance metric (metric): choosing the right distance measure is beyond the scope of this class. By default,
Euclidean distance is used, which works well in many settings.

LECTURE 3 – METHODS FOR SUPERVISED ML (CONTINUED)

Support Vector Machines (SVM): The objective of the support vector machine (SVM) algorithm is to maximize the margin,
which is defined as the distance between the separating hyperplane (or decision boundary) and the training samples that
are closest to this hyperplane, the so-called support vectors. While there are support vector machines for both classification
and regression, we will analyse the classification case in detail.

During training, the SVM learns how important each of the training data points is to represent the decision boundary
between the two classes. Typically, only a subset of the training points matters for defining the decision boundary: the ones
that lie on the border between the classes. These are called support vectors. To make a prediction for a new point, the
distance to each of the support vectors is measured. A classification decision is made based on the distances to the support
vectors and the importance of the support vectors learned during training.
What happens if we have a two-class classification dataset in which the classes are not linearly separable? SVM is, indeed,
quite intuitive when the data set is linearly separable. Nonetheless, when it is not, SVM can be extended to perform well.
There are two main steps for the non-linear generalization of SVM:
1. the first step involves the transformation of the original training (input) data into higher-dimensional data using a
non-linear mapping;
2. the second step involves finding a linear separating hyperplane in the new space. The maximal margin
hyperplane found in the new space corresponds to a non-linear separating hypersurface in the original space.

In geometry, a hyperplane is a subspace whose dimension is one less than that of its ambient space (D−1). In ML,
hyperplanes are decision boundaries that help classify the data points. The dimension of the hyperplane depends upon the
number of features:
o if the number of input features is 2 (i.e. in a 2-D space), then the hyperplane is just a line;
o if the number of input features is 3 (i.e. in a 3-D space), then the hyperplane is a plane.
In some cases, it is not possible to find a hyperplane or a linear decision boundary, and kernels are used. Kernelized
support vector machines (often just referred to as SVMs) are an extension that allows for more complex models that are
not defined simply by hyperplanes in the input space. A kernel is just a transformation of the input data that allows the
SVM algorithm to treat/process the data more easily. Using kernels, the original data are projected into a higher dimension
to classify the data better.

There are two ways to map your data into a higher-dimensional space that are commonly used with support vector
machines:
o polynomial kernel, which computes all possible polynomials up to a certain degree of the original features;
o radial basis function (RBF) kernel, also known as the Gaussian kernel.
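A minimal sketch of a kernelized SVM on data that are not linearly separable (C and gamma are left at common illustrative values):

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # classes separated by a circle, not a line

X_scaled = StandardScaler().fit_transform(X)          # SVMs require feature scaling
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_scaled, y)
print(svm.support_vectors_.shape)                     # only these points define the decision boundary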
Main strengths:
o SVM is fairly robust against overfitting, especially in higher dimensional space;
o Kernel trick allows for non-linear generalizations, with many kernels to choose from;
o There are no distributional requirements for the data.
Main weaknesses:
o SVM can be inefficient to train and memory-intensive to run and tune;
o it doesn’t perform well with large datasets;
o it requires the feature scaling of the data;
o there are also many hyperparameters, and their meanings are often not intuitive.

Naïve Bayes Classifiers: Naïve Bayes classifiers are a collection of classification algorithms based on Bayes’ theorem. It is
not a single algorithm but rather a family of algorithms that all share a common principle, i.e. every pair of features being
classified is independent of the others (hence the “naïve” label).
We will cover three kinds of naïve Bayes classifiers implemented in scikit-learn:
o Gaussian Naïve Bayes: GaussianNB
o Bernoulli Naïve Bayes: BernoulliNB
o Multinomial Naïve Bayes: MultinomialNB

Remember that Bayes’ theorem states the following:

P(A|B) = P(A) · P(B|A) / P(B)

A, B = events
P(A|B) = probability of A happening given that B is true (conditional probability)
P(B|A) = probability of B given that A is true (conditional probability)
P(A), P(B) = the marginal probabilities of A and B (unconditional probabilities)

The different naïve Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(B|A).
In general, each type of Bayes naïve classifier can be applied (or is better suited to be applied) to different types of data:
o Gaussian Naïve Bayes: can be applied to any continuous data;
o Multinomial Naïve Bayes: can only be applied to count data. This algorithm is suitable for classification with
discrete features. The multinomial distribution normally requires integer feature counts;
o Bernoulli Naïve Bayes: can only be applied to binary data. Like MultinomialNB, this classifier is suitable for
discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed
for binary/Boolean features.
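A minimal sketch of the Gaussian variant on continuous synthetic features:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 3))             # continuous features -> Gaussian naive Bayes
y = (X[:, 0] > 0).astype(int)

nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:3]))            # posterior P(class | features) via Bayes' theorem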

Decision Trees Decision trees are widely used models for classification and regression tasks. Essentially, they learn a
hierarchy of if/else questions, leading to a decision. They are also known as CART: Classification and Regression Trees
Models.
The top node, also called the root, represents the whole dataset. Each node in the tree either represents a question or is a
terminal node (also called a leaf) that contains the answer. The edges connect the answer to a question with the next
question you would ask.

Building decision trees: Data often come in the form of binary yes/no features or can be represented as continuous
features. The questions used on continuous data are of the form “is the feature larger than value a?”. In the machine
learning setting, these questions are called tests (not to be confused with the test set). Building a tree is a recursive process
that yields a binary tree of decisions, with each node containing a test. The recursive partitioning of the data is repeated
until each region in the partition (each leaf in the decision tree) only contains a single target value (a single class or a single
regression value). A leaf of the tree that contains data points that all share the same target value is called pure.

Typically, building a tree as described and continuing until all leaves are pure leads to models that are very complex and
highly overfit to the training data. Two common strategies to prevent overfitting are:
o pre-pruning: stopping the creation of the complex tree earlier, before it perfectly classifies the training set.
Possible criteria for pre-pruning include:
- limiting the maximum depth of the tree ( max_depth);
- limiting the maximum number of leaves (max_leaf_nodes);
- the minimum number of samples required to be at a leaf node. A split point at any depth will only be considered
if it leaves at least min_samples_leaf training samples in each of the left and right branches (min_samples_leaf)
o post-pruning: building a complex tree but then removing or collapsing nodes that contain little to no information.
Decision trees in scikit-learn are implemented in the DecisionTreeRegressor and DecisionTreeClassifier classes. Scikit-learn,
additionally, only implements pre-pruning, and not post-pruning. If we do not restrict the depth of a decision tree, the tree
can become arbitrarily deep and complex. Unpruned trees are therefore prone to overfitting and not generalizing well to
new data.
Instead of looking at the whole tree, which can be taxing, there are some useful properties we can derive to summarize its
workings. The most commonly used summary is feature importance, which rates how important each feature is for the
decisions a tree makes. It is a number between 0 and 1 for each feature, where 0 means “not used at all” and 1 means
“perfectly predicts the target”. The feature importances always sum to 1. However, if a feature has a low
feature_importance, it doesn’t mean that this feature is uninformative; it only means that the feature was not picked by the
tree, likely because another feature encodes the same information.
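A minimal sketch with pre-pruning and feature importances (synthetic data; the pruning values are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)     # target driven by two of the four features

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)
print(tree.feature_importances_)                    # sums to 1; irrelevant features get ~0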

Main strengths:
o decision trees have two advantages over many of the algorithms we have discussed so far:
- the resulting model can easily be visualized and understood by non-experts (at least for smaller trees), and;
- the algorithms are completely invariant to scaling of the data.

Main weaknesses:
o the main downside of decision trees is that even with the use of pre-pruning, they tend to overfit and provide poor
generalization performance. Therefore, in most applications, the ensemble methods we discuss next are usually
used in place of a single decision tree;

Main parameters:
o the most common pre-pruning strategies are: max_depth, max_leaf_nodes, min_samples_leaf;

Ensembles of Decision Trees Ensembles are methods that combine multiple machine learning models to create more
powerful models. Two ensemble models that have proven to be effective on a wide range of datasets for classification and
regression are:
o random forests;
o gradient boosted trees.
A random forest is essentially a collection of decision trees, where each tree is slightly different from the others.
Random forests are one way to address the problem of overfitting commonly observed in decision trees. The idea behind
random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we
build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging
their results. This reduction in overfitting, while retaining the predictive power of the trees, can be shown using rigorous
mathematics. There are two ways in which the trees in a random forest are randomized: by selecting the datapoints used to
build a tree and by selecting the features in each split test. To build a random forest model, you need to decide on the
number of trees to build (the n_estimators parameter of RandomForestRegressor or RandomForestClassifier).
1. to build a tree, we first take what is called a bootstrap sample of our data. That is, from our n_samples datapoints,
we repeatedly draw an example randomly with replacement (meaning the same sample can be picked multiple
times), n_samples times;
2. next, a decision tree is built based on this newly created dataset. However, the algorithm we described for the
decision tree is slightly modified. Instead of looking for the best test for each node, in each node the algorithm
randomly selects a subset of the features, and it looks for the best possible test involving one of these features. The
number of features that are selected is controlled by the max_features parameter. This selection of a subset of
features is repeated separately in each node, so that each node can make a decision using a different subset of the
features. The bootstrap sampling leads to each decision tree in the random forest being built on a slightly different
dataset. Because of the selection of features in each node, each split in each tree operates on a different subset of
features. Together, these two mechanisms ensure that all the trees in the random forest are different.

What happens if we set max_features to n_features? It means that each split can look at all features in the dataset, and no
randomness will be injected in the feature selection (the randomness due to the bootstrapping remains, though).

To make a prediction using the random forest, the algorithm first makes a prediction for every tree in the forest:
o for regression, we can average these results to get our final prediction;
o for classification, a “soft voting” strategy is used. This means each tree makes a soft prediction, providing a
probability for each possible output label. The probabilities predicted by all the trees are averaged, and the class
with the highest probability is predicted.
Apart from adjusting the max_features setting, we can also apply pre-pruning as we did for a single decision tree.
The random forest overfits less than any of the trees individually, and provides a much more intuitive decision boundary.
In any real application, we would use many more trees (often hundreds or thousands), leading to even smoother
boundaries. Similarly to a single decision tree, the random forest provides feature importances, which are computed by
aggregating the feature importances over the trees in the forest. The random forest gives non-zero importance to many
more features than the single tree.
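A minimal sketch (same synthetic data as in the decision tree example; 100 trees is an arbitrary choice):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0).fit(X, y)
print(forest.feature_importances_)          # aggregated over the 100 trees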

Main strengths:
o currently among the most widely used machine learning methods;
o often work well without heavy tuning of the parameters and don’t require scaling of the data;

Main weaknesses:
o building random forests on large datasets might somewhat be time-consuming;
o random forests don’t tend to perform well on very high dimensional, sparse data, such as text data;
Main parameters
o the important parameters to adjust are n_estimators, max_features, and possibly pre-pruning options like
max_depth. For n_estimators, larger is always better: averaging more trees will yield a more robust ensemble by
reducing overfitting. However, there are diminishing returns, and more trees need more memory and more time
to train;
o you can use the n_jobs parameter to adjust the number of cores to use;
o a common rule of thumb is to build “as many trees as you have time/memory for”;

Caution
o if you want to have reproducible results, it is important to fix the random_state

Gradient Boosted Trees Gradient boosting is considered a gradient descent algorithm. These models can be used for both
regression and classification. In contrast to the random forest approach, gradient boosting works by building trees in a
serial manner, where each tree tries to correct the mistakes of the previous one. By default, there is no randomization in
gradient boosted regression trees; instead, strong pre-pruning is used. Gradient boosted trees often use very shallow trees,
of depth one to five, which makes the model smaller in terms of memory and makes prediction faster.
The main idea is that as more and more trees are added, we can iteratively improve the performance of the algorithm.

Correcting the model’s mistakes: apart from the pre-pruning and the number of trees in the ensemble, another important
parameter of gradient boosting is the learning_rate, which controls how strongly each tree tries to correct the mistakes of
the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex
models. Adding more trees to the ensemble, which can be accomplished by increasing n_estimators, also increases the
model complexity, as the model has more chances to correct mistakes on the training set.
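A minimal sketch (same synthetic data as above; the shallow depth and learning rate follow the common defaults described in the text):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

gbrt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0).fit(X, y)
print(gbrt.score(X, y))     # training accuracy; compare with a held-out test set to spot overfitting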

Main strengths:
o it belongs to the most powerful and widely used models for supervised ML;
o the algorithm works well without scaling and on a mixture of binary and continuous features;
Main weaknesses:
o as with other tree-based models, it also often does not work well on high-dimensional sparse data;
Main parameters
o the main parameters are the number of trees, n_estimators and the learning_rate;
o in contrast to random forests, where a higher n_estimators value is always better, increasing the parameter in
gradient boosting leads to a more complex model, potentially leading to overfitting.

LECTURE 4 – NEURAL NETWORKS

The family of algorithms known as neural networks is so called because these algorithms are loosely inspired by
neuroscience. They are constructed so as to have different types of input features and predictions, which are shown as
nodes, and coefficients (weights), which are the connections between the nodes.
An ANN architecture, overall, comprises:
o input layers;
o hidden (computation) layers, composed of hidden units;
o output layers
A family of algorithms known as neural networks has recently seen a revival under the name “deep learning”. Today, we
will only discuss some relatively simple methods, namely multilayer perceptrons for classification and regression, that can
serve as a starting point for more involved deep learning methods. Multilayer perceptrons (MLPs) are also known as
vanilla (feed-forward) neural networks, or sometimes just neural networks. Feed-forward neural networks are called
networks because they are typically represented by composing together many different functions. For example, we might
have three functions f1, f2 and f3 connected in a chain, to form f(x) = f3(f2(f1(x))). These chain structures are the most
commonly used structures of neural networks. In this case, f1 is called the first layer of the network, f2 is called the second
layer, and so on. The overall length of the chain gives the depth of the model.
The most-commonly used types of neural networks are the following:
o feedforward neural networks used to perform basic pattern and image recognition;
o recurrent neural networks used in natural language processing and speech recognition;
o convolutional neural networks used in object recognition and video analysis.

Feed-forward network Feed-forward networks have the following characteristics:


o perceptrons are arranged in layers, with the first layer taking in inputs and the last layer producing outputs. The
middle layers have no connection with the external world, and hence are known as hidden layers;
o each perceptron in one layer is connected to every perceptron on the next layer. Hence, information is constantly
“fed-forward” from one layer to the next, and this explains why these networks are called feed-forward networks;
o there is no connection among perceptrons in the same layer.

Training a neural network basically means calibrating all of the weights in the ANN. This optimization is performed using
an iterative approach involving forward propagation and backpropagation steps. The goal is to optimize the loss function
(i.e. make the loss as small as possible) over the training set. Feed-forward or forward propagation is a process of feeding
input values to the neural network and getting an output, which we call predicted value. The feed process repeats for all
layers until an output value from the last layer is received. Backpropagation, instead, refers to a process happening after
forward propagation, when we already have a predicted value from the ANN. The difference between the predicted output
and the desired output is converted into the loss (or cost) function J(w) , where w represents the weights in ANN.
If there are multiple layers, then we say that the network is deep. Usually having two or more hidden layers counts as
deep. In contrast, a network with only a single hidden layer is conventionally known as “shallow”.

Deep Learning (DL) applications in Finance Some applications include:

- option pricing with deep learning ;


- using generative adversarial networks to synthesize artificial financial datasets;
- high frequency trading;
- predicting financial market movement directions:
- portfolio optimization;
- text analysis using convolutional neural networks (CNN)

Activation Function The activation function decides which neurons will be activated – that is, what information is passed
to further layers. In a nutshell, it determines whether the neuron fires or not. Activation functions (AFs) refer to the
functions used over the weighted sum of inputs in ANNs to get the desired output. Every activation function takes a single
number and performs a certain fixed mathematical operation on it. AFs allow the network to combine the inputs in more
complex ways, and they provide a richer capability in the relationship they can model and the output they can produce.
Examples of AFs include: linear (identity) functions, sigmoid functions, tanh functions and ReLU functions.
Linear AF is represented by the equation of a straight line, where activation is proportional to the input. If we have many
layers and all the layers are linear in nature, then the final activation of the last layer is just a linear function of the input
to the first layer. The range of a linear function is from −∞ to +∞: f(x) = c + m·x.
Sigmoid AF refers to a function that is projected as an S-shaped graph. It is also referred to as the logistic activation
function. The range of a sigmoid function is from 0 to 1 (very useful in order to model probabilities): f(x) = 1 / (1 + e^(−x)).
Tangens hyperbolicus (tanh) is simply a scaled sigmoid AF. Like the sigmoid AF, its activations saturate, but unlike the
sigmoid AF its output is zero-centered (i.e. the output of this function has equal mass on both sides of the zero axis).
Tanh saturates at −1 for low input values and at +1 for high input values. The range of a tanh function is from −1 to 1:
f(x) = tanh(x), where tanh(x) = 2σ(2x) − 1 and σ is the sigmoid AF covered above. For the NN shown previously, the full
formula for computing ŷ would be ŷ = v[0]·h[0] + v[1]·h[1] + v[2]·h[2], where:

o h[0] = tanh(w[0,0]·x[0] + w[1,0]·x[1] + w[2,0]·x[2] + w[3,0]·x[3])
o h[1] = tanh(w[0,1]·x[0] + w[1,1]·x[1] + w[2,1]·x[2] + w[3,1]·x[3])
o h[2] = tanh(w[0,2]·x[0] + w[1,2]·x[1] + w[2,2]·x[2] + w[3,2]·x[3])

with w being the weights between the input x and the hidden layer h, and v being the weights between the hidden layer h
and the output ŷ. The weights v and w are learned from data, x are the input features, ŷ is the computed output, and h are
intermediate computations.
The ReLU (Rectified Linear Unit) function cuts off values below zero: if the input is a positive number, the function returns
the number itself, and if the input is a negative number, the function returns zero. The range of the ReLU function is
from 0 to +∞.
Which one to use? There is no hard-and-fast rule for activation function selection. The decision completely relies on the
properties of the problem and the relationship being modelled. We can try different activation functions and select the one
that helps provide faster convergence and a more efficient training process. The sigmoid function is extremely popular for
modelling probabilities since it is bounded between 0 and 1. The tanh function is often preferred to the sigmoid
nonlinearity because, while its activations also saturate, its output is zero-centered. The ReLU function is the most
commonly used function because of its simplicity.

Complexity of a Neural Network: A helpful measure when thinking about the model complexity of a neural network is the
number of weights or coefficients that are learned. If you have a binary classification dataset with 100 features, and you
have 100 hidden units, then there are 100*100=10,000 weights between the input and the first hidden layer. There are
also 100*1=100 weights between the hidden layer and the output layer, for a total of 10,100 weights. If you add a second
hidden layer with 100 hidden units, there will be another 100*100=10,000 weights from the first hidden layer to the
second hidden layer, resulting in a total of 20,100 weights.
There are many ways to control the complexity of a neural network and they include, for instance:
o adjusting the number of hidden layers;
o adjusting the number of units in each hidden layer;
o applying regularization (α)
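A minimal sketch of these controls with scikit-learn's MLPClassifier (the layer sizes and the synthetic target are illustrative assumptions):

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 10))
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 > 1).astype(int)   # non-linear target

X_scaled = StandardScaler().fit_transform(X)            # NNs are sensitive to feature scaling
mlp = MLPClassifier(hidden_layer_sizes=(100, 100),      # two hidden layers of 100 units each
                    activation="relu",                  # could also be "tanh" or "logistic"
                    alpha=1e-4,                         # L2 regularization strength
                    max_iter=1000, random_state=0).fit(X_scaled, y)
print(mlp.score(X_scaled, y))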

Main strengths:
o captures the non-linear relationship between the variables quite well;
o given enough computation time, data, and careful tuning of the parameters, neural networks often beat other
machine learning algorithms (for classification and regression tasks);

Main weaknesses:
o the main disadvantage of ANN is the interpretability of the model;
o ANN is not good with small data sets and requires a lot of tweaking and guesswork;
o ANN is computationally expensive and can take a lot of time to train.

Main parameters
o Hidden layers: represents the number of layers and nodes in the ANN architecture;
o Activation function: represents the activation function of a hidden layer. Activation functions such as sigmoid,
relu, or tanh can be used.

Summary of Supervised ML

o Linear Models: go- to as first algorithms to try, good for very large datasets, good for very high dimensional data;
o Nearest neighbors: for small datasets, good as baseline and easy to explain;
o Support Vector Machines: powerful for medium-sized datasets of features with similar meaning. Require scaling
of data, sensitive to parameters;
o Naïve Bayes Classifiers: only for classification. Even faster than linear model, good for very large datasets and high
dimensional data. Often less accurate than linear models;
o Decision trees: very fast, don’t need scaling of the data, can be visualized and easily explained;
o Random forests: nearly always perform better than a single decision tree, very robust and powerful. Don't need
scaling of the data, but do not work well on very high-dimensional, sparse data;
o Gradient boosted decision trees: often slightly more accurate than random forests. Slower to train but faster to
predict than random forests, and smaller in memory. Need more parameter tuning than random forests;
o Neural Networks: can build very complex models, particularly for large datasets. Sensitive to scaling of the data
and to the choice of parameters. Large models need a long time to train.

Explainable AI (XAI) Explainable AI (XAI) refers to methods and techniques in the application of artificial intelligence (AI)
such that the results of the solution can be understood by humans. In your work, you might face the trade-off between
performance and explainability: simpler models are easier to explain, but more complex models tend to perform better.
Some useful tools to explain model predictions are (a minimal feature-importance sketch follows this list):
o feature importance graphs;
o LIME: Local Interpretable Model-Agnostic Explanations;
o Shapley values (SHAP).
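A minimal, illustrative sketch of a feature importance graph using a random forest (the dataset is an arbitrary example, not from the lecture):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

    # bar chart of how much each feature contributes to the forest's decisions
    plt.barh(data.feature_names, forest.feature_importances_)
    plt.xlabel("feature importance")
    plt.show()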

LAB 2 – MODEL EVALUATION AND IMPROVEMENT

Generalization, Overfitting and Underfitting In supervised learning, we want to build a model on the training data and
then be able to make accurate predictions on new, unseen data that has the same characteristics as the training dataset we
used. If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to
the test set. We measure whether an algorithm will perform well on new data using the evaluation on the test set.
Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works
well on the training set but is not able to generalize to new data. Choosing too simple a model, instead, causes underfitting.
The more complex we allow our model to be, the better its performance on the training data. However, if our model
becomes too complex, we start focusing too much on each individual data point in our training set, and the model will not
generalize well to new data. There exists a sweet spot that will yield the best generalization performance.

The concepts of overfitting and underfitting are closely linked to the bias-variance trade-off. Bias error reflects the degree to
which a model fits the training data: algorithms with erroneous assumptions produce high bias and a poor approximation,
causing underfitting and high in-sample error. Bias therefore results in underfitting of the data.
Variance error reflects how much the model's results change in response to new data from the validation and test samples:
unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error. High variance
therefore gives rise to overfitting.

How to combat overfitting


o Using more training data: the more training data we have, the harder it is to overfit by learning too much from any
single training example. It is important to note that the model complexity is intimately tied to the variety of inputs
contained in your training dataset: the larger the variety of datapoints your dataset contains, the more complex a
model you can use without overfitting. Usually, collecting more datapoints will yield more variety, so larger
datasets allow building more complex models. However, simply duplicating the same datapoints or collecting very
similar data will not help. Never underestimate the power of using more data!
o Using regularization: adding a penalty to the loss function whenever the model assigns too much explanatory
power to any one feature, or allows too many features to be taken into account. In such a case, optimal
regularization can be achieved by varying the regularization parameter (a minimal sketch follows).
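As an illustration only, a sketch of varying the regularization parameter alpha in ridge regression (the data and values are arbitrary assumptions):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for alpha in [0.01, 1.0, 100.0]:
        # a larger alpha shrinks the coefficients more strongly (more regularization)
        ridge = Ridge(alpha=alpha).fit(X_train, y_train)
        print(alpha, round(ridge.score(X_train, y_train), 3), round(ridge.score(X_test, y_test), 3))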

Model Evaluation We will focus on the supervised methods, regression and classification, as evaluating and selecting
models in unsupervised learning is often a very qualitative process. In the following, we will expand on two aspects of
model performance evaluation. We will first introduce cross-validation, a more robust way to assess generalization
performance, and discuss methods to evaluate classification and regression performance that go beyond the default
measures of accuracy and R2 provided by the score method. We will also discuss grid search, an effective method for
adjusting the parameters in supervised models for the best generalization performance.

Cross-validation One of the challenges of machine learning is training models that are able to generalize well to unseen
data. Indeed, it is fundamental to first remember that train_test_split performs a random split of the data. Imagine that we
are lucky when randomly splitting the data, and all the examples that are hard to classify end up in the training set. In that
case, the test set will only contain easy examples, and our test-set accuracy will be unrealistically high. Conversely, if we
are unlucky, we might have randomly put all the hard-to-classify examples in the test set and consequently obtain an
unrealistically low score.
Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than
using a split into training and a test set. In cross-validation, the data is split repeatedly and multiple models are trained.
The most commonly used version of cross-validation is k-fold cross validation, where k is a user-specified number, usually
5 or 10. It is recommended to use at least 5-fold splits.
When doing k-fold cross-validation (a minimal sketch follows this list):
o the data is first partitioned into k parts of (approximately) equal size, called folds;
o we then train the model using k-1 folds and evaluate its performance on the remaining fold;
o we repeat this process k times and average the resulting scores.
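A minimal k-fold cross-validation sketch with scikit-learn's cross_val_score (the dataset and model are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores)          # one accuracy score per fold
    print(scores.mean())   # averaged cross-validation score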

Main benefits
o having multiple splits of the data provides information about how sensitive our model is to the selection of the
training dataset;
o compared to using a single split of the data, we can use our data more effectively. We are able to use 80-90% of
the data instead of 70-75%.

Main drawback
o a potential drawback of cross validation is the computational cost, especially when paired with a grid search for
the hyperparameter tuning.

When using time-series data, the order of the data matters. Can we still apply cross-validation? Yes, but we need to use a
specific method suited to the problem at hand. There are two simple schemes we can use (a sketch using scikit-learn's
TimeSeriesSplit follows the figure below):
o sliding window;
o expanding window.

[Figure: sliding-window vs. expanding-window cross-validation for time series]
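As an illustration only, scikit-learn's TimeSeriesSplit implements an expanding window by default; capping max_train_size approximates a sliding window (the data here is a toy example):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(20).reshape(-1, 1)   # ordered observations (e.g. daily returns)
    y = np.arange(20)

    # expanding window: the training set grows at each split
    for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
        print("train", train_idx, "test", test_idx)

    # sliding window: cap the training size so the window rolls forward
    for train_idx, test_idx in TimeSeriesSplit(n_splits=4, max_train_size=5).split(X):
        print("train", train_idx, "test", test_idx)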


Let’s imagine that our dataset has the following labels:

00000000000000000000000000000000000000000000000000000000000000001111111111

What if one of the sample splits (or multiple) captures only zeros?

SOLUTION 1: STRATIFIED CROSS-VALIDATION: In stratified cross-validation, we split the data such that the proportions
between classes are the same in each fold as they are in the whole dataset. For instance, if 95% of your samples belong to
Class A and 5% of your samples belong to Class B, then stratified cross-validation ensures that in each fold 95% of the
samples belong to Class A and 5% belong to Class B. It is usually a good idea to use stratified k-fold cross-validation
instead of plain k-fold cross-validation to evaluate a classifier, because it results in more reliable estimates of
generalization performance.
StratifiedKFold is a variation of K-Fold that returns stratified folds. The folds are made by preserving the percentage of
samples for each class.

SOLUTION 2: SHUFFLING THE DATA Another way to resolve this problem is to shuffle the data instead of stratifying the
folds, in order to remove the ordering of the samples by label. We can do that by setting the shuffle parameter of KFold to
True. It is good practice to fix the random_state to obtain a reproducible shuffling.
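A minimal sketch of both solutions (the classifier and dataset are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000)

    skf = StratifiedKFold(n_splits=5)                        # solution 1: stratified folds
    kf = KFold(n_splits=5, shuffle=True, random_state=42)    # solution 2: shuffled folds
    print(cross_val_score(clf, X, y, cv=skf).mean())
    print(cross_val_score(clf, X, y, cv=kf).mean())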

Leave-one-out cross validation : The main idea of this method is to predict each instance, training on all N-1 instances. You
can think of leave-one-out cross validation as k-fold cross validation, where each fold is a single sample. For each split, you
pick a single data point to be the test set. This can be very time-consuming, particularly for large datasets, but sometimes
provides better estimates on small datasets.

Shuffle-split cross-validation: each split samples train_size many points for the training datasets and test_size many
(disjoint) points for the test set. This splitting is repeated n_iter times.
Cross-validation within groups: another very common setting for cross-validation is when there are groups in the data that
are highly related. Say you want to build a system to recognize emotions from pictures of faces, and you collect a dataset of
pictures of 100 people where each person is captured multiple times, showing various emotions. The goal is to build a
classifier that can correctly identify emotions of people not in the dataset. If we used the default stratified cross-validation
to measure the performance of a classifier here, it is likely that pictures of the same person will be both in the training and
the test set. It will be much easier for a classifier to detect emotions in a face that is part of the training set, compared to a
completely new face. To accurately evaluate the generalization to new faces, we must ensure that the training and test sets
contain images of different people.
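One way to implement this grouped setting in scikit-learn is GroupKFold; the following sketch uses toy data and hypothetical group labels (one group per person):

    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold, cross_val_score

    X, y = make_blobs(n_samples=12, random_state=0)
    # assume the first 3 samples come from the same person, the next 4 from another, etc.
    groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]

    scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=GroupKFold(n_splits=3))
    print(scores)   # no person (group) appears in both the training and the test fold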
Grid Search Now we know how to evaluate how well a model generalizes. Consequently, we can take the next step and
improve the model's generalization performance by tuning its parameters. The most commonly used method is grid
search, which basically means trying multiple possible combinations of the parameters of interest. Note that the terms
parameters and hyperparameters have different meanings. Despite the difference, it is common to speak simply of parameters,
and the reader can infer from the context whether we mean model parameters or hyperparameters:
o model parameters: properties of the training data that are learnt during training. They are internal to the model
and can be node weights in a NN or split points in decision trees;
o hyperparameters: cannot be learned during training but are those parameters set beforehand. Hyperparameters
control the learning process and are external to the model (i.e. learning rate, hidden layers and hidden units).
Let’s assume that we are tuning two parameters of the SVC model that employs the RBF kernel: the C parameter and the
gamma parameter.

Let’s say that we want to test six different settings for C and gamma. As a result, we have 36 combinations of parameters in
total. Looking at all possible combinations creates a table (or grid) of parameter settings. Because it is such a common
task, there are standard methods in scikit-learn to implement it. While the method of splitting the data into a training, a
validation and a test set that we just saw is workable, and relatively commonly used, it is quite sensitive to exactly how the
data is split. For a better estimate of the generalization performance, instead of using a single split into a training and a
validation set, we can use cross-validation to evaluate the performance of each parameter combination.

Let’s say we use 10-fold-CV – how many models do we have to train with six gamma and C values each? 6 x 6 x 10 = 360

Implementation in Python To implement grid search with cross-validation, we can use the GridSearchCV class in scikit-learn.
To use the GridSearchCV class, you first need to specify the parameters you want to search over using a dictionary.
GridSearchCV will then perform all the necessary model fits.

GridSearchCV will use cross-validation in place of the split into a training and validation set that we used before. However,
we still need to split the data into a training and a test data set, to avoid overfitting the parameters.
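A minimal GridSearchCV sketch for an RBF-kernel SVC; the dataset and the grid values are illustrative, but they mirror the 6 x 6 x 10 = 360 fits mentioned above:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    param_grid = {'C':     [0.001, 0.01, 0.1, 1, 10, 100],
                  'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}   # 6 x 6 = 36 combinations

    grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)  # 36 x 10 = 360 model fits
    grid.fit(X_train, y_train)                                 # grid search on the training data only
    print(grid.best_params_, grid.best_score_)
    print(grid.score(X_test, y_test))                          # evaluation on the held-out test set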
Supervised Performance Metrics Supervised performance metrics can fundamentally be divided into two groups: those used
for assessing the performance of regression algorithms and those used to assess the accuracy of classification algorithms:
o Regression: mean absolute error (MAE), mean squared error (MSE), R-squared (R2), adjusted R-squared and root
mean squared error (RMSE);
o Classification: accuracy, precision, recall, area under ROC curve (AUC), confusion matrix.
The mean absolute error (MAE) measures the average magnitude of the errors in a set of forecasts, without considering
their direction: MAE = (1/n) * Σ |Y_i − Ŷ_i|, where the sum runs over i = 1, …, n. The MAE is a linear score, which means
that all the individual differences are weighted equally in the average. It gives an idea of how wrong the predictions were:
the measure conveys the magnitude of the error, but not its direction.
The mean squared error (MSE) represents the average squared difference between the actual values and the estimated
values (residuals): MSE = (1/n) * Σ (Y_i − Ŷ_i)², with the sum again over i = 1, …, n. Taking the square root of the mean
squared error converts the units back to the original units of the output variable, which can be meaningful for description
and presentation. This is called the root mean squared error (RMSE).
The R2 metric provides an indication of the goodness of fit of the predictions to the actual values. In statistics, this measure is
also called the coefficient of determination. Just like R2, adjusted R2 also shows how well terms fit a curve or line, but
adjusts for the number of predictors in the model.
Which one to use? In terms of a preference among these evaluation metrics, if the main goal is predictive accuracy, then
RMSE is the best choice. R2 and adjusted R2 are often used for explanatory purposes by indicating how well the selected
independent variable(s) explains the variability in the dependent variable.
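A minimal sketch of these regression metrics with scikit-learn (the numbers are arbitrary illustrations):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    y_true = np.array([3.0, -0.5, 2.0, 7.0])
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])

    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                    # back in the original units of Y
    r2 = r2_score(y_true, y_pred)
    print(mae, mse, rmse, r2)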
When it comes to classification performance metrics, two fundamental kinds of problems need to be distinguished:
o Binary classification problems;
o Multi-class classification problems.
The confusion matrix represents the performance of our algorithm through a graphic representation of:
- True positives (TP): predicted positive and actually positive values;
- False positives (FP) = Type I error: predicted positive and actually negative;
- True negatives (TN): predicted negative and actually negative;
- False negatives (FN)=Type II error: predicted negative and are actually positive.

Additional metrics, such as accuracy, precision and recall, can be computed:


o Accuracy (A) = (TP+TN)/(TP+FP+TN+FN) or (TP+TN)/TOTAL
o Precision (P)= TP/(TP+FP)
o Recall (R) = TP/(TP+FN)

Accuracy basically computes the number of correct predictions made as a ratio of all predictions made. This is the most
common evaluation metric for classification problems and is also the most misused. It is most suitable when there are an
equal number of observations in each class (which is rarely the case) and when all predictions and the related prediction
errors are equally important, which is often not the case.
Why can’t we just use accuracy for everything? Imbalanced data sets refer to datasets where one of two classes is much
more frequent than the other one (in the binary case). Example : imagine that you are looking at 100 transactions and in
your dataset, there are 99 non-fraudulent transactions and 1 fraudulent one; in other words, 99% of the samples belong to
the non-fraudulent class. In reality, imbalanced data is the norm, and it is rare that the events of interest have equal or
even similar frequency in the data.
Precision is the percentage of positive instances out of the total predicted positive instances. Precision is also known as
positive predictive value (PPV) and is a good measure to determine when the cost of a false positive is high.
Recall is the percentage of positive instances out of the total of actual positive instances. Recall is also known as sensitivity,
hit rate or true positive rate (TPR). Recall is a good measure when there is a high cost associated with false negatives (e.g.
fraud detection and disease diagnostics).
The F1 score is the harmonic mean of precision and recall: F1 score = (2*P*R)/(R+P), where P is precision and R is recall. This
metric is more appropriate than accuracy when the dataset has an unequal class distribution and it is necessary to balance
precision and recall. High scores on both of these metrics suggest good model performance.
Area under ROC curve (AUC) is an evaluation metric for binary classification problems. Receiver Operating Characteristic
(ROC) is a probability curve, and AUC represents degree or measure of separability. It tells how much the model is capable
of distinguishing between classes.
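A minimal sketch of the classification metrics with scikit-learn (the labels and scores are arbitrary illustrations, e.g. 1 = fraudulent transaction):

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score, roc_auc_score)

    y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    y_pred  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                      # hard class predictions
    y_score = [0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.9, 0.8, 0.4, 0.7]  # predicted probabilities

    print(confusion_matrix(y_true, y_pred))    # rows: actual, columns: predicted ([[TN, FP], [FN, TP]])
    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred))     # TP / (TP + FP)
    print(recall_score(y_true, y_pred))        # TP / (TP + FN)
    print(f1_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_score))      # AUC uses the scores, not the hard labels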

Which model to use?

Selecting the perfect machine learning model is both an art and a science.
These criteria will play a role in the final decision:
o Simplicity;
o Training time;
o Presence of non-linearity in the data;
o Robustness to overfitting;
o Size of the dataset;
o Number of features;
o Model interpretation.

LECTURE 5 – TEXTUAL ANALYSIS THROUGH ML ALGORITHMS

As more information becomes available, it becomes more difficult to find and discover what we need. News, indeed, has
always been a key factor in investment decisions: it is well established that company-specific, macroeconomic and political
news strongly influence financial markets. We need tools to help us organize, search and understand these vast amounts
of information. Computational linguistics, also known as natural language processing (NLP), is the subfield of computer
science concerned with using computational techniques to learn, understand and produce human language content.
The goals of NLP involve:
o aiding human-human communication (e.g. machine translation, MT);
o aiding human-machine communication (e.g. with conversational agents);
o benefiting both humans and machines by analysing and learning from the enormous quantity of human language
content that is now available online.
Natural language processing (NLP) is, more specifically, a branch of AI that deals with the problems of making a machine
understand the structure and the meaning of natural language as used by humans. Several techniques of machine learning
and deep learning are used within NLP. NLP has many applications in the financial sector in areas such as:
o sentiment analysis;
o chatbots;
o document processing;
o risk management (liquidity risk management, credit default modelling, etc.).
In the financial landscape, one of the earliest applications of NLP was implemented by the US Securities and Exchange
Commission (SEC).
Automation using NLP is well-suited to the context of finance, since it reduces the strain that repetitive, low-value tasks
put on human employees. It tackles routine, everyday processes, freeing up teams to finish their high-value work. In
doing so, it drives enormous time and cost savings. A lot of information, such as sell-side reports, earnings calls and
newspaper headlines, is communicated via text, making NLP very useful in the financial domain. Nonetheless, it is
unfortunately quite common for companies to obscure machine-readable disclosure by inserting tables into the
documents in picture format. Pictures can be parsed using optical character recognition (OCR).

Terminology

- Corpus (pl. corpora): in the context of text analysis, the dataset is often called the corpus;
- Document: each data point, represented as a single text, is called a document;
- Token is equivalent to a word.
Not every string is text data. There are, indeed, four kinds of string data that you might see:
o Categorical data: can easily be mapped into a variable (e.g. sources of data with subcategories “balance sheet”,
“cash-flow statement”, “income statement”);
o Free strings that can be semantically mapped to categories (e.g. sources of data with the same subcategories as
above, but with shortcuts and misspellings such as “P&L statement” or “CF statement”);
o Structured string data: data items with a certain underlying structure (e.g. addresses, phone numbers);
o Text data: phrases, sentences, etc. that do not belong to any of the above groups.

NLP processing follows a specific pipeline: pre-processing, feature representation, inference.

Data Pre-processing Three Python-based NLP libraries are a useful starting point for NLP. NLTK provides an easy-to-use
interface to over 50 corpora and lexical resources such as WordNet. TextBlob and spaCy are also very useful libraries.
Tokenization is the task of splitting a text into meaningful segments, called tokens. These segments could be words,
punctuation, numbers, or other special characters that are the building blocks of a sentence. In addition, extremely
common words that offer little value in modelling are at times excluded from the vocabulary. These words are called
stop words. In finance, we do not always drop all stop words: some of them can play an important role, e.g. in
differentiating sentiment.
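A minimal tokenization and stop-word-removal sketch with NLTK (the sentence is an arbitrary illustration):

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    text = "The company reported higher than expected earnings, but guidance was not raised."
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))

    # note: 'not' is a stop word in NLTK, yet it can carry sentiment in finance
    filtered = [t for t in tokens if t.isalnum() and t not in stop_words]
    print(filtered)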
Normalization is a necessary step, since the vocabulary often contains singular and plural versions of the same words. Different
verb forms (past tense, present tense, etc.) and nouns related to a verb are very common. The presence of such words
would likely result in overfitting and poor generalization performance. Solution: there exist processing methods that try to
extract some normal form of a word, such as lemmatization and stemming – both are forms of normalization. Through
stemming, each word is represented by its stem. This is done using a rule-based heuristic, like dropping common
suffixes. For instance, the English stemmer maps connection, connections, connective, connected, and connecting to
connect.
Lemmatization is, instead, the process of converting inflected forms of a word into its morphological root (known as the
lemma). It is basically a process in which a dictionary of known word forms is used, and the role of the word in the
sentence is taken into account. Lemmatization therefore depends on knowledge of the word and of the language structure.
For example, the lemma of the words “analyzed” and “analyzing” is “analyze”. Lemmatization is
computationally more expensive and more advanced. The major difference between the two processes is that stemming can
often create non-existent words, whereas lemmas are actual words.
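A minimal sketch contrasting stemming and lemmatization with NLTK (the words are illustrative):

    import nltk
    nltk.download('wordnet')
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["analyzing", "analyzed", "connecting", "connected"]:
        # stemming: rule-based suffix stripping, may produce non-words (e.g. "analyz")
        # lemmatization: dictionary-based, returns an actual word given the part of speech
        print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos='v'))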
Part – of-speech (PoS) tagger uses language structure and dictionaries to tag every token in the text with a corresponding
part of speech. Some common POS tags are noun, verb, adjective and proper noun.
Named entity recognition is an optional next step in data pre-processing that seeks to locate and classify named entities in
text into pre-defined categories. These categories can include name of persons, organizations, locations, expressions of
times, quantities, monetary values, or percentages.
Additional pre-processing methods are the following:
o Lowercasing: lowercasing the alphabet removes distinctions between the same word written with upper- and
lower-case letters. This helps the computer process the same words consistently;
o Removal of non-alphanumeric characters: these include any characters that are not letters or digits, such as >
or ?. One can use filter() to remove all non-alphanumeric characters from a string; for instance, str.isalnum()
returns True if str contains only alphanumeric characters;
o Dependency parsing: the process of extracting the dependency parse of a sentence to represent its
grammatical structure. It defines the dependency relationships between headwords and their dependents;
o Coreference resolution: the process of connecting tokens that represent the same entity. It is common in
language to introduce a subject by name in one sentence and then refer to it as him/her/it in
subsequent sentences.

Feature representation Word embeddings convert textual data into numerical data. The process of converting NLP texts
into numbers is called vectorization. Representing the meaning of a word is a fundamental task in NLP. Two main
approaches for computing word embeddings can be identified in the literature: frequency-based (also called count-based:
count vectorization, TF-IDF vectorization, co-occurrence vectorization) and prediction-based (also known as learning-based:
pre-trained models and customized deep learning-based feature representations).
In count-based models, the semantic similarity between words is determined by counting co-occurrence frequencies.
Examples of count-based models are n-grams, bag of words, etc. Bag of Words (BoW) is a method in which documents
are described by word occurrences while completely ignoring the relative position of the words in the document. In NLP,
a common technique for extracting features from text is to place all the words that occur in the text in a bucket. This
approach is called a Bag of Words model; it is referred to as a bag of words because any information about the structure
of the sentence is lost. Since the BoW model creates a feature for every unique word in the data, the resulting matrix can
contain thousands of features, which means that the matrix can sometimes become very large in memory. Luckily, we can
reduce the amount of data we need to store by using a sparse matrix; one of the nice features of CountVectorizer is that its
output is a sparse matrix by default. One of the main disadvantages of a bag-of-words representation with unigrams is that
word order is completely discarded. For instance, “it is bad, not good at all” and “it is good, not bad at all” have exactly
the same representation, even though their meanings are inverted. A solution to this problem is represented by n-grams:
one way of capturing context when using a bag-of-words representation is to consider the counts of pairs or triplets (or
more) of tokens that appear next to each other. N-grams are representations of word or token sequences; they can offer
invaluable contextual information that complements and enriches the unigram BoW. Pairs of tokens are known as bigrams,
triplets of tokens are known as trigrams, and more generally sequences of n tokens are known as n-grams. We can use BoW
to produce word counts, which is a very basic method but can be a good starting point. Alternatively, we can use BoW to
produce a word co-occurrence matrix, which allows us to count how often words co-occur in some environment: for some
word w occurring in a document, we consider the context window surrounding w, and we build a symmetric word-by-word
matrix in which each entry counts the number of times one word appears inside the context window of another word,
across all documents. The concepts of BoW and the co-occurrence matrix are the building blocks for more
sophisticated NLP algorithms.
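As an illustration only, a minimal bag-of-words sketch with scikit-learn's CountVectorizer on a toy corpus; ngram_range adds bigrams on top of unigrams:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]

    # unigrams and bigrams; the output is a sparse document-term matrix
    vect = CountVectorizer(ngram_range=(1, 2))
    X = vect.fit_transform(corpus)
    print(vect.get_feature_names_out())   # the learned vocabulary (tokens and token pairs)
    print(X.toarray())                    # word/bigram counts per document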
Below is the co-occurrence matrix for the corpus containing: I like deep learning, I like NLP and I enjoy flying.

As an alternative to raw BoW counts, we can calculate weighted word frequencies. By far the most popular method for that is
TF-IDF, which stands for Term Frequency – Inverse Document Frequency. In a nutshell, it is a word-frequency score
that tries to highlight words that are more interesting (i.e. frequent within a document, but not across documents).
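A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer on the same toy corpus (illustrative only):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(corpus)

    # words that appear in many documents (e.g. "like") receive lower weights than document-specific words
    print(tfidf.get_feature_names_out())
    print(X.toarray().round(2))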

Prediction-based word embedding techniques take context into account. In predictive models the word vectors are learnt
by trying to improve the predictive ability, i.e. minimizing the loss between the target word and the context words. Some
of the models for learning word embeddings from text include word2vec, GloVe and spaCy’s pretrained word embedding
models.

Example The word queen:


o Main idea behind word2vec: king − man + woman ≈ queen;
o Main idea behind GloVe: the distance between king and queen is roughly the same as the distance between man
and woman, or between brother and sister.

Inference Supervised NLP often works with naïve Bayes classifiers, since the Naïve Bayes model can produce reasonable
accuracy using simple assumptions. Unsupervised NLP is a relatively less developed subdomain in which measuring
document similarity is among the most common tasks. Latent Dirichlet Allocation (LDA) has been
extensively used for topic modelling – a growing area of research in which NLP practitioners build probabilistic generative
models to reveal likely topic attributions for words.

Topic modelling provides methods for automatically organizing, understanding, searching and summarizing large
electronic archives:
1. Discover the hidden themes in the collection:
2. Annotate the documents according to these themes;
3. Use annotations to organize, summarize, search and form predictions.
Often, when people talk about topic modelling, they refer to one particular decomposition method called Latent Dirichlet
Allocation (LDA), a model which tries to find groups of words (the topics) that appear together frequently.
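A minimal topic-modelling sketch with LDA in scikit-learn (the corpus and the number of topics are arbitrary illustrations):

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["stocks fell as rates rose", "bond yields and rates climbed",
              "the team won the football match", "a late goal decided the match"]

    vect = CountVectorizer(stop_words='english')
    X = vect.fit_transform(corpus)                      # LDA works on bag-of-words counts

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    words = vect.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [words[i] for i in topic.argsort()[-3:]]  # three most likely words per topic
        print("topic", k, top)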

LECTURE 6 – UNSUPERVISED MACHINE LEARNING

Unsupervised learning is a type of machine learning used to draw inferences from data sets consisting of input data
without labelled responses. There are two types of unsupervised learning: dimensionality reduction and clustering.
o Dimensionality reduction is the process of reducing the number of features, or variables, in a dataset while
preserving information and overall model performance. It is a common and powerful way to deal with datasets
that have a large number of dimensions;
o Clustering is a sub-category of unsupervised learning techniques that allows us to discover hidden structures in
data. The goal of clustering is to find a natural grouping in data so that items in the same cluster are more similar
to each other than those from different clusters.

[Figure: dimensionality reduction vs. clustering]


What is the main challenge of unsupervised ML? Algorithm performance evaluation : a major challenge of unsupervised
machine learning algorithms is evaluating whether the algorithm learned something useful. However, there is no way for
us to tell the algorithms what we are looking for, and often the only way to evaluate the result of an unsupervised
algorithm is to inspect it manually.

Dimensionality reduction The most frequently-used techniques for dimensionality reduction include:
o principal component analysis (PCA)
o Kernel principal component analysis (KPCA)
o t- distributed stochastic neighbour embedding (t-SNE)
Linear algorithms (such as PCA) force the new variables to be linear combinations of original features; non-linear
algorithms (such as KPCA and t-SNE) can capture more complex structures in the data.
Dimensionality reduction can be used in many fields of finance: portfolio management, yield curve construction and
interest rate modelling, speed and accuracy enhancement of a trading strategy,…
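A minimal dimensionality-reduction sketch with PCA (the dataset is an arbitrary illustration):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scales

    pca = PCA(n_components=2)                       # keep the first two principal components
    X_reduced = pca.fit_transform(X_scaled)
    print(X_reduced.shape)                          # (n_samples, 2)
    print(pca.explained_variance_ratio_)            # share of variance retained by each component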

Clustering Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labelling of
groups of points based on their similarity. In this lecture, we will cover the following three clustering techniques: k-means
clustering, hierarchical clustering and affinity propagation clustering. When performing clustering algorithms, we essentially
tell the algorithm to look for similarities: it will focus on minimizing dissimilarity (i.e. the distance between
datapoints). Clustering has a wide range of applications across finance: portfolio construction (pairs trading, peer
selection), investor classification, risk management (identifying securities with a similar risk profile, …).
Pairs trading is a non-directional, relative value investment strategy which seeks to identify two companies or funds with
similar characteristics whose equity securities are currently trading at a price relationship that is out of their historical
trading range. This investment strategy will entail buying the undervalued security while short selling the overvalued
security, all while maintaining market neutrality. It can also be referred to as market neutral or statistical arbitrage.
Asset management and investment allocation is a tedious and time-consuming process in which investment managers
often must design customized approaches for each client or investor. What if we were able to organize these clients into
particular investor profiles, or clusters, wherein each group is indicative of investors with similar characteristics? In
practice, one often takes data from the Individual Investment Policy Statement (IPS) to determine the investor’s ability
and willingness to take risk.
k-means clustering is the most well-known clustering technique. It is a centroid-based, or distance-based, algorithm: it
tries to find cluster centres that are representative of certain regions of the data. The algorithm alternates
between two steps: 1) assigning each data point to the closest cluster centre, and then 2) setting each cluster centre to the
mean of the data points assigned to it. The algorithm is finished when the assignment of instances to clusters no
longer changes.
The k-means algorithm divides a set of N samples X into K disjoint clusters S, each described by the mean μ of the samples
in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X,
although they live in the same space. The k-means algorithm aims to choose centroids that minimize the inertia, also
known as the within-cluster sum-of-squares criterion. The centroid represents the centre of the cluster and is
calculated as the arithmetic mean. Inertia can be interpreted as a measure of how internally coherent the clusters are.

Main strengths:
o simplicity, wide range of applicability, fast convergence, and linear scalability to large data while producing
clusters of an even size;
o it is most useful when we know the exact number of clusters, k, beforehand.

Main weaknesses
o the main weakness of k-means is having to tune the “number of clusters” hyperparameter;
o additional drawbacks include the lack of guarantee to find a global optimum and its sensitivity to outliers;
o it can only capture relatively simple shapes.

Main parameters
o number of clusters: the number of clusters and centroids to generate;
o maximum iterations: maximum iterations of the algorithm for a single run.
How to find the optimal number of clusters in the data? We can plot the objective function for multiple values of k and
look for an abrupt change in it, which is highly suggestive of the number of clusters in the data. This technique for
determining the number of clusters is known as “knee finding” or “elbow finding”. As k increases, the average
distortion decreases, each cluster has fewer constituent instances, and the instances are closer to their
respective centroids. However, the improvement in average distortion declines as k increases. The elbow method
plots the value of the cost function produced by different values of k.
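A minimal k-means / elbow-method sketch (synthetic data; the true number of clusters here is an arbitrary illustration):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    for k in range(1, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # inertia = within-cluster sum of squares; look for the "elbow" where it stops dropping sharply
        print(k, round(km.inertia_, 1))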

Which statistical measure is preferred over the mean (our centroid) in the presence of outliers? The median (here: the
medoid). One issue with the k-means algorithm is its sensitivity to outliers: as the centroid is calculated as the mean of the
observations in a cluster, extreme values in a dataset can disrupt a clustering solution significantly. K-medoids is a popular
approach to overcoming this problem. As the name suggests, this alternative algorithm uses medoids rather than centroids
as the centre point of a cluster, which implies that the centre of a cluster must be one of the observations in that specific
cluster.

Hierarchical clustering involves creating clusters that have a predominant ordering (i.e. a hierarchy). The main advantage
of hierarchical clustering is that we do not need to specify the number of clusters; the model determines that by itself. This
clustering technique is divided into two types:
o agglomerative hierarchical clustering: a bottom-up approach in which each observation starts in its own cluster,
and pairs of clusters are merged as one moves up the hierarchy;
o divisive hierarchical clustering: a top-down approach, meaning that all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
A dendrogram is a tree-based representation of the clusters, i.e. a visualization of a hierarchical clustering.
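A minimal hierarchical-clustering sketch combining scikit-learn's AgglomerativeClustering with a SciPy dendrogram (the data is an arbitrary illustration):

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=20, centers=3, random_state=0)

    # agglomerative (bottom-up) clustering with Ward linkage and Euclidean distance
    labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
    print(labels)

    # dendrogram of the full merge hierarchy
    dendrogram(linkage(X, method='ward'))
    plt.show()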
Main strengths:
o in contrast to k-means, hierarchical clustering will create a hierarchy of clusters and therefore does not require us
to pre-specify the number of clusters;
o its results can be easily visualized using dendrograms.

Main weaknesses:
o unfortunately, agglomerative clustering still fails at separating complex shapes present in the data structure;
o the choice of both the distance metric and the linkage criteria is often arbitrary. There is rarely any strong
theoretical basis for such decisions. A core principle of science is that findings are not the result of arbitrary
decisions, which makes the technique of dubious relevance in modern research.

Main parameters
o distance metric is the metric used to compute the linkage
o linkage criterion determines which distance to use between sets of observations. The algorithm will merge the
pairs of clusters that minimize this criterion.

Divisive hierarchical clustering in comparison to agglomerative clustering

Main strength:
o bottom-up methods make clustering decisions based on local patterns without initially taking into account the
global distribution. These early decisions cannot be undone. Top-down clustering benefits from complete
information about the global distribution when making top-level partitioning decisions.

Main weaknesses:
o it is quite sensitive to initialization, due to the possible division of data into two clusters at the first step.

Affinity Propagation Clustering Affinity propagation clustering creates clusters by sending messages between pairs of
samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those
most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the
exemplar of the other, which is updated in response to the value from other pairs. This updating happens iteratively until
convergence, at which point the final exemplars are chosen, and hence the final clustering is given.
This clustering methodology describes a data set using a small number of exemplars. These are members of the input set
that are representative of clusters. This is the main difference between APC and k-means clustering.
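A minimal affinity-propagation sketch with scikit-learn (the data and parameter values are arbitrary illustrations):

    from sklearn.cluster import AffinityPropagation
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

    # damping smooths the message updates; preference controls how many exemplars emerge
    ap = AffinityPropagation(damping=0.9, preference=-50, random_state=0).fit(X)
    print(len(ap.cluster_centers_indices_))   # number of clusters chosen from the data
    print(ap.labels_[:10])                    # cluster assignment of the first samples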

Main strengths:
o it chooses the number of clusters based on the data provided;
o the algorithm is fast.

Main weaknesses:
o the main drawback of affinity propagation is its complexity (time and memory-complexity)
o due to complexity, most appropriate for small- to medium-sized data sets;
o the algorithm often converges to suboptimal solutions, and at times it can fail to converge.

Main parameters: two important parameters are:


o the preference, which controls how many exemplars are used, and
o the damping factor which damps the responsibility and availability messages to avoid numerical oscillations when
updating these messages.
