Analytics
Part 1
MODERN DATA ANALYTICS
[G0Z39A]
PROF. DR. IR. JAN DE SPIEGELEER
Contents
◦ Tools
◦ Scikit-Learn
◦ Three Case Studies
◦ Gaussian Mixtures
◦ Spectral Clustering
◦ PCA
Tools
3
Tools
Supervised Learning
◦ Classification
◦ Regression
Unsupervised Learning
◦ Clustering
◦ Dimensionality Reduction
4
Supervised vs. Unsupervised Learning
◦ In a supervised learning model, the algorithm learns on a labeled dataset (X, y), providing an answer key that the
algorithm can use to evaluate its accuracy on training data.
◦ A model is chosen that maps X -> y
◦ This model has parameters (e.g. the coefficients of a regression)
◦ The training of a machine learning model = parameter optimisation. Hyper-parameters (e.g. a learning rate or the number of clusters) are fixed before training.
◦ The optimisation uses a loss function.
◦ Techniques such as Gradient Descent are used (also in Deep Learning)
◦ An unsupervised model, in contrast, is given unlabeled data that the algorithm tries to make sense of by extracting
features and patterns on its own.
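The training loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's own code: the dataset is synthetic, and the names w, b and learning_rate are assumptions chosen for the example. The coefficients w and b are fitted by gradient descent on a mean squared error loss, while the step size is fixed beforehand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labelled dataset (X, y): y = 3*x + 2 plus a little noise (illustrative only).
X = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0        # regression coefficients, fitted during training
learning_rate = 0.1    # step size, fixed before training starts

for _ in range(2000):
    residual = (w * X + b) - y
    loss = np.mean(residual ** 2)           # mean squared error loss
    grad_w = 2.0 * np.mean(residual * X)    # d(loss)/dw
    grad_b = 2.0 * np.mean(residual)        # d(loss)/db
    w -= learning_rate * grad_w             # step against the gradient
    b -= learning_rate * grad_b

print(f"w = {w:.2f}, b = {b:.2f}, loss = {loss:.4f}")
```

The fitted w and b should end up close to the values used to generate the data, because each step moves the coefficients in the direction that lowers the loss.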
5
Structured vs. Unstructured Data
Another dimension along which to split data-analytics problems is the type of data. Broadly speaking, there are two categories:
◦ Structured Data : Structured data is highly organized and formatted so that it is easily searchable.
◦ Structured data is most often categorized as quantitative data, and it's the type of data most of us are used to working
with. Structured data fits neatly within fixed fields and columns in relational databases and spreadsheets.
◦ Examples of structured data include names, dates, addresses, credit card numbers, stock information, geolocation,
and more.
◦ Structured data is highly organized and easily processed by machines. Those working within relational
databases can input, search, and manipulate structured data relatively quickly. This is the most attractive feature of
structured data.
◦ Unstructured Data : Unstructured data is difficult to deconstruct because it has no pre-defined model, meaning it
cannot be organized in relational databases. Here NoSQL databases are typically used, and specific tools (e.g.
Natural Language Processing, web scraping) are required.
6
Scikit-Learn
◦ Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent
interface in Python.
◦ The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn.
◦ The library is focused on modeling data. It is not focused on loading, manipulating and summarizing
data. For these features, refer to NumPy and Pandas.
◦ Scikit-learn also provides a limited range of sample datasets.
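As a small illustration of the consistent interface and the bundled sample data, the sketch below loads the Iris dataset and runs the usual fit / score cycle. The choice of LogisticRegression and of a 25% test split is an assumption made for the example; any other estimator could be dropped in with the same calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# One of the sample datasets shipped with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to check the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Every estimator exposes the same fit / predict / score methods.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"shape of X: {X.shape}, test accuracy: {accuracy:.2f}")
```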
7
PyTorch
Open-source deep learning software primarily developed by Facebook's AI Research lab.
◦ Tensors are similar to NumPy's ndarrays, with the addition that Tensors can also be used on a GPU
to accelerate computing.
◦ Central to all neural networks in PyTorch is the autograd package. The autograd package provides
automatic differentiation for all operations on Tensors.
Remember: Optimisation is about minimizing a loss function. Gradients help us to understand how
a particular parameter impacts the value of the loss function.
8
TensorFlow
Deep learning library released by Google in 2015 (1 year before PyTorch)
Larger user base than PyTorch (https://bit.ly/3uSMIJg)
Major release step from TensorFlow 1.0 to TensorFlow 2.0
With Keras, you have an easier entry point to learn TensorFlow.
Less “Python-esque” than PyTorch.
9
Cross Validation
10
Cross Validation
Cross-validation is a resampling procedure used to evaluate
machine learning models on a limited data sample.
- Kfold http://bit.ly/2WZ6oLK
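A minimal sketch of how k-fold splitting works, using scikit-learn's KFold on a toy array. The array contents and the choice of 5 folds are assumptions for illustration. Each sample lands in the held-out fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features each (illustrative only).
X = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
held_out = []
for train_idx, test_idx in kfold.split(X):
    # Train on 4 folds, evaluate on the remaining fold.
    fold_sizes.append(len(test_idx))
    held_out.extend(test_idx.tolist())
    print("train:", train_idx, "test:", test_idx)
```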
11
Cross Validation : variations
Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each
fold has the same proportion of observations with a given categorical value, such as the class
outcome value. This is called stratified cross-validation.
Repeated: This is where the k-fold cross-validation procedure is repeated n times, where
importantly, the data sample is shuffled prior to each repetition, which results in a different split
of the sample.
Nested: This is where k-fold cross-validation is performed within each fold of cross-validation,
often to perform hyperparameter tuning during model evaluation. This is called nested cross-
validation or double cross-validation.
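The three variations can be sketched with scikit-learn's splitter classes. This is an illustrative setup under stated assumptions: the Iris data, the SVC model and the small grid over C are chosen only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedKFold,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Stratified: each fold keeps the class proportions of the full sample.
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_scores = cross_val_score(SVC(), X, y, cv=stratified)

# Repeated: 5-fold cross-validation run 3 times, reshuffled before each repetition.
repeated = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
repeated_scores = cross_val_score(SVC(), X, y, cv=repeated)

# Nested: an inner loop tunes C, the outer loop evaluates the tuned model.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=inner)
nested_scores = cross_val_score(search, X, y, cv=stratified)

print(len(stratified_scores), len(repeated_scores), len(nested_scores))
```

In the nested case the hyper-parameter search never sees the outer test fold, so the outer scores are not biased by the tuning.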
12
Unsupervised Learning
CLUSTERING & PCA
13
Clustering
Kmeans
Mini-Batch KMeans
Gaussian Mixture Models
Spectral Clustering
14
K Means
K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms. Typically,
unsupervised algorithms make inferences from datasets
using only input vectors without referring to known, or
labelled, outcomes.
The objective of K-means is to group similar data points
together and discover underlying patterns. To achieve this
objective, K-means looks for a fixed number (k) of clusters in
a dataset.
In the chapter on data-preprocessing, K Means has been
covered in depth (credit spread example).
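As a quick reminder of the API, the sketch below runs scikit-learn's KMeans with a fixed k = 2 on two synthetic blobs. The data are generated for the example and are not the credit spread data referred to above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs around (0, 0) and (5, 5).
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
])

# k is fixed in advance; the algorithm only looks at the input vectors.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sort the centres by their first coordinate so the order is reproducible.
centers = kmeans.cluster_centers_[np.argsort(kmeans.cluster_centers_[:, 0])]
print(centers)
```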
15
Mini-Batch KMeans
The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches. Mini-batches are subsets of the
input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of
computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time
of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
MiniBatchKMeans converges faster than KMeans, but the quality of the results is reduced. In practice this difference in
quality can be quite small, as shown in the example below.
Example notebook :
http://bit.ly/3hwYCCc
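A small comparison, independent of the linked notebook, can be sketched as follows. The blob centres, sample size and batch size are assumptions for illustration. The inertia (sum of squared distances to the nearest centre) is used as the quality measure.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated centres (illustrative only).
X, _ = make_blobs(
    n_samples=3000,
    centers=[[-5.0, -5.0], [0.0, 0.0], [5.0, 5.0]],
    cluster_std=0.7,
    random_state=0,
)

full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0).fit(X)

# Relative difference in inertia between the two fits.
relative_gap = (mini.inertia_ - full.inertia_) / full.inertia_
print(f"KMeans inertia: {full.inertia_:.1f}")
print(f"MiniBatchKMeans inertia: {mini.inertia_:.1f}")
print(f"relative gap: {relative_gap:.4f}")
```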
16
Gaussian Mixture Models
Imagine the 10 datapoints are sampled from either a blue or a red population. Each
population can be described by a normal density function $N(\mu, \sigma)$. Our observation is the
result of mixing 2 Gaussian distributions:
17
Gaussian Mixture Models
For the blue population: $N(\mu_b, \sigma_b)$
For the red population: $N(\mu_r, \sigma_r)$
We will use the EM-algorithm (Expectation Maximisation) to
determine these population parameters.
It is a chicken-and-egg problem:
◦ To determine $\mu_b, \sigma_b, \mu_r, \sigma_r$ we need to know which points $x_i$
belong to blue or red
◦ To determine if $x_i$ is blue or red, we need
$\mu_b, \sigma_b, \mu_r, \sigma_r$
For each data point $x_i$ we can determine if it belongs to blue
or red (Eq GM 1), using:
$$P(x_i \mid b) = \frac{1}{\sqrt{2\pi}\,\sigma_b} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu_b}{\sigma_b}\right)^2\right)$$
$$P(x_i \mid r) = \frac{1}{\sqrt{2\pi}\,\sigma_r} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu_r}{\sigma_r}\right)^2\right)$$
18
Gaussian Mixture Models
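Procedure:
1. Choose initial values for $\mu_b, \sigma_b, \mu_r, \sigma_r$
2. Determine for each of the points the corresponding
population (blue/red) using the corresponding density
functions
3. After assigning the points to blue/red, recalculate
$\mu_b, \sigma_b, \mu_r, \sigma_r$
4. Repeat steps 2 and 3 until the parameters no longer change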
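The procedure can be sketched in NumPy for two one-dimensional populations. Note the assumptions: the data are synthetic, and instead of a hard blue/red assignment this sketch uses the common soft variant of EM, in which every point receives a probability (responsibility) of being blue or red and the parameters are recalculated as weighted averages. The mixing weight weight_b is also an assumption, initialised at 0.5.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal density N(mu, sigma) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

rng = np.random.default_rng(0)

# Synthetic sample: blue ~ N(-2, 0.5), red ~ N(3, 1) (illustrative only).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])

# Step 1: choose initial parameter values.
mu_b, sigma_b = x.min(), 1.0
mu_r, sigma_r = x.max(), 1.0
weight_b = 0.5  # prior probability of blue

for _ in range(100):
    # Step 2 (E-step): probability that each point belongs to blue or red.
    like_b = weight_b * normal_pdf(x, mu_b, sigma_b)
    like_r = (1.0 - weight_b) * normal_pdf(x, mu_r, sigma_r)
    resp_b = like_b / (like_b + like_r)
    resp_r = 1.0 - resp_b

    # Step 3 (M-step): recalculate the parameters as weighted averages.
    mu_b = np.sum(resp_b * x) / np.sum(resp_b)
    mu_r = np.sum(resp_r * x) / np.sum(resp_r)
    sigma_b = np.sqrt(np.sum(resp_b * (x - mu_b) ** 2) / np.sum(resp_b))
    sigma_r = np.sqrt(np.sum(resp_r * (x - mu_r) ** 2) / np.sum(resp_r))
    weight_b = resp_b.mean()

print(f"blue: mu = {mu_b:.2f}, sigma = {sigma_b:.2f}")
print(f"red:  mu = {mu_r:.2f}, sigma = {sigma_r:.2f}")
```

For real work, sklearn.mixture.GaussianMixture implements the same idea for any number of components and dimensions.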
19
Gaussian Mixture Models
GMM BitCoin.ipynb
20
Spectral Clustering
This is an example of a dataset, where Kmeans
will struggle. The points are drawn from two
concentric circles with some noise
added. Kmeans will find two clusters, but the
result will be flawed because the algorithm
makes use of Euclidean Distances.
In spectral clustering, data points are treated as
nodes of a graph. Thus, spectral clustering is a
graph partitioning problem. The nodes are then
mapped to a low-dimensional space that can be
easily segregated to form clusters. No assumption
is made about the shape/form of the clusters.
The goal of spectral clustering is to cluster data
that is connected but not necessarily compact or
clustered within convex boundaries.
21
Spectral Clustering
22
Spectral Clustering
Spectral Clustering vs. Kmeans
◦ Compactness/Grouping (Kmeans)
Points that lie close to each other
fall in the same cluster and are
compact around the cluster center.
◦ Connectivity (Spectral) Points that
are connected or immediately next
to each other are put in the same
cluster. Even if the distance
between 2 points is small, if they
are not connected, they are not
clustered together.
Examples of DataSets where Kmeans clustering will not work, but
where Spectral Clustering works
23
Spectral Clustering
One can use the scikit-learn API
(http://bit.ly/3mY6G0d).
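A minimal sketch of that API on the concentric-circles problem, compared against K-means. The dataset parameters, the nearest-neighbours affinity and n_neighbors=10 are assumptions for illustration; the adjusted Rand index measures agreement with the true circle labels (1 = perfect, about 0 = random).

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two noisy concentric circles (synthetic data).
X, y_true = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# Spectral clustering on a k-nearest-neighbours graph of the points.
spectral = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
)
spectral_labels = spectral.fit_predict(X)

# K-means on the raw coordinates, for comparison.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ari_spectral = adjusted_rand_score(y_true, spectral_labels)
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
print(f"spectral ARI: {ari_spectral:.2f}, k-means ARI: {ari_kmeans:.2f}")
```

scikit-learn may warn that the neighbours graph is not fully connected; for this dataset that is expected, since the two circles form separate components.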
24
Spectral Clustering : Graphs
A graph is a set of nodes (aka vertices) that are connected with edges. There are several ways to construct graphs:
- Relationships (Person A is a <friend of> person B)
- Correlation between two financial instruments A/B
- ...
In this particular case, the K-Nearest Neighbours algorithm can be used to specify a link between two nodes.
A parameter k is fixed beforehand. For two vertices u and v, an edge is directed from u to v only if v is among the k-nearest
neighbours of u. Note that this leads to the formation of a weighted and directed graph, because u->v does not always imply v->u.
To make this graph undirected, one of the following approaches is followed:
◦ Direct an edge from u to v and from v to u if either v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.
◦ Direct an edge from u to v and from v to u if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v.
Sklearn has lots of possible affinity metrics. The default choice is the Radial Basis Function (RBF) kernel.
More Info : https://bit.ly/38JMp9s
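The k-nearest-neighbours construction and the two ways of making the graph undirected can be sketched with NumPy. The points, k = 2 and the variable names are assumptions for illustration, and the graph is kept unweighted (True/False entries) for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(8, 2))   # 8 random points in the plane (illustrative)
n, k = len(points), 2

# Pairwise Euclidean distances; a point is never its own neighbour.
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)

# Directed graph: edge u -> v if v is among the k nearest neighbours of u.
neighbours = np.argsort(dist, axis=1)[:, :k]
directed = np.zeros((n, n), dtype=bool)
directed[np.repeat(np.arange(n), k), neighbours.ravel()] = True

# OR rule: keep the edge if either direction exists.
either = directed | directed.T
# AND rule: keep the edge only if both directions exist (mutual neighbours).
mutual = directed & directed.T

print("directed edges:", directed.sum())
print("undirected (OR) entries:", either.sum())
print("undirected (AND) entries:", mutual.sum())
```

The OR rule gives a denser graph than the AND rule, so the choice affects how easily separate groups become connected.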
25
Example
Spectral Clustering: Graphs
This graph has 10 nodes and 12 edges. It also has
two connected components {0,1,2,8,9} and
{3,4,5,6,7}. A connected component is a maximal
subgraph of nodes which all have paths to the
rest of the nodes in the subgraph.
26
Spectral Clustering : Adjacency Matrix
The graph on the previous page can be represented as a
matrix A. This is an adjacency matrix. If the edges were
weighted, the weights of the edges would go in this matrix
instead of just 1s and 0s.
Since our graph is undirected, the entries for row i, col j will be
equal to the entry at row j, col i.
27
Spectral Clustering : Degree Matrix of A
The degree of a node is how many edges connect to it. In a
directed graph we could talk about in-degree and out-degree,
but in this example we just have degree since the edges go
both ways.
We can get the degree by taking the sum of the node’s row in
the adjacency matrix.
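The adjacency and degree matrices can be sketched as follows. The exact edges of the example graph are not given in the text, so the edge list below is hypothetical: it was chosen only to match the stated 10 nodes, 12 edges and the two connected components {0,1,2,8,9} and {3,4,5,6,7}.

```python
import numpy as np

# Hypothetical edge list (an assumption, not the edges of the slide's figure).
edges = [
    (0, 1), (0, 2), (1, 2), (2, 8), (8, 9), (0, 9),   # component {0,1,2,8,9}
    (3, 4), (3, 5), (4, 5), (5, 6), (6, 7), (4, 7),   # component {3,4,5,6,7}
]
n = 10

# Adjacency matrix of an undirected graph: symmetric, 1 where an edge exists.
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u, v] = 1
    A[v, u] = 1

# Degree of each node = sum of its row in the adjacency matrix.
degrees = A.sum(axis=1)
D = np.diag(degrees)

print(degrees)
```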
28
Spectral Clustering: Laplacian Matrix
Starting from the Degree Matrix (D) and the Adjacency
Matrix (A), we can construct a Laplacian Matrix: L = D - A.
29
Spectral Clustering: Laplacian Matrix
To identify good clusters, the Laplacian L should be
approximately block-diagonal after re-arranging the nodes, with
each block defining a cluster. If we have 3 major
clusters (C1, C2, C3), we would expect 3 blocks.
30
We see that when the graph is completely disconnected, all ten of our eigenvalues are 0.
As we add edges, some of our eigenvalues increase. In fact, the number of 0 eigenvalues corresponds
to the number of connected components in the graph.
31
As that final edge is added, connecting the two components
into one, all of the eigenvalues but one have been lifted.
The number of zero eigenvalues corresponds to the
number of connected components (in this case this is 1,
there is only one graph left).
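The link between zero eigenvalues and connected components can be checked numerically. The sketch below uses the simple Laplacian (degree matrix minus adjacency matrix) on a hypothetical 10-node graph; the edge list and the bridging edge (2, 3) are assumptions for illustration, not the edges of the slides' figures.

```python
import numpy as np

def laplacian(edges, n):
    """Simple graph Laplacian: degree matrix minus adjacency matrix."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return np.diag(A.sum(axis=1)) - A

def count_zero_eigenvalues(L, tol=1e-9):
    """Number of eigenvalues that are numerically zero."""
    return int(np.sum(np.abs(np.linalg.eigvalsh(L)) < tol))

n = 10
# Hypothetical edges forming two components {0,1,2,8,9} and {3,4,5,6,7}.
two_components = [
    (0, 1), (0, 2), (1, 2), (2, 8), (8, 9), (0, 9),
    (3, 4), (3, 5), (4, 5), (5, 6), (6, 7), (4, 7),
]

zeros_no_edges = count_zero_eigenvalues(laplacian([], n))
zeros_two_components = count_zero_eigenvalues(laplacian(two_components, n))
# Adding one edge between the components joins them into a single graph.
zeros_joined = count_zero_eigenvalues(laplacian(two_components + [(2, 3)], n))

print(zeros_no_edges, zeros_two_components, zeros_joined)
```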
32
Spectral Clustering: Eigenvalues of
Laplacian
The first non-zero eigenvalue is not
very large: we are not too far away from having 2 connected
components.
33
Spectral Clustering: Gaps in the
Eigenvalues
34
Spectral Clustering: Gaps in the
Eigenvalues
The second-smallest eigenvalue of the Laplacian is called the Fiedler value, and the
corresponding eigenvector is the Fiedler vector.
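A small sketch of how the Fiedler vector can split a graph in two. The graph is a hypothetical example chosen for illustration: two triangles {0,1,2} and {3,4,5} joined by a single edge (2, 3). The sign of each entry of the Fiedler vector is used as the cluster label.

```python
import numpy as np

# Hypothetical graph: two triangles connected by one bridging edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A   # simple Laplacian

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler_value = eigenvalues[1]
fiedler_vector = eigenvectors[:, 1]

# Split the nodes by the sign of their Fiedler-vector entry.
group_a = np.flatnonzero(fiedler_vector > 0).tolist()
group_b = np.flatnonzero(fiedler_vector <= 0).tolist()

print(f"Fiedler value: {fiedler_value:.4f}")
print("groups:", group_a, group_b)
```

The small Fiedler value reflects that the graph is only one edge away from falling apart into two components.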
35
Spectral Clustering: Whole Procedure
Spectral Clustering.ipynb
36
Spectral Clustering: Whole Procedure
Let’s go back to our problem where the data
is not in a graph. The points are drawn from
two concentric circles with some noise
added. We’d like an algorithm to be able to
cluster these points into the two circles that
generated them. Kmeans will not do the job.
Let’s see what Spectral Clustering does.
37
Spectral Clustering: Whole Procedure
38
Spectral Clustering: Whole Procedure
39
Spectral Clustering: Whole Procedure
Adjacency Matrix
Degree Matrix
Simple Laplacian
40
Spectral Clustering: Whole Procedure
41
Dimensionality Reduction
42
Principal Component Analysis (PCA)
PCA is commonly used for dimensionality reduction by
projecting each data point onto only the first few principal
components to obtain lower-dimensional data while
preserving as much of the data's variation as possible.
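The projection can be sketched directly with NumPy, via the eigenvectors of the covariance matrix. The data are synthetic: three correlated variables driven mainly by one latent factor, an assumption made so that a single component captures most of the variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-dimensional data driven by one latent factor plus small noise.
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))

# Centre the data, then diagonalise the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components by decreasing variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
explained_ratio = eigenvalues / eigenvalues.sum()

# Project every data point onto the first principal component only.
X_reduced = X_centered @ eigenvectors[:, :1]

print("explained variance ratio:", np.round(explained_ratio, 3))
print("reduced shape:", X_reduced.shape)
```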
43
Principal Component Analysis (PCA)
44
Principal Component Analysis (PCA)
45
Principal Component Analysis (PCA)
46
Principal Component Analysis (PCA)
Centering Data
In our previous example the data was centered. The PCA model in Sklearn will center the data but will not scale the data.
Incremental PCA
The PCA model requires that all of the data to be processed fits in main memory. The IncrementalPCA object uses a
different form of processing and allows for partial computations which yield an almost exact match of the results of PCA while
processing the data in a minibatch fashion. This approach updates the explained_variance_ratio_ incrementally, batch
after batch.
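Both points can be checked with a short sketch on synthetic data. The data (two latent factors, five variables, a constant offset of 10) and the split into 10 batches are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)

# Synthetic, deliberately non-centred data: 2 latent factors, 5 variables.
latent = rng.normal(size=(1000, 2))
loadings = rng.normal(size=(2, 5))
X = latent @ loadings + 0.1 * rng.normal(size=(1000, 5)) + 10.0

# PCA centres the data itself: the estimated mean is stored in mean_.
pca = PCA(n_components=2).fit(X)

# IncrementalPCA sees the data one mini-batch at a time.
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)

max_gap = np.max(
    np.abs(pca.explained_variance_ratio_ - ipca.explained_variance_ratio_)
)
print("PCA:           ", np.round(pca.explained_variance_ratio_, 4))
print("IncrementalPCA:", np.round(ipca.explained_variance_ratio_, 4))
print(f"largest difference: {max_gap:.5f}")
```

If the variables are on very different scales, a scaling step such as StandardScaler would have to be applied separately before the PCA.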
47
PCA : A Classical Case Study
A yield curve, in economics and finance, is a curve that shows the interest rate associated with different
contract lengths for a particular debt instrument (e.g., a treasury bill). It summarizes the relationship between
the term (time to maturity) of the debt and the interest rate (yield) associated with that term.
A yield curve is typically upward sloping; as the time to maturity increases, so does the associated interest
rate. The reason for that is that debt issued for a longer term generally carries greater risk because of
the greater likelihood of inflation or default in the long run. Therefore, investors (debt holders) usually require
a higher rate of return (a higher interest rate) for longer-term debt.
48
PCA : A Classical Case Study
A yield curve for a given currency has many
tenors. The combination of all the tenors on
one single day constitutes the yield curve
for that day.
Every point on the yield curve is called a
tenor. Each of these points can be
considered a (correlated) random
variable.
Each of these variables has a corresponding
time series.
PCA_EUR.ipynb
49
PCA : A Classical Case Study
13-Yr History of the EUR
interest rate curves
50
PCA : A Classical Case Study
The three primary yield curve movements are of importance to the portfolio manager.
These are changes in:
- level
- slope
- curvature
of the yield curve.
Change in the overall level: This is called a parallel shift of the yield curve. Here the
interest rates across all maturities change by the same number of basis points (1 basis
point = 0.01%).
Change in the slope: The slope of the yield curve can become steeper or flatter. In a steepening, the gap between the
interest rates on short-term tenors and long-term tenors increases. The opposite
movement is a flattening of the yield curve.
51
PCA : A Classical Case Study
In this case study, the curve is defined by 8
tenors. In practice, there are over 20 points
making up the yield curve. Each tenor has a
corresponding time series.
52
PCA : A Classical Case Study
The first two eigenvectors of the covariance matrix
seem to tell the complete story
53
PCA : A Classical Case Study
The first two eigenvectors can be projected onto
the (2Y, 10Y) plane.
What is the intuition behind the first two
components?
54
PCA : A Classical Case Study
Which basic movement is
captured by the third
eigenvector?
55
PCA : A Classical Case Study
The impact of the credit crisis in 2008 (steepening)
and of the financial crisis in 2001 is visible
56
PCA : A Classical Case Study
The transformed dataset (Data*)
has been constructed from the
PCA.
[Figure: the original data (n observations, dimension = p) is mapped by the PCA to Data* (dimension = p)]
57
PCA : A Classical Case Study
58
PCA : Other Examples
◦ Stock Market : http://bit.ly/2X3ugxQ
◦ Rainfall prediction : http://bit.ly/2KJ009e
◦ Air pollution : http://bit.ly/3aYasEl
59