
Data Analytics
Part 1
MODERN DATA ANALYTICS
[G0Z39A]
PROF. DR. IR. JAN DE SPIEGELEER
Contents
◦ Tools
◦ Scikit-Learn
◦ Three Case Studies
◦ Gaussian Mixtures
◦ Spectral Clustering
◦ PCA
Tools

3
Tools
Supervised Learning
◦ Classification
◦ Regression

Unsupervised Learning
◦ Clustering
◦ Dimensionality Reduction

Visualisation (Matplotlib, Seaborn, Plotly)

Deep Learning (PyTorch, TensorFlow)

4
Supervised vs. Unsupervised Learning
◦ In a supervised learning model, the algorithm learns on a labeled dataset (X, y); the labels provide an answer key that the
algorithm can use to evaluate its accuracy on the training data.
◦ A model is chosen that maps X -> y
◦ This model has parameters (e.g. the coefficients of a regression)
◦ The training of a machine learning model = parameter optimisation.
◦ The optimisation minimises a loss function.
◦ Techniques such as Gradient Descent are used (also in Deep Learning); see the sketch below.
◦ An unsupervised model, in contrast, works on unlabeled data that the algorithm tries to make sense of by extracting
features and patterns on its own.
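As a minimal sketch of this training loop (the data, the one-parameter linear model, and the learning rate below are made up for illustration), gradient descent on a squared-error loss looks as follows:

```python
import numpy as np

# Hypothetical labeled data (X, y), generated by y = 3x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
y = 3.0 * X + rng.normal(0, 0.1, 200)

w = 0.0    # model parameter to be learned
lr = 0.1   # learning rate (a hyper-parameter)

for epoch in range(100):
    y_hat = w * X                        # model prediction X -> y
    loss = np.mean((y_hat - y) ** 2)     # squared-error loss function
    grad = np.mean(2 * (y_hat - y) * X)  # gradient of the loss w.r.t. w
    w -= lr * grad                       # gradient-descent update

print(f"learned w = {w:.3f}, final loss = {loss:.5f}")
```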

5
Structured vs. Unstructured Data
Another dimension to split the data-analytics problems is the type of data. Broadly speaking there are two categories
◦ Structured Data : Structured data is highly-organized and formatted in a way so it's easily searchable.
◦ Structured data is most often categorized as quantitative data, and it's the type of data most of us are used to working
with. Structured data fits neatly within fixed fields and columns in relational databases and spreadsheets.
◦ Examples of structured data include names, dates, addresses, credit card numbers, stock information, geolocation,
and more.
◦ Structured data is highly organized and easily understood by machine language. Those working within relational
databases can input, search, and manipulate structured data relatively quickly. This is the most attractive feature of
structured data.
◦ Unstructured Data : Unstructured data is difficult to deconstruct because it has no pre-defined model, meaning it
cannot be organized in relational databases. Here, NoSQL databases are used and specific tools (e.g.
Natural Language Processing, web scraping) are required.

6
Sci-Kit Learn
◦ Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent
interface in Python.
◦ The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn.
◦ The library is focused on modeling data. It is not focused on loading, manipulating and summarizing
data. For these features, refer to NumPy and Pandas.
◦ Scikit-learn also provides us with a limited range of sample data sets.
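A minimal sketch of that consistent fit/predict interface, using one of the bundled sample data sets (the choice of estimator and data set here is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# One of scikit-learn's sample data sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every estimator follows the same interface: fit, then predict/score
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```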

7
PyTorch
Deep Learning open-source software primarily developed by Facebook's AI Research lab.
◦ Tensors are similar to NumPy’s ndarrays, with the addition that Tensors can also be used on a GPU
to accelerate computing.
◦ Central to all neural networks in PyTorch is the Autograd package. The autograd package provides
automatic differentiation for all operations on Tensors.

Remember: Optimisation is about minimizing a loss function. Gradients help us to understand how
a particular parameter impacts the value of the loss function.
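A minimal sketch of both ideas (tensors and autograd); the numbers are arbitrary:

```python
import torch

# A tensor behaves much like a NumPy ndarray (and can be moved to a GPU with .to("cuda"))
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# A toy "loss": the sum of squares of x
loss = (x ** 2).sum()

# Autograd computes d(loss)/dx for us
loss.backward()
print(x.grad)   # tensor([2., 4., 6.]) = 2*x
```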

8
TensorFlow
Deep Learning library developed by Google in 2015 (1 year older than PyTorch)
Larger user base than PyTorch (https://bit.ly/3uSMIJg)
Major release step from TensorFlow 1.0 to TensorFlow 2.0
With Keras, you have an easier entry point to learn TensorFlow.
Less “Python-esque” than PyTorch.
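A minimal Keras sketch of that easier entry point (the layer sizes and training data below are placeholders, not part of the course material):

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 100 samples, 4 features, 3 classes
X = np.random.rand(100, 4).astype("float32")
y = np.random.randint(0, 3, size=100)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
```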

9
Cross Validation

10
Cross Validation
Cross-validation is a resampling procedure used to evaluate
machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to
the number of groups that a given data sample is to be split
into. As such, the procedure is often called k-fold cross-
validation. When a specific value for k is chosen, it may be
used in place of k in the reference to the model, such as k=10
becoming 10-fold cross-validation.

Each data point is used exactly once as a member of the test set
and k-1 times as part of the training set.

Sklearn API :
- KFold http://bit.ly/2WZ6oLK
- Model Selection : http://bit.ly/2L1WMgF
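A short sketch of k-fold cross-validation with the scikit-learn API linked above (k=10; the estimator and data set are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# k = 10 -> 10-fold cross-validation; shuffle the sample before splitting
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```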

11
Cross Validation : variations
Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each
fold has the same proportion of observations with a given categorical value, such as the class
outcome value. This is called stratified cross-validation.
Repeated: This is where the k-fold cross-validation procedure is repeated n times, where
importantly, the data sample is shuffled prior to each repetition, which results in a different split
of the sample.
Nested: This is where k-fold cross-validation is performed within each fold of cross-validation,
often to perform hyperparameter tuning during model evaluation. This is called nested cross-
validation or double cross-validation.
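Hedged sketches of the three variations with scikit-learn's model_selection module (the estimator and the small hyperparameter grid for the nested example are made up for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (StratifiedKFold, RepeatedKFold,
                                     GridSearchCV, cross_val_score)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Stratified: each fold keeps the same class proportions as the full sample
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(SVC(), X, y, cv=strat_cv).mean())

# Repeated: 5-fold CV repeated 3 times, reshuffling before each repetition
rep_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
print(cross_val_score(SVC(), X, y, cv=rep_cv).mean())

# Nested: an inner CV tunes hyperparameters, an outer CV evaluates the tuned model
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
print(cross_val_score(inner, X, y, cv=strat_cv).mean())
```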

12
Unsupervised Learning
CLUSTERING & PCA

13
Clustering
Kmeans
Mini-Batch KMeans
Gaussian Mixture Models
Spectral Clustering

14
K Means
K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms. Typically,
unsupervised algorithms make inferences from datasets
using only input vectors, without referring to known or
labelled outcomes.
The objective of K-means is to group similar data points
together and discover underlying patterns. To achieve this
objective, K-means looks for a fixed number (k) of clusters in
a dataset.
In the chapter on data pre-processing, K-means has been
covered in depth (credit spread example).

The main issue is the fact that the complete data set has to be
stored in the memory of the computer.

Sklearn API : http://bit.ly/3n0RBef
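A minimal sketch of the scikit-learn API linked above, on synthetic blob data (the data and the choice k=3 are made up for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with 3 underlying groups
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)
print(km.inertia_)   # within-cluster sum of squares
```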

15
Mini-Batch KMeans
The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches. Mini-batches are subsets of the
input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of
computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time
of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
MiniBatchKMeans converges faster than KMeans, but the quality of the results is reduced. In practice this difference in
quality can be quite small, as shown in the example below.
Example notebook :
http://bit.ly/3hwYCCc
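A short sketch comparing the two estimators (the sample size and batch size are arbitrary choices; the example notebook above is the reference):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MiniBatchKMeans

X, _ = make_blobs(n_samples=20000, centers=5, random_state=0)

full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=0).fit(X)

# The inertia of the mini-batch solution is typically only slightly worse
print(full.inertia_, mini.inertia_)
```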

16
Gaussian Mixture Models
Imagine that the 10 data points are sampled from either a blue or a red population (curve). Each
population can be described by a normal density function ~N(μ, σ). Our observation is the
result of a mixture of 2 Gaussian distributions.

For a single population, it would be straightforward to determine the two parameters μ, σ from its samples.

17
Gaussian Mixture Models
For the blue population: N(μ_b, σ_b)
For the red population: N(μ_r, σ_r)
We will use the EM algorithm (Expectation Maximisation) to
determine these population parameters.

It is a chicken-and-egg problem:
◦ To determine μ_b, σ_b, μ_r, σ_r we need to know which points x_i belong to blue and which to red
◦ To determine if x_i is blue or red, we need to know μ_b, σ_b, μ_r, σ_r

For each data point x_i we can determine if it belongs to blue or red using (Eq GM 1):

P(x_i | b) = 1/(√(2π) σ_b) · exp(−½ ((x_i − μ_b)/σ_b)²)
P(x_i | r) = 1/(√(2π) σ_r) · exp(−½ ((x_i − μ_r)/σ_r)²)
18
Gaussian Mixture Models
Procedure:
1. Choose initial values for μ_b, σ_b, μ_r, σ_r
2. Determine for each of the points the corresponding
population (blue/red) using the corresponding density
functions
3. After assigning the points to blue/red, recalculate
μ_b, σ_b, μ_r, σ_r

Iterate between steps 2 and 3, until convergence is reached.

Gaussian_Mixtures.ipynb (requires some knowledge of PCA)
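A minimal 1-D sketch of this EM iteration for two Gaussians. The data and starting values below are invented for illustration, and the E-step uses the usual soft responsibilities rather than a hard blue/red assignment; the notebook above is the reference.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Hypothetical 1-D sample drawn from a "blue" and a "red" Gaussian
x = np.concatenate([rng.normal(-2.0, 0.7, 60), rng.normal(3.0, 1.2, 40)])

# Step 1: initial guesses for the population parameters
mu_b, sigma_b, mu_r, sigma_r = -1.0, 1.0, 1.0, 1.0
w_b = w_r = 0.5  # mixing weights

for _ in range(100):
    # Step 2 (E-step): responsibility of each population for each point (Eq GM 1)
    p_b = w_b * norm.pdf(x, mu_b, sigma_b)
    p_r = w_r * norm.pdf(x, mu_r, sigma_r)
    r_b = p_b / (p_b + p_r)
    r_r = 1.0 - r_b
    # Step 3 (M-step): re-estimate the parameters from the (soft) assignments
    mu_b, mu_r = np.average(x, weights=r_b), np.average(x, weights=r_r)
    sigma_b = np.sqrt(np.average((x - mu_b) ** 2, weights=r_b))
    sigma_r = np.sqrt(np.average((x - mu_r) ** 2, weights=r_r))
    w_b, w_r = r_b.mean(), r_r.mean()

print(mu_b, sigma_b, mu_r, sigma_r)
```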

19
Gaussian Mixture Models

GMM BitCoin.ipynb

20
Spectral
Clustering
This is an example of a dataset where KMeans
will struggle. The points are drawn from two
concentric circles with some noise
added. KMeans will find two clusters, but the
result will be flawed because the algorithm
makes use of Euclidean distances.
In spectral clustering, data points are treated as
nodes of a graph. Thus, spectral clustering is a
graph partitioning problem. The nodes are then
mapped to a low-dimensional space that can be
easily segregated to form clusters. No assumption
is made about the shape/form of the clusters.
The goal of spectral clustering is to cluster data
that is connected but not necessarily compact or
clustered within convex boundaries.

21
Spectral Clustering

22
Spectral
Clustering
Spectral Clustering vs. Kmeans
◦ Compactness/Grouping (Kmeans)
Points that lie close to each other
fall in the same cluster and are
compact around the cluster center.
◦ Connectivity (Spectral) Points that
are connected or immediately next
to each other are put in the same
cluster. Even if the distance
between 2 points is small, if they
are not connected, they are not
clustered together.
Examples of DataSets where Kmeans clustering will not work, but
where Spectral Clustering works

23
Spectral Clustering
One can use the Sci-kit learn API
(http://bit.ly/3mY6G0d).

This API follows a procedure (http://bit.ly/38JMp9s):

1. Construct a Graph
2. Calculate the Adjacency Matrix (A)
3. Calculate the Degree Matrix (D)
4. Calculate the Laplacian Matrix (L)
5. Calculate the Eigenvalues and use the Gap to
determine the number of clusters

24
Spectral Clustering : Graphs
A graph is a set of nodes (aka vertices) that are connected with edges. There are several ways to construct graphs:
- Relationships (Person A is a <friend of> person B)
- Correlation between two financial instruments A/B
- ...

In this particular case, the K-Nearest Neighbours algorithm can be used to specify a link between two nodes.
A parameter k is fixed beforehand. For two vertices u and v, an edge is directed from u to v only if v is among
the k-nearest neighbours of u. Note that this leads to the formation of a weighted and directed graph, because
u->v does not always imply v->u.

To make this graph undirected, one of the following approaches is followed:
◦ Direct an edge from u to v and from v to u if either v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.
◦ Direct an edge from u to v and from v to u if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v.

Sklearn has lots of possible affinity metrics. The default choice is the Radial Basis Function (RBF) kernel.

More Info : https://bit.ly/38JMp9s

25
Spectral Clustering: Graphs (Example)
This graph has 10 nodes and 12 edges. It also has
two connected components {0,1,2,8,9} and
{3,4,5,6,7}. A connected component is a maximal
subgraph of nodes which all have paths to the
rest of the nodes in the subgraph.
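A hedged sketch of such a graph with networkx. The edge list below is invented to match the description (10 nodes, 12 edges, components {0,1,2,8,9} and {3,4,5,6,7}); it is not read from the figure.

```python
import networkx as nx

# Hypothetical edge list consistent with the slide's description
edges = [(0, 1), (0, 2), (0, 8), (0, 9), (1, 2), (8, 9),   # component {0,1,2,8,9}
         (3, 4), (3, 5), (4, 5), (5, 6), (5, 7), (6, 7)]   # component {3,4,5,6,7}

G = nx.Graph()
G.add_edges_from(edges)

print(G.number_of_nodes(), G.number_of_edges())   # 10 nodes, 12 edges
print(list(nx.connected_components(G)))           # the two connected components

# Adjacency matrix A (0/1 entries for an unweighted, undirected graph)
A = nx.adjacency_matrix(G, nodelist=sorted(G.nodes())).toarray()
print(A)
```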

26
Spectral Clustering : Adjacency Matrix
The graph on the previous page can be represented as a
matrix A. This is an adjacency matrix. If the edges were
weighted, the weights of the edges would go in this matrix
instead of just 1s and 0s.

Since our graph is undirected, the entry at row i, col j is
equal to the entry at row j, col i.

The diagonal of this matrix contains all zeroes, since none of
the nodes have edges to themselves.

27
Spectral Clustering : Degree Matrix of A
The degree of a node is how many edges connect to it. In a
directed graph we could talk about in-degree and out-degree,
but in this example we just have degree, since the edges go
both ways.

We can get the degree by taking the sum of the node’s row in
the adjacency matrix.

The degree matrix is a diagonal matrix where the value at
entry (i, i) is the degree of node i.

Nodes 0 and 5 are the nodes with the highest degree.

28
Spectral Clustering: Laplacian Matrix
Starting from the Degree Matrix (D) and the Adjacency
Matrix (A), we can construct a Laplacian Matrix (L).

There are different possibilities:

◦ Simple Laplacian L = D − A
◦ Normalized Laplacian L_N = D^(−1/2) · L · D^(−1/2)
◦ Generalized Laplacian L_G = D^(−1) · L
◦ Relaxed Laplacian L_ρ = L − ρD

Using L: the Laplacian’s diagonal is the degree of our
nodes, and the off-diagonal entries are the negative edge
weights.

This is a 10x10 matrix, with 10 eigenvalues and
corresponding eigenvectors.
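A short NumPy sketch of the simple Laplacian and its eigenvalues, starting from any 0/1 adjacency matrix A (the toy 4-node graph below is made up; the earlier networkx adjacency matrix works just as well):

```python
import numpy as np

def laplacian_spectrum(A):
    """Simple Laplacian L = D - A and its sorted eigenvalues/eigenvectors."""
    D = np.diag(A.sum(axis=1))            # degree matrix
    L = D - A                             # simple Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # L is symmetric -> eigh
    return L, eigvals, eigvecs

# Toy graph with two components {0,1} and {2,3}
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
L, eigvals, _ = laplacian_spectrum(A)

print(eigvals)                                  # two zero eigenvalues -> two components
print(int(np.sum(np.isclose(eigvals, 0.0))))    # number of connected components
```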

29
Spectral Clustering: Laplacian Matrix
To identify good clusters, the Laplacian L should be
approximately block-diagonal after re-ordering rows and
columns, with each block defining a cluster. If we have 3 major
clusters (C1, C2, C3), we would expect 3 blocks.

30
We see that when the graph is completely disconnected, all ten of our eigenvalues are 0.
As we add edges, some of our eigenvalues increase. In fact, the number of 0 eigenvalues corresponds
to the number of connected components in the graph.

Spectral Clustering: Eigenvalues of Laplacian

31
As that final edge is added, connecting the two components
into one, all of the eigenvalues but one have been lifted.
The number of zero eigenvalues corresponds to the
number of connected components (in this case 1:
there is only one component left).

Spectral Clustering: Eigenvalues of Laplacian

32
Spectral Clustering: Eigenvalues of
Laplacian
The first non-zero eigenvalue is not very large: we are not
too far away from having 2 connected components.

One component (there is one zero eigenvalue)

33
Spectral Clustering: Gaps in the
Eigenvalues

There is a large gap between the 4th and 5th eigenvalues =>
very unlikely to have 5 clusters.

We are also not too far away from having 4 connected
components (clusters).

34
Spectral Clustering: Gaps in the
Eigenvalues
The second eigenvalue is called the Fiedler value, and the
corresponding eigenvector is the Fiedler vector.

The Fiedler value approximates the minimum graph cut needed to
separate the graph into two connected components.

Recall that if our graph already had two connected components,
then the Fiedler value would be 0. Each value in the Fiedler vector
tells us which side of the cut that node belongs to.
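A hedged sketch of using the Fiedler vector to split a connected graph in two; the adjacency matrix is a made-up 5-node path graph 0-1-2-3-4:

```python
import numpy as np

# Adjacency matrix of a hypothetical 5-node path graph 0-1-2-3-4
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1

L = np.diag(A.sum(axis=1)) - A          # simple Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

fiedler_value = eigvals[1]              # second-smallest eigenvalue
fiedler_vector = eigvecs[:, 1]          # corresponding eigenvector

# The sign of each entry tells us which side of the cut the node falls on
print(fiedler_value)
print(np.where(fiedler_vector >= 0, "side A", "side B"))
```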

35
Spectral Clustering: Whole Procedure

Spectral Clustering.ipynb

36
Spectral Clustering: Whole Procedure
Let’s go back to our problem where the data
is not in a graph. The points are drawn from
two concentric circles with some noise
added. We’d like an algorithm to be able to
cluster these points into the two circles that
generated them. KMeans will not do the job;
let’s see what Spectral Clustering does.

We will construct the affinity matrix using
5-Nearest Neighbours.
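A minimal scikit-learn sketch of exactly this setup (two noisy concentric circles, affinity built from 5 nearest neighbours); the noise level and sample size are arbitrary:

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, SpectralClustering

# Two concentric circles with some noise added
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

# KMeans uses Euclidean distances and will cut straight through the rings
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering with a 5-nearest-neighbours affinity recovers the two rings
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=5, random_state=0)
sc_labels = sc.fit_predict(X)
```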

37
Spectral Clustering: Whole Procedure

Kmeans fails ->

38
Spectral Clustering: Whole Procedure

The kneighbors_graph constructs the adjacency matrix A.
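A short sketch of building the matrices step by step with kneighbors_graph, symmetrising the k-NN graph with the OR rule from the earlier slide (data choices as in the previous sketch):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.neighbors import kneighbors_graph

X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

# 5-nearest-neighbours graph; a sparse, directed 0/1 adjacency matrix
knn = kneighbors_graph(X, n_neighbors=5, include_self=False).toarray()

# Make it undirected: keep an edge if it exists in either direction (OR rule)
A = np.maximum(knn, knn.T)

D = np.diag(A.sum(axis=1))      # degree matrix
L = D - A                       # simple Laplacian
eigvals = np.linalg.eigvalsh(L)
print(eigvals[:10])             # look at the (near-)zero eigenvalues and the gap
```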

39
Spectral Clustering: Whole Procedure

Adjacency Matrix

Degree Matrix

Simple Laplacian

40
Spectral Clustering: Whole Procedure

41
Dimensionality
Reduction

42
Principal Component Analysis (PCA)
PCA is commonly used for dimensionality reduction by
projecting each data point onto only the first few principal
components to obtain lower-dimensional data while
preserving as much of the data's variation as possible.

It can be shown that the principal components are
eigenvectors of the data's covariance matrix. Thus, the
principal components are often computed by
eigendecomposition of the data covariance matrix.

In regression analysis, the larger the number of explanatory
variables allowed, the greater the chance of overfitting the
model, producing conclusions that fail to generalise to other
datasets. One approach, especially when there are strong
correlations between different possible explanatory variables,
is to reduce them to a few principal components and then run
the regression against them, a method called principal
component regression.
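A minimal scikit-learn sketch of the idea (the data set and the number of components are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4-dimensional data

pca = PCA(n_components=2)              # keep the first two principal components
X_reduced = pca.fit_transform(X)       # project the data onto those components

print(pca.explained_variance_ratio_)   # share of the variance kept by each component
print(pca.components_)                 # the principal axes (eigenvectors of the covariance matrix)
```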

43
Principal Component Analysis (PCA)

Project the 2-D coordinates onto one single line; hereby we drop one dimension (2D -> 1D).

pca_intro.ipynb

44
Principal Component Analysis (PCA)

The two vectors corresponding to the max and min variance
are perpendicular to each other.

45
Principal Component Analysis (PCA)

46
Principal Component Analysis (PCA)
Centering Data
In our previous example the data was centered. The PCA model in Sklearn will center the data but will not scale the data.

Incremental PCA
The PCA model requires that all of the data to be processed fits in main memory. The IncrementalPCA object uses a
different form of processing and allows for partial computations which almost exactly match the results of PCA, while
processing the data in a mini-batch fashion. This approach updates the explained_variance_ratio_ incrementally, batch
after batch.
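A hedged sketch of this mini-batch style fit with IncrementalPCA (the data and batch count are placeholders):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))      # placeholder data standing in for a set too large for memory

ipca = IncrementalPCA(n_components=3)
for batch in np.array_split(X, 20):    # feed the data in mini-batches
    ipca.partial_fit(batch)            # explained_variance_ratio_ is updated batch after batch

print(ipca.explained_variance_ratio_)
```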

47
PCA : A Classical Case Study
A yield curve, in economics and finance, is a curve that shows the interest rate associated with different
contract lengths for a particular debt instrument (e.g., a treasury bill). It summarizes the relationship between
the term (time to maturity) of the debt and the interest rate (yield) associated with that term.

A yield curve is typically upward sloping; as the time to maturity increases, so does the associated interest
rate. The reason for that is that debt issued for a longer term generally carries greater risk because of
the greater likelihood of inflation or default in the long run. Therefore, investors (debt holders) usually require
a higher rate of return (a higher interest rate) for longer-term debt.

48
PCA : A Classical Case Study
A yield curve for a given currency has many
tenors. The combination of all the tenors on
one single day constitutes the yield curve
for that day.
Every point on the yield curve is called a
tenor. Each of these points can be
considered a (correlated) random variable.
Each of these variables has a corresponding
time series.
PCA_EUR.ipynb

49
PCA : A Classical Case Study
13-Yr History of the EUR
interest rate curves

50
PCA : A Classical Case Study
The three primary yield curve movements are of importance to the portfolio manager. These are changes in:
- level
- slope
- curvature
of the yield curve.

Change in the level: This is called a parallel shift of the yield curve. Here the interest rates across all
maturities change by the same number of basis points (1 basis point = 0.01%).
Change in the slope: The slope of the yield curve can become steeper or flatter. In a steepening, the gap between
the interest rates on short-term tenors and long-term tenors increases. The opposite movement is a flattening
of the yield curve.

51
PCA : A Classical Case Study
In this case study, the curve is defined by 8
tenors. In practice, there are over 20 points
making up the yield curve. Each tenor has a
corresponding time series.

How can we reduce this multi-dimensional
data set in order to understand changes in
the overall shape of the curve: flattening,
steepening, curvature?
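A hedged sketch of how such a curve history could be fed into PCA. The DataFrame of daily rates per tenor below is hypothetical (the actual analysis is in PCA_EUR.ipynb), and PCA is run on daily changes; for real curve data the first components are commonly interpreted as level, slope and curvature movements.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical history of daily rates: one column per tenor, one row per business day
tenors = ["1Y", "2Y", "3Y", "5Y", "7Y", "10Y", "20Y", "30Y"]
rng = np.random.default_rng(0)
rates = pd.DataFrame(np.cumsum(rng.normal(0, 0.01, size=(3000, len(tenors))), axis=0),
                     columns=tenors)

# Work on daily changes of the curve rather than on the levels themselves
changes = rates.diff().dropna()

pca = PCA(n_components=3)
scores = pca.fit_transform(changes)

print(pca.explained_variance_ratio_)   # for real curves, dominated by the first two/three components
print(pd.DataFrame(pca.components_, columns=tenors, index=["PC1", "PC2", "PC3"]))
```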

52
The first two eigenvectors of the covariance matrix
seem to tell the complete story

PCA : A
Classical
Case Study

53
PCA : A Classical Case Study
The first two eigenvectors can be projected onto
the domain (2Y, 10Y).
What is the intuition behind the first two
components?

54
PCA : A Classical Case Study
Which basic movement is
captured by the third
eigenvector?

55
PCA : A Classical Case Study
The impact of the credit crisis in 2008 (steepening)
and the financial crisis in 2001 is visible.

56
PCA : A Classical Case Study
[Diagram] The original dataset Data (n observations, dimension p) is transformed via the PCA fit() into
the transformed dataset Data*, which has been constructed from the most important amount of information.

57
PCA : A Classical Case Study

58
PCA : Other Examples
◦ Stock Market : http://bit.ly/2X3ugxQ
◦ Rainfall prediction : http://bit.ly/2KJ009e
◦ Air pollution : http://bit.ly/3aYasEl

59
