I hereby authorize the student Ms. INES BOUSSELMI to submit her internship report
Signature
Dedications
First and foremost, I thank God (Allah), the Almighty, for bestowing His immeasurable
blessings on me at every step of my journey toward the successful completion of my
studies.
To My beloved mother Sihem
To the one person who has supported and pushed me since day one and has instilled in me
a passion for learning, words fail to describe my gratitude and love.
I always knew deep down that you wanted to see me achieve what life has denied you.
Here I am today making our dream come true.
To my beloved Father Fathi
To my hero and role model who made everything possible, I will always be most grateful
to you since everything you have done in life is for us.
To my siblings Nacer and Ameni
You are the essence of my life; I would not be me without you. I wish you all the success in
the world.
To my uncle Yassine
Your support and help have been critical to my success.
To Ms. Lilia Atig
Your considerable efforts to help me have shed light on a very dark time in my life and for
that, I owe you my deepest gratitude.
To my friends Yosr and Bchira
I would like to extend my thanks to you for all that you have done to guide and encourage
me during my internship.
I would like to express my gratitude to my teachers and mentors who have guided me
through my studies especially Mr. Zakaria Jaraya, Mrs. Ikram Laataoui, Mrs. Jalila
Mouelhi, and Mr. Raouf Ben Kileni.
Last, I wish to express my sincere thanks to all my family members and friends for their
timely help, moral support and encouragement.
Thanks
I would like to express my gratitude to all those who contributed to the success of this project.
I would like to thank you warmly for the guidance, unwavering support, and encouragement
that you have given me throughout my project. Your constant availability, your judicious
advice, your comments and corrections were invaluable to the success of this project. I hope
that I have lived up to your expectations.
I would like to thank you very sincerely for your pertinent advice and suggestions that guided
me during the phases of my internship. Your scientific and professional qualities helped me
to succeed in this project.
To all the managers and colleagues of Digital Solutions: Ms. Chantal Ebel, Mr. Christian
Bock, Mr. Moez Hanin, and Ms. Yosr Halleb
I would like to express my sincere gratitude for the welcome and the pleasant integration.
Your availability, your remarks, and your commitment allowed me to follow the right path
for the realization of this project.
I would also like to thank and express my deep respect to the members of the jury for
agreeing to evaluate this work.
I would also like to thank the pedagogical team of ESPRIT as well as the professional
instructors in charge of the training in the IT department, who gave me the theoretical
foundations necessary for this project.
Table of contents
II.6.2: ARIMA models ..........................................................................................25
II.6.3: SARIMA models ........................................................................................26
II.6.4: Topical methods of time series forecasting ..................................................27
II.7: TIME SERIES CLUSTERING .........................................................................28
II.7.1: HIERARCHICAL METHODS ...................................................................29
II.7.2: Partitioning Methods...................................................................................30
II.7.3: Model Based Methods ................................................................................31
II.7.4: Density-Based Methods ..............................................................................32
II.7.5: Deep Clustering Methods for Time-Series Data ..........................................33
II.8: Machine learning for biomedical science ..........................................................34
II.8.1: Machine learning tasks in healthcare ...........................................................35
II.8.2: Benefits of Machine Learning in Healthcare ...............................................35
II.8.3: Time series for biomedicine ........................................................................36
Chapter III: INNER SPEECH RECOGNITION USE CASE ........................................................40
III.1: Contextualization .............................................................................................40
III.2: Related work ....................................................................................................41
III.3: Data description................................................................................................41
III.3.1: Data acquisition ......................................................................................41
III.3.2: Data acquisition process ..........................................................................42
III.3.3: BCI interactive conditions .......................................................................44
III.4: Data Processing and Analysis ...........................................................................45
III.4.1: Feature engineering .................................................................................46
III.4.2: Feature extraction ....................................................................................47
III.4.3: Feature selection .....................................................................................48
III.5: Modeling ..........................................................................................................50
III.5.1: Machine learning models ........................................................................50
III.5.2: Deep Learning models.............................................................................53
III.5.3: Models Evaluation ..................................................................................58
Conclusions and Prospects ...............................................................................................66
Webography ....................................................................................................................68
Résumé ............................................................................................................................70
Abstract ...........................................................................................................................70
List of figures
Figure 1: Logo of SWISS DIGITAL NETWORK ..............................................................2
Figure 2: Logo of ML ARCHITECTS BASEL ..................................................................2
Figure 3: Scrum life-cycle ..................................................................................................4
Figure 4: Time trend in a time series graph ........................................................................9
Figure 5: Seasonal time series .......................................................................................... 10
Figure 6: Female hormone levels during the menstrual cycle............................................ 10
Figure 7: Decomposition of a noisy signal........................................................................11
Figure 8: Example of stationary time series ...................................................................... 12
Figure 9: Example of non-stationary time series ...............................................................12
Figure 10: Autocorrelation ...............................................................................................12
Figure 11: Time series analysis pipeline ........................................................................... 14
Figure 12: Euclidean mapping..........................................................................................15
Figure 13: DTW mapping ................................................................................................ 15
Figure 14: Methods for univariate vs multivariate time series data ................................... 18
Figure 15: Example of time series classification ...............................................................20
Figure 16: KNN and DTW classifier ................................................................................21
Figure 17: Example of a single shapelet ........................................................................... 22
Figure 18: Example of some python time series libraries ..................................................23
Figure 19: Explanation of the ARIMA abbreviation ......................................................... 26
Figure 20: A graph of a seasonal time series..................................................................... 27
Figure 21: Hierarchical clustering dendrogram .................................................................30
Figure 22: K-Means clustering algorithm graph ............................................................... 31
Figure 23: Model based clustering method plot illustration .............................................. 31
Figure 24: Density based clustering method plot illustration ............................................ 32
Figure 25: from SOM to T-DPSOM ................................................................................. 34
Figure 26: Single cell RNA seq analysis steps .................................................................. 37
Figure 27: Benefits of single-cell RNA sequencing for biological discoveries .................. 37
Figure 28: cell clustering analysis pipeline ....................................................................... 38
Figure 29: organization of the recording day for each participant ..................................... 43
Figure 30: trial workflow ................................................................................................. 44
Figure 31: EEG data processing pipeline .......................................................................... 45
Figure 32: A big picture of the idea of PCA algorithm. "Eigenstuffs" are eigenvalues and
eigenvectors. .................................................................................................................... 49
Figure 33: SVM hyperplane .............................................................................. 51
Figure 34: XGBoost illustration .......................................................................................51
Figure 35: KNN illustration .............................................................................................52
Figure 36: CNN architecture ............................................................................................ 54
Figure 37: Deep Convnet architecture ..............................................................................55
Figure 38: Overall visualization of the EEGNet architecture ............................................ 56
Figure 39: Model scores .................................................................................................. 58
Figure 40: Accuracy of binary classification models ........................................................59
Figure 41: Confusion matrix with two class labels ...........................................................60
Figure 42: SVM confusion matrix ....................................................................................61
Figure 43: XGBoost confusion matrix ..............................................................................61
Figure 44: KNN confusion matrix ....................................................................................61
Figure 45: CNN confusion matrix ....................................................................................61
Figure 46: EEGNET confusion matrix .............................................................................61
Figure 47: ConvNet confusion matrix .............................................................................. 61
Figure 48: SVM classification report................................................................................62
Figure 49: XGBoost classification report .........................................................................62
Figure 50: KNN classification report................................................................................62
Figure 51: Loss and accuracy curves of the EEGNET model ............................................64
List of tables
Table 1: Differences between stationary and non-stationary data ...................................... 11
Table 2: time series univariate and multivariate model examples...................................... 19
Table 3: methods for time domain vs frequency domain analysis ..................................... 19
Table 4: EEGNet architecture .......................................................................................... 57
Table 5: Train and validation accuracies of our models .................................................... 60
Table 6: Evaluation metrics for the DL models ................................................................ 63
Abbreviations and Acronyms
AI Artificial Intelligence
DL Deep Learning
EEG Electroencephalography
General introduction
Machine learning research has progressed to the point where specially designed
computers can outperform humans on challenging cognitive tasks. This has lately been
proven in a number of difficult fields, including self-driving vehicles, automatic language
translation, and strategic games.
Health care is one of the most pressing concerns: existing systems, equipment, and
techniques are straining to keep up with rising demand, and machine learning holds
significant promise here. Several converging factors are driving the creation of large-scale, complex
electronic data repositories in healthcare today.
Many biomedical data sets are available as time series, especially in the field of
public health and epidemiology, where indicators are usually collected over time. Clinical
studies with long follow-ups are also sometimes best analysed with time series methods. The
analysis of administrative health care data often gives rise to time series problems too, as
events are frequently converted to counts over a given interval. Finally, some biomedical
measurements may also be viewed as time series, such as EEG recordings.
In this context, our end-of-study project falls within the scope of acquiring the
national engineering diploma. The project's goal is to conduct a study on machine learning
techniques for time series data in the biomedical field.
This report wraps up the project's various tasks and stages. It is divided into three
chapters:
The first chapter will introduce the hosting organization and the project's setting.
We will also present the appropriate methodology for developing the project.
The second chapter will go through a comprehensive study of time series analysis
along with the state of the art of machine learning for biomedical time series.
The final chapter will describe the project's development environment and
execution steps. In this stage, we will present the EEG data processing pipeline.
Finally, we will wrap up our report with a general conclusion that summarizes the work
while also highlighting the project's contribution and potential extensions.
Chapter I: GENERAL PRESENTATION
Introduction
In this first chapter, we shall introduce the general context of the project in addition
to its main objectives. Furthermore, we will present the host company and its main areas of
activity as well as the approaches adopted in this work.
Swiss Digital Network (SDN) is the first independent and open advisory network that
cooperates with innovative IT providers to offer clients an efficient digital cloud
transformation journey. The network was created by senior consulting architects who
combine IT and emerging technologies to assist clients and customers in capitalizing on
innovative projects. The Swiss Digital Network is composed of four cells, each specialized
in key areas related to digital transformation. This project was carried out in the ML
Architects Basel unit.
For speech recognition, we use an open-access EEG dataset from a study conducted by
Nicolás Nieto et al. at Torcuato Di Tella University's Neuroscience Laboratory in
Argentina.
2. To deliver the state of the art in machine learning for biomedical time series data.
3. To test several machine learning models for inner speech classification using EEG.
methodology for the project at hand, we can increase our chances of success and deliver
valuable results.
In our situation, we opted to combine two project management methods: Scrum and CRISP-DM.
I.2.2.a: AGILE/SCRUM
This method is becoming more and more necessary because of the permanent
evaluations it allows, which are considered very useful and effective. Indeed, the SCRUM
method has several advantages: it improves productivity and communication within the
project; it is based on a fixed set of roles, responsibilities, and meetings that never change,
while ensuring flexible and adaptive project management.
SCRUM roles
The Agile philosophy is supported by a set of values, principles, and practices that are
the foundation of SCRUM artifacts.
● The Sprint Backlog is a real-time, highly visible view of the work that the Team plans
to accomplish during the Sprint.
● The Product Backlog is a kind of warehouse that contains all the features of the
product. The tasks must be ordered with discretion according to the priority in which
they must be carried out.
● The product increment is one of the most important SCRUM artifacts of the Agile
culture. During each Sprint, the development team makes a product increment.
During my internship, I was given biweekly goals to achieve. To ensure that I was on track, I
was asked to break down those objectives into daily objectives and send a calendar of my
daily objectives to my supervisor. Additionally, I had daily meetings with my supervisor to
review my progress towards these objectives. At the end of each sprint, there was a review
meeting with the entire team to discuss progress and identify areas for improvement. Finally,
there was a monthly meeting to review overall progress and discuss any challenges or
opportunities for growth. These processes helped to ensure that I was accountable and working
towards achieving my goals throughout my internship.
The combination of these two techniques allowed us to ensure that our project is successful,
with a focus on accomplishing the business objectives.
Conclusion
In this chapter, we presented the context of the project, the host organization, and the
objectives of the project.
We finished by deciding on the methodology that would be used throughout this
project. The following chapter is devoted to the completion of a comprehensive study of
time series analysis.
Chapter II: REVIEW OF LITERATURE
Introduction
In this chapter, we will tackle the different concepts and definitions in order to gain
an understanding of time series, its characteristics, and its main tasks.
Furthermore, we will get an overview of the current state of the art in machine learning for
time series analysis, particularly in the biomedical industry.
Time series data is a set of values gathered and ordered chronologically at regular
intervals. The interval at which data is gathered is known as the time series frequency.
What distinguishes time series data from other types of data is that the analysis may
illustrate how variables change over time. In other words, time is a critical variable since it
indicates how the data evolves over time and shapes the final results. It provides a
supplementary source of information as well as a predetermined sequence of data
dependencies.
In order to ensure consistency and reliability, time series analyses typically require a
large number of data points. A large data set guarantees that your sample size is
representative, and your analysis will be able to sift through any ambiguous data. It also
ensures that any trends or patterns discovered are not outliers and can account for seasonal
variance.
Definition of a time series data set: a data set D of n univariate, variable-length
time series is defined as follows:

D = { [(t_1, v_1), …, (t_{L_1}, v_{L_1})], …, [(t_1, v_1), …, (t_{L_n}, v_{L_n})] } = { T_1, …, T_n }   (1)

where each series T_i is a sequence of L_i (timestamp, value) pairs.
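In code, such a data set can be represented directly as a list of (timestamp, value) sequences; a minimal sketch (the values below are illustrative):

```python
# A data set D of n univariate, variable-length time series: each series
# T_i is a sequence of (timestamp, value) pairs, so lengths may differ.
T1 = [(0, 1.0), (1, 1.5), (2, 0.9)]
T2 = [(0, 2.1), (1, 2.4)]          # a shorter series
D = [T1, T2]

lengths = [len(T) for T in D]      # variable lengths: [3, 2]
```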
In this section, we'll go over time series characteristics and illustrate them with plots.
Observed values are plotted on the y-axis of a time series graph against a time increment on
the x-axis. These graphs can serve as the basis for creating a solid model by visually
highlighting the behavior and patterns of the data.
II.2.1: Trends
Time trends in time series data also have testing and modeling implications. A time
series model's reliability is dependent on properly identifying and accounting for time trends.
II.2.2: Seasonality
Seasonality is another time-series data feature that can be seen visually in time-series
plots. When time series data exhibits regular and predictable patterns at time intervals less
than a year, this is referred to as "seasonality."
Retail sales are an example of a time series with seasonality because they typically
increase from September to December and decrease from January to February.
II.2.5: Stationarity
When all statistical characteristics of a time series remain unchanged by time shifts,
the series is said to be stationary. In technical terms, strict stationarity implies that the joint
distribution of (y_t, …, y_{t+h}) depends only on the lag h and not on the time period t.
Strict stationarity is rarely required in time series analysis. This is not to imply
that stationarity does not play a role in time series analysis. Many time series models are
valid only under the assumption of weak stationarity (also known as covariance stationarity).
Weak stationarity, henceforth stationarity, requires only that:
● A series has the same finite unconditional mean and finite unconditional variance
over all time periods.
● The series autocovariance is time-independent.
Nonstationary time series are any data series that do not satisfy the weakly stationary
time series conditions.
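A crude way to probe these two conditions is to compare the mean and variance of the two halves of a series. The sketch below is only a heuristic with an illustrative tolerance; a formal unit-root test would be preferred in practice:

```python
import numpy as np

def looks_stationary(y, tol=0.5):
    """Crude weak-stationarity heuristic: compare the mean and variance
    of the first and second halves of the series. A formal test (e.g. an
    augmented Dickey-Fuller test) should be preferred in practice."""
    y = np.asarray(y, dtype=float)
    half = len(y) // 2
    a, b = y[:half], y[half:]
    mean_shift = abs(a.mean() - b.mean())
    var_ratio = max(a.var(), b.var()) / max(min(a.var(), b.var()), 1e-12)
    return bool(mean_shift < tol and var_ratio < 1 + tol)

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 500)            # stationary: white noise
trend = noise + np.linspace(0, 10, 500)  # non-stationary: the mean drifts
```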
Figure 8: Example of stationary time series
Figure 9: Example of non-stationary time series
II.2.6: Autocorrelation
The degree of similarity between a given time series and a lagged version of itself
over successive time intervals is referred to as "autocorrelation." In other words,
autocorrelation is used to measure the relationship between a variable's current value and
any previous values to which you have access.
For the sake of comparison, autocorrelation is essentially the same process that you would
go through when calculating the correlation between two different sets of time series values
on your own. The main distinction here is that autocorrelation employs the same time series
twice: once in its original values and again after a few different time periods have passed.
Serial correlation, time series correlation, and lagged correlation are all terms for
autocorrelation. Autocorrelation, in whatever form it is used, is an excellent method for
discovering trends and patterns in time series data that would otherwise go unnoticed.
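The lagged self-comparison described above can be sketched in a few lines of NumPy; the seasonal series below is illustrative:

```python
import numpy as np

def autocorr(y, lag):
    """Sample autocorrelation at a given lag: the correlation of the
    series with a copy of itself shifted by `lag` steps."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return float(np.dot(y[:-lag], y[lag:]) / np.dot(y, y))

t = np.arange(200)
seasonal = np.sin(2 * np.pi * t / 20)   # repeating pattern, period 20

r_full = autocorr(seasonal, 20)   # one full period apart: strong positive
r_half = autocorr(seasonal, 10)   # half a period apart: strong negative
```

The strong peak at the lag matching the period is exactly how autocorrelation exposes hidden repeating patterns.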
There are three major difficulties with time-series analysis. First, many techniques can
only accept input data in the form of a vector of features. Sequence data, regrettably, lack
explicit features. Second, selecting features can be challenging due to the high
dimensionality and expensive computation of the feature space. Third, creating a partitioning
task can be challenging in some applications because the raw data lacks explicit features.
To reduce dimensionality and provide representative features, feature extraction and
similarity measures must be used to handle raw time series data efficiently.
These difficulties prompted the development of the traditional time-series analysis
pipeline, which consists of three different viewpoints: time-series data, similarity metrics,
and feature extraction.
ED is a commonly used metric for time series. It is defined between two time series
X and Y of length L as the square root of the sum of the squared differences between each
pair of corresponding points. As a result, the two time series under comparison must be of
equal length, and the computational cost scales linearly with the length of the temporal
sequence. The distance
between the two time series is determined along the horizontal axis by matching the
corresponding points. The Euclidean distance metric is extremely susceptible to noise and
distortion and cannot cope with one of the series being compressed or stretched. This
method is therefore unreliable, particularly when comparing time series of different
durations.
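A minimal NumPy sketch of the ED computation (the two short series are illustrative):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length time series: the
    square root of the sum of squared point-wise differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    assert len(x) == len(y), "ED requires equal-length series"
    return float(np.sqrt(np.sum((x - y) ** 2)))

d = euclidean_distance([1, 2, 3], [1, 2, 5])   # sqrt(0 + 0 + 4) = 2.0
```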
For instance, we have two distinct, varying-length curves: red and blue.
The two curves follow the same pattern; however, the blue curve is longer than the red. If
we apply the one-to-one Euclidean match, the mapping is not perfectly synced up, and the
tail of the blue curve is left out.
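This mismatch is what DTW avoids: each point may map to several points of the other series. A minimal dynamic-programming sketch follows; the `red`/`blue` series are illustrative stand-ins for the two curves:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW. Unlike the Euclidean distance,
    the two series may have different lengths, since one point can be
    matched against several points of the other series."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch x
                                 cost[i, j - 1],      # stretch y
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# The blue curve repeats points of the red one; DTW aligns them exactly.
red = [1.0, 2.0, 3.0, 2.0, 1.0]
blue = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
d = dtw_distance(red, blue)   # 0.0: same shape, different lengths
```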
II.4.1.c: Correlation
II.4.1.d: Cross-correlation
Cross-correlating two signals produces a new signal whose peaks indicate the
similarity between the original signals; it is used as a distance
metric. However, cross-correlation can be carried out more efficiently in the frequency
domain. Autocorrelation occurs when the signal is correlated with itself, which is useful for
finding repeating patterns. Cross-correlation might be a slow operation in time-series space,
but it corresponds to point-wise multiplication in frequency space. It is also considered the
best distance measure to detect a known waveform in random noise. When processing the
signal, the correlation has a linear complexity in frequency space implementation, which
cannot be achieved by DTW.
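This frequency-domain shortcut can be sketched with NumPy: hide a known waveform in noise, then locate it both by direct cross-correlation and by point-wise multiplication of spectra (the signal and offset below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
waveform = np.array([0.0, 1.0, 2.0, 1.0, 0.0])   # known waveform
signal = rng.normal(0, 0.1, 100)                 # random noise
signal[40:45] += waveform                        # hide the waveform at offset 40

# Time-domain cross-correlation: the peak location reveals the offset.
cc = np.correlate(signal, waveform, mode="valid")
offset = int(np.argmax(cc))

# The same result via point-wise multiplication in frequency space
# (circular cross-correlation through the FFT).
n = len(signal)
spec = np.fft.rfft(signal) * np.conj(np.fft.rfft(waveform, n))
offset_fft = int(np.argmax(np.fft.irfft(spec, n)))
```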
II.4.2.c: K-grams
Transforming time-series data into a set of features does not fully reflect the series'
sequential nature. K-gram is an example of a feature-based approach that uses short sequence
segments of k consecutive symbols to retain the order of components in a series. For
time-series data, the k-gram approach represents a symbolic sequence as a feature vector
expressing the frequency of each k-gram from a given set.
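A k-gram feature vector can be sketched in plain Python; the symbolic sequence below is an illustrative stand-in for a discretised series:

```python
from collections import Counter

def kgram_features(symbols, k):
    """Frequency vector of all k-grams (length-k subsequences) in a
    symbolic sequence, preserving local ordering information."""
    grams = [tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
    return Counter(grams)

# e.g. a series discretised into the symbols 'a' and 'b'
feats = kgram_features("abab", 2)   # {('a','b'): 2, ('b','a'): 1}
```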
DFT is one of the most common transformation methods. It has been used to convert
original time series data into low-dimensionality time-frequency characteristics and index
them in order to perform an effective similarity search.
DFT is used to reduce dimensionality and extract features into an index that can be used for
similarity searching. This technique is constantly being improved, and some of its
shortcomings have been overcome.
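The coefficient-truncation idea behind DFT-based indexing can be sketched with NumPy's FFT (the signal below is illustrative):

```python
import numpy as np

def dft_features(y, k):
    """Keep only the first k DFT coefficients as a low-dimensional
    representation of a series: for smooth signals, most of the energy
    is concentrated in the low frequencies."""
    return np.fft.rfft(np.asarray(y, dtype=float))[:k]

t = np.linspace(0.0, 1.0, 256, endpoint=False)
y = np.sin(2 * np.pi * 3 * t)        # a single low-frequency component

coeffs = dft_features(y, 8)          # 8 complex features instead of 256 samples
dominant = int(np.argmax(np.abs(coeffs)))   # nearly all energy in bin 3
```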
This technique has also been used to transform the original time series and obtain
low-dimensional features that efficiently represent the original time series data.
Analysis tasks with a large set of time-series data face certain challenges in defining
matching features; thus, using wavelet decomposition to reduce data dimensionality is
beneficial. The discrete wavelet transform technique can be used to accurately perform the
analysis task.
II.4.2.f: Shapelets
There are two types of time series models: univariate and multivariate. Univariate
time series models are used when the dependent variable is a single time series. For
example, a model that predicts an individual's heart rate per minute using only past
observations of heart rate is univariate.
Multivariate time series models are used when there are multiple dependent
variables. Each series may rely on the past and present values of the other series in addition
to their own past and present values.
For modeling time series data, two broad approaches have emerged: the time-domain
approach and the frequency-domain approach.
The time-domain approach predicts future values based on past and present values.
The time series regression of a time series' present values on its past values and the past
values of other variables forms the basis of this approach. These regression estimates are
frequently used for forecasting, and this method is popular in time series econometrics.
The idea behind frequency domain models is that time series can be represented as a
function of time using sines and cosines. These are referred to as Fourier representations.
To model the behavior of the data, frequency domain models use regressions on sines and
cosines rather than past and present values.
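A frequency-domain regression of this kind can be sketched with NumPy: regress a synthetic, illustrative seasonal series on a sine and cosine at its seasonal period instead of on its past values:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(120)
# True seasonal signal (period 12, amplitude 3) plus noise.
y = 3.0 * np.cos(2 * np.pi * t / 12) + rng.normal(0, 0.3, 120)

# Fourier representation: regress y on cos and sin at the seasonal
# frequency rather than on its own past values.
X = np.column_stack([np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] should recover the amplitude 3.0; beta[1] should be near 0.
```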
There are numerous algorithms dedicated to time series classification; in this section, we
will introduce four types of time series classification algorithms.
1. Divide the series into random intervals with varying lengths and start positions.
2. Extract summary features (mean, standard deviation, and slope) from each interval
and concatenate them into a single feature vector.
3. Train a decision tree on the extracted features.
4. Repeat steps 1-3 until the required number of trees has been built or time runs
out.
A majority vote of all the trees in the forest is used to classify new series. (In a majority
vote, the class predicted by the most trees is the forest's prediction.)
Experiments have shown that time series forests outperform baseline competitors such as
nearest neighbours with dynamic time warping, while also being computationally efficient.
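Steps 1 and 2 of the recipe above can be sketched with NumPy; step 3 would then fit a standard decision tree (e.g. from a library such as scikit-learn) on the resulting vectors. The interval count and series below are illustrative:

```python
import numpy as np

def interval_features(y, n_intervals, rng):
    """Steps 1-2 of the time series forest recipe: draw random intervals
    of varying length and start position, and summarise each by its
    mean, standard deviation and slope."""
    y = np.asarray(y, dtype=float)
    feats = []
    for _ in range(n_intervals):
        start = int(rng.integers(0, len(y) - 3))
        length = int(rng.integers(3, len(y) - start + 1))
        seg = y[start:start + length]
        slope = np.polyfit(np.arange(length), seg, 1)[0]  # least-squares slope
        feats.extend([seg.mean(), seg.std(), slope])
    return np.array(feats)

rng = np.random.default_rng(3)
series = np.sin(np.linspace(0.0, 6.0, 100))
x = interval_features(series, n_intervals=4, rng=rng)  # 4 intervals x 3 features
```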
Frequency-based classifiers use frequency data extracted from series to train their
models. Random Interval Spectral Ensemble (RISE) is a popular variant of time series forest.
In two ways, RISE differs from time series forest. First, it employs a single time series
interval per tree. Second, rather than summary statistics, it is trained using spectral features
extracted from the series.
5. Ensemble 1–4
Class probabilities are calculated as a proportion of base classifier votes. RISE manages
the run time by constructing an adaptive model of the time required to build a single tree.
This is critical for long series (such as audio), where very large intervals can result in a small
number of trees.
A single "shapelet" is an interval in a time series. The intervals in any series can be
enumerated. For example, [1,2,3,4] has 5 intervals: [1,2], [2,3], [3,4], [1,2,3], and [2,3,4].
Shapelet-based classifiers search for shapelets with discriminatory power. These shapelet
features can then be used to interpret a shapelet-based classifier. The presence of certain
shapelets increases the likelihood of one class over another. The Shapelet Transform
Classifier begins by identifying the top k shapelets in the dataset. The new dataset's k features
are then computed. Each feature is calculated as the series' distance from each of the k
shapelets, with one column for each shapelet. Finally, the shapelet-transformed dataset can
be subjected to any vector-based classification algorithm.
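The per-shapelet feature described above (the distance from a series to one shapelet) can be sketched with NumPy; the shapelet and series below are illustrative:

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Distance from a series to a shapelet: the minimum Euclidean
    distance over all windows of the series with the shapelet's length."""
    series = np.asarray(series, dtype=float)
    s = np.asarray(shapelet, dtype=float)
    m = len(s)
    return float(min(np.linalg.norm(series[i:i + m] - s)
                     for i in range(len(series) - m + 1)))

# One column of the shapelet-transformed dataset: the distance of each
# series to one discriminative shapelet.
shapelet = [0.0, 1.0, 0.0]                      # a "spike" shape
has_spike = [0.0, 0.0, 1.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0, 0.0, 0.0]

d_spike = shapelet_distance(has_spike, shapelet)  # 0.0: shapelet present
d_flat = shapelet_distance(flat, shapelet)        # 1.0: no window contains it
```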
There are more great Python time series libraries to discover, such as sktime,
tslearn, tsfresh, Prophet, and pyts. Each of these libraries has a unique approach to
dealing with time series learning problems such as regression, classification, and forecasting.
Their differences lie in the methodologies they employ, which range from standard statistical
approaches (e.g., ARIMA) to dynamic time series warping, symbolic time series
approximations, and others.
In this section, I will discuss two popular machine learning approaches for time series
classification: ROCKET and deep learning.
Deep learning approaches, on the other hand, tend to borrow or modify structures
often related to computer vision and natural language processing (NLP). People are
experimenting with ResNets, Transformers, LSTMs, CNNs, Temporal Convolutional
Networks, Wavelet-based approaches, and other combinations of these methods. These can
take a lot longer to train, not to mention the time spent on hyperparameter tuning. Still, it
may be worthwhile to investigate how deep learning models perform on your data.
Tsai is an open-source deep learning package built on PyTorch and Fastai that
focuses on state-of-the-art approaches for time series problems such as classification,
regression, forecasting, and imputation. So, what does Tsai have to offer? It includes a
variety of deep learning architectures built with the PyTorch and Fastai libraries, as well as
ROCKET and MiniROCKET classification and regression models.
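To give a flavour of what ROCKET does, here is a rough NumPy sketch of its core idea: convolve the series with many random kernels and pool each convolution into a couple of features. The real algorithm also randomizes dilation, padding, and bias, so this is only an illustration:

```python
import numpy as np

def rocket_features(series, n_kernels=100, seed=0):
    """ROCKET-style transform (simplified): random convolutional kernels,
    each pooled into two features, the maximum and the proportion of
    positive values (PPV)."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.normal(0.0, 1.0, length)
        conv = np.convolve(series, weights, mode="valid")
        feats.append(conv.max())           # max pooling
        feats.append((conv > 0).mean())    # PPV pooling
    return np.array(feats)

x = np.sin(np.linspace(0, 10, 200))
f = rocket_features(x)
print(f.shape)  # two features per kernel -> (200,)
```

A fast linear classifier (e.g. ridge regression) is then trained on these features, which is what makes ROCKET so cheap compared with deep networks.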
While time series analysis focuses on understanding the dataset, forecasting focuses
on predicting it. Time series analysis comprises methods for analyzing time series data in
order to extract meaningful statistics and other characteristics of the data. Time series
forecasting is the use of a model to predict future values based on previously observed
values.
In other words, time series forecasting is a technique for predicting events over a
period of time. It forecasts future events by analyzing historical trends, with the assumption
that future trends will be similar to historical trends.
Autoregression is a time series model that predicts the value at the next time step by
using observations from previous time steps as input to a regression equation. In
autoregressive models, we assume a linear relationship between the value of a variable at
time t and the values of the same variable at the past times t − 1, t − 2, ..., t − p:

y_t = c + β_1·y_(t−1) + β_2·y_(t−2) + ⋯ + β_p·y_(t−p) + ε_t   (3)

Here, p denotes the autoregressive model's lag order, i.e. the number of lagged
observations included in the regression.
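The coefficients of equation (3) can be estimated by ordinary least squares on lagged copies of the series; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model y_t = c + sum_i beta_i * y_(t-i) + eps_t
    by ordinary least squares."""
    y = np.asarray(y, dtype=float)
    # design matrix: a constant column plus the p lagged columns
    X = np.column_stack([np.ones(len(y) - p)] +
                        [y[p - i:len(y) - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, beta_1, ..., beta_p]

# A noiseless AR(1) process y_t = -0.8 * y_(t-1) is recovered exactly.
y = [1.0]
for _ in range(40):
    y.append(-0.8 * y[-1])
c, b1 = fit_ar(y, 1)
```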
For an ARIMA model,
● The integrated component refers to the use of data transformation to make the data
stationary by subtracting past values of a variable from current values of a variable.
● The moving average component denotes the relationship between the dependent
variable and previous values of a stochastic term.
The order of these components is used to describe an ARIMA model, denoted by the
notation ARIMA(p, d, q), where:
● p is the order of the autoregressive component (the number of lag observations
included in the model);
● d is the degree of differencing (the number of times the raw observations are
differenced);
● q is the order of the moving average component (the size of the moving average
window).
The Box-Jenkins method for estimating ARIMA models is made up of several steps:
model identification (choosing p, d, and q), parameter estimation, and diagnostic
checking of the residuals.
How do we know we should use the seasonal ARIMA (SARIMA) model? The series
plotted above shows a very clear W-shaped pattern repeating, so we clearly have
seasonality.
In SARIMA(P, D, Q)m, m is the seasonal factor: it denotes the number of time steps
in a single seasonal period. If, as in the graph above, each year is divided into four
quarters, m would be 4. Apart from applying to the seasonal components, (P, D, Q) are
the analogues of (p, d, q) in the ARIMA model.
Although the analysis of image datasets is considered their main field of application,
convolutional neural networks can show even better results than RNNs in time series
prediction cases involving other types of spatial data. For one thing, they learn faster,
boosting the overall data processing performance. However, CNNs can also be combined
with RNNs to get the best of both worlds: the CNN recognizes spatial patterns and passes
them to an RNN that models the temporal dependencies.
II.6.4.c: LightGBM
This is a widely used ML algorithm that is mostly focused on capturing complex patterns
within tabular datasets. In some cases, LightGBM outperforms the traditional ARIMA
approach when it comes to making tabular-based predictions.
ML-based decision trees are used to classify items in the database. Generated classes get
dedicated multivariate time series models that help predict the future price of a certain item.
II.6.4.e: XGBoost
This is a machine learning algorithm that works with tabular and structured data. At its
core lie gradient-boosted decision trees.
II.6.4.f: AdaBoost
Several algorithms have been improved to deal with time-series data. Most works involving
time series clustering fall into one of three categories.
The first is whole time-series clustering, in which a set of individual time series is
given and the goal is to group similar time series into clusters based on their similarity.
The third category is a grouping of time points based on their temporal proximity and
similarity of corresponding values. Some points may not be assigned to any clusters and are
thus classified as noise.
The choice of distance measure is critical for most time-series analysis techniques,
including clustering. The choice of distance measure is widely regarded as more important
than the clustering algorithm itself. The choice of feature extraction technique also has a
significant impact on the quality of clustering methods.
As a result, time-series clustering relies primarily on traditional clustering methods,
either by replacing the default distance measure with one more appropriate for time series or
by transforming time series into "flat" data so that existing clustering algorithms can be used
directly.
Next, we will be presenting various types of methods and clustering algorithms used
for time-series data.
Hierarchical clustering defines a tree structure for unlabeled data by aggregating data
samples into a tree of clusters. Unlike k-means, this method does not assume a value for K.
Hierarchical clustering methods are classified into two types: agglomerative (bottom-up) and
divisive (top-down).
Hierarchical clustering is typically accomplished by sequentially merging similar
clusters, as illustrated in the figure below. This is referred to as agglomerative hierarchical
clustering. In theory, it is also possible to accomplish this by first grouping all of the
observations into a single cluster and then successively splitting these clusters. This is
referred to as divisive hierarchical clustering. In practice, divisive clustering is rarely used.
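The agglomerative procedure can be sketched in plain Python, here with single-linkage merging on one-dimensional points (an illustration, not a production implementation):

```python
def agglomerative(points, k):
    """Bottom-up hierarchical clustering: start with one cluster per point
    and repeatedly merge the two closest clusters (single linkage) until
    only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance of the closest pair across clusters
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

print(agglomerative([1.0, 1.2, 5.0, 5.1, 5.3], k=2))
```

Recording the sequence of merges instead of stopping at k clusters yields the full tree (dendrogram).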
Density-based clustering connects areas of high density into clusters. This permits
clusters of any shape, as long as their dense regions can be connected. High-dimensional
data and data with varying densities present challenges for these algorithms. Furthermore,
these algorithms are not intended to assign outliers to clusters; objects in sparse areas are
usually considered to be noise or border points.
Density-based clustering for time-series data has some advantages; it is a fast
algorithm that does not require pre-setting the number of clusters, is able to detect arbitrary
shaped clusters as well as outliers, and uses easily comprehensible parameters such as spatial
closeness.
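These properties can be illustrated with a minimal sketch in the spirit of DBSCAN (one-dimensional points and simplified border-point handling; eps and min_pts are the usual spatial-closeness parameters):

```python
def dbscan(points, eps, min_pts):
    """Density-based clustering sketch: core points (at least min_pts
    neighbours within eps, counting themselves) are expanded into clusters;
    points reachable from no core point are labelled noise (-1)."""
    n = len(points)
    labels = [None] * n
    neighbours = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
                  for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] is not None or len(neighbours[i]) < min_pts:
            continue
        labels[i] = cluster               # grow a new cluster from this core
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] is None:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:   # j is itself a core point
                    frontier.extend(neighbours[j])
        cluster += 1
    return [-1 if lab is None else lab for lab in labels]

labels = dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 99.0], eps=0.3, min_pts=2)
print(labels)  # two dense groups; the isolated point is noise
```

Note that the number of clusters is discovered from the data, not pre-set, and the isolated point is reported as noise rather than forced into a cluster.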
Last but not least, there is the DENCLUE algorithm, which uses kernel density
estimation to estimate the unknown probability density function of the random variable
that generated the data sample.
The Self-Organizing Map (SOM), on the other hand, is a clustering method that
provides such an interpretable representation. By inducing a flexible neighborhood structure
over the clusters, it generates a low-dimensional (typically 2-dimensional) discretized
representation of the input space. Unfortunately, its performance is strongly dependent on
the complexity of the data sets used, and, like other classical clustering methods, it typically
performs poorly on complex, high-dimensional data. While the SOM is extremely useful for
data visualization, only a few methods have attempted to combine it with DNNs. To address
the aforementioned issues, we propose Probabilistic SOM (PSOM), a novel method of fitting
SOMs with probabilistic cluster assignments. We also extend this PSOM to a deep
architecture, the Deep Probabilistic SOM (DPSOM), which trains a VAE and a PSOM
simultaneously to achieve an interpretable discrete representation while exhibiting cutting-
edge clustering performance.
Instead of assigning data points to specific clusters, our model employs centroid-
based probability distributions. It reduces their Kullback-Leibler divergence with respect to
auxiliary target distributions while enforcing a SOM-friendly space.
Deep clustering:
Recent clustering analysis work has shown that using deep neural networks (DNNs)
in conjunction with clustering algorithms significantly improves clustering performance. In
that case, DNNs are used to embed the data set into a space that is better suited for clustering.
Similarly, DCN combines a k-means clustering loss with the reconstruction loss of
SAE to produce an end-to-end architecture that trains representations and clusters
concurrently. These models achieve state-of-the-art clustering performance, but they do not
investigate the relationship among clusters.
Biomedical science is also an industry that evolves with the times. With the amount
of data generated for each patient, machine learning algorithms in biomedicine have great
potential. It is no surprise, then, that there are numerous successful machine learning
applications in healthcare right now.
From the large-scale analysis of genomic data advancing personalized medicine to
the solving of a 50-year-old challenge in biology by predicting protein folding from amino
acid sequences, there’s no doubt machine learning is enabling breakthroughs that are shaping
the future of biomedical research.
Machine learning techniques can be applied to solve a wide variety of tasks. When it
comes to applications of machine learning in biomedicine, these tasks include:
● Classification: can help to determine and label the kind of disease or medical case
you’re dealing with;
● Recommendations: can offer necessary medical information without the need to
actively search for it;
● Prediction: using current data and common trends, machine learning can make a
prognosis on how the future events will unfold;
● Clustering: can help to group together similar medical cases to analyse the patterns
and conduct research in the future;
● Anomaly detection: Using machine learning in healthcare, you can identify things
that deviate from common patterns and determine whether any actions are required.
● Automation: machine learning can handle standard repetitive tasks that take too
much time and effort from doctors and patients, like data entry, appointment
scheduling, inventory management, etc.;
● Ranking: machine learning can put the relevant information first, making the search
for it easier.
The increase in diagnostic accuracy is the second important role of machine learning
in healthcare. Machine learning, for example, has been shown to be 92% accurate in
predicting the mortality of COVID-19 patients.
Third, applying machine learning to medicine can aid in the development of a more
precise treatment plan. A lot of medical cases are unique and require a special approach for
effective care and side-effect reduction. Machine learning algorithms can simplify the search
for such solutions.
Machine learning was designed to deal with large data sets, and patient files are
exactly that: a large number of data points that require careful analysis and organization.
Another reason for using machine learning techniques in healthcare is that they
eliminate human involvement to some degree, which reduces the possibility of human error.
This especially concerns process automation tasks, as tedious routine work is where humans
err the most.
We will now concentrate on time series in medicine, because medicine is essentially a
time series problem, one that clinical professionals frequently deal with.
In this section, I will discuss a time series clustering use case, cell clustering analysis for
single cell sequencing, and Zhuo Wang et al.'s study.
Single-cell RNA sequencing is a technique that extracts RNA from all cells and
quantifies the sequenced RNA as well as the expression for each cell, providing us with
granular resolution expression profiles at the cellular level and allowing us to compare
expression between cells.
The development of single-cell RNA sequencing has allowed for profound biological
discoveries ranging from the dissection of complex tissue composition to the identification
of novel cell types and dynamics in some specialized cellular environments.
The study presents an algorithm based on the Dynamic Time Warping score
(DTWscore) combined with time-series data that enables the detection of gene expression
changes across scRNA-seq samples and the recovery of potential cell types from complex
mixtures of multiple cell types.
The method pipeline is described and illustrated in the figure below. To begin,
perform a traditional filter step to remove low-quality cells. Second, calculate the mean
DTW distance between all pairs of cells as an index for detecting a specific set of genes for
heterogeneity analysis. To reduce the bias toward extreme values, we must normalize the
DTW distance index values. Following normalization, the genes with the highest
DTWscores are selected for further study and are referred to as the most significantly
highly variable genes. The output can then be used to categorize the various types of cells.
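The DTW distance that underlies the DTWscore can be computed with the classic dynamic-programming recurrence; a minimal sketch:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences, using the
    classic dynamic-programming recurrence with absolute-difference cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # insertion
                                 D[i][j - 1],       # deletion
                                 D[i - 1][j - 1])   # match
    return D[n][m]

# A time-shifted copy of an expression profile stays close under DTW,
# even though it is far away pointwise.
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1]))
```

This tolerance to temporal misalignment is exactly why DTW suits expression profiles whose dynamics are similar but not synchronized.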
In this section, I will present a time series anomaly detection use case, ECG anomaly
detection use case, and a study that was conducted by Hongzu Li et al. using heartbeats,
which is one of the important vital signs that is typically collected in such a medical sensor
monitoring system.
Automatic collection of vital sign data enables remote medical monitoring and
diagnosis for improved energy efficiency. These real-time streams can be processed by
intermediate storage nodes to detect any anomalies. Once identified, only the abnormal data
must be sent to the physician for further diagnosis, while the rest of the normal data can be
archived at the local storage nodes. An anomaly detection scheme is required to determine
whether a real-time sensor data stream contains abnormal data.
Conclusion
This chapter provided an overview of the characteristics of time series and their
various tasks. We focused on three fundamental tasks, namely time series classification,
forecasting and clustering. We conclude by discussing two biomedical time series
applications.
Chapter III: INNER SPEECH RECOGNITION
USE CASE
Introduction
In the previous part, we have provided an overview of time series analysis and ML
with biomedical time series. In this part, we address the last key goal of our internship,
which is to present the inner speech recognition use case. We begin in this chapter by stating
and further clarifying the problem we are trying to solve. We then examine the relevant
literature and the available remedies, focusing on their flaws and limitations. Next, in this
chapter, we begin to describe our dataset and its acquisition process. Finally, we will present
the ML pipeline we followed in order to create a model that is able to classify EEG signals.
III.1: Contextualization
Neural engineering research has made tremendous strides in decoding motor or visual
neural signals in order to assist and restore lost function in patients with disabling
neurological diseases. The development of assistive devices that restore natural
communication in patients with intact language systems but limited verbal communication
due to a neurological disorder is an important extension of these approaches. Several brain-
computer interfaces have enabled relevant communication applications, such as moving a
cursor on the screen and spelling letters.
Although this kind of interface has been found to be helpful, patients have had to learn
to modulate their brain activity in an unnatural and counterintuitive manner, i.e., performing
mental tasks like spinning a cube, making calculations in their heads, moving in order to
operate an interface, or identifying letters presented quickly on a screen, as in the P300-
speller.
A communication system that can directly infer inner speech from brain signals would
be advantageous for people with speech impairments as a substitute, enabling them to
communicate with the outside world in a more natural way. Inner speech, also known as
imagined speech, hidden speech, silent speech, speech imagery, or verbal thought, refers
to the capacity to produce inner speech representations in the absence of external speech
stimulation or of internally produced overt speech.
People with paralysis may be able to type sentences letter by letter at up to 10 words
per minute with the aid of assistive devices, but that is a far cry from the 150 words per
minute average of everyday conversation.
The data we will be working with comes from a study conducted by Nicolás Nieto,
Hugo Leonardo Rufiner, Victoria Peterson, and Ruben Spies at Torcuato Di Tella
University's Neuroscience Laboratory in Argentina. The data is a multi-speech-related BCI
dataset consisting of EEG recordings with 128 active EEG channels and 8 external active
EOG/EMG channels having a 24 bits resolution and a sampling rate of 1024 Hz, from ten
naïve BCI users, performing four mental tasks in three different conditions: inner speech,
pronounced speech, and visualized condition. In a single day of recording, each participant
completed between 475 and 570 trials, yielding a dataset with more than 9 hours of nonstop
EEG data collection and more than 5600 trials.
The participants are ten healthy right-handed individuals with a mean age of 34
(standard deviation: 10 years), four females and six males. None of the participants have any
speech or hearing impairments, and none have any neurological, motor, or psychiatric
disorders.
As depicted in Fig., each subject took part in a single recording day that included
three separate sessions. To avoid boredom and fatigue between sessions, a self-selected
break period (inter-session break) was provided. Each session began with a baseline of
fifteen seconds, during which the participant was instructed to unwind and maintain as much
stillness as possible.
Within each session, five stimulation runs were presented. Those runs match the
various conditions that have been put forth, including the pronounced speech, inner speech,
and visualized conditions. At the beginning of each run, the condition was announced on the
computer screen for a period of 3 seconds. In all cases, the order of the runs was: one
pronounced speech, two inner speeches, and two visualized conditions. Runs were separated
by an inter-run break of one minute.
The classes were specifically selected taking into account a natural BCI control
application, using the Spanish words "arriba", "abajo", "derecha", and "izquierda" (i.e.
"up", "down", "right", and "left", respectively). The trial's class (word) was chosen at
random. In the first and second sessions, each participant had 200 trials. Nonetheless,
depending on their willingness and fatigue, not all participants completed the same number
of trials in the third session.
Figure 29 depicts the trial composition as well as the relative and cumulative times.
Each trial began at time t = 0 seconds and had a concentration interval of 0.5 seconds. A new
visual cue would be presented to the participant shortly. The participant was instructed to
maintain a fixed gaze on a white circle that appeared in the center of the screen and not blink
until the trial's conclusion. The cue interval started at time t = 0.5 seconds. A white triangle
with an arrow pointing in one of four directions was displayed. The direction of the cue
pointing corresponded to each class. After 0.5 seconds, or at t = 1 second, the triangle
disappeared from the screen, at which point the action interval began. As soon as the visual
cues vanished and the white circle appeared on the screen, participants were instructed to
start completing the indicated task. The white circle turned blue and the relaxation interval
started after 2.5 seconds of the action interval, or at t = 3.5 seconds. The participant was told
in advance to stop the activity at this point but not to blink until the blue circle vanished. At
t = 4.5 seconds, the blue circle disappeared, signifying that the trial was over. A rest interval,
varying in length from 1.5 seconds to 2 seconds, was allowed between trials.
The dataset was designed with the primary goals of decoding and understanding the
processes involved in the generation of inner speech, as well as analyzing its potential use
in BCI applications, in mind. As described in the “Background & Summary” Section, the
generation of inner speech involves several complex neural network interactions. To localize
the main activation sources and analyse their connections, we asked the participants to
perform the experiment under three different conditions: inner speech, pronounced speech,
and visualized condition.
Inner speech condition is the primary condition of the dataset, and it seeks to identify
the electrical activity in a participant's brain associated with their thought about a specific
word. During the inner speech runs, participants were instructed to imagine their voice as if
they were giving a direct order to the computer, repeating the corresponding word until the
white circle turned blue. Each participant was explicitly instructed not to concentrate on the
articulation gestures. In addition, each participant was instructed to remain as still as
possible, with no movement of the mouth or tongue. For the sake of natural imagination, no
rhythm cue was provided.
Although motor activity is mainly related to the imagined speech paradigm, inner
speech may also show activity in the motor regions. The pronounced speech condition was
proposed to identify motor regions involved in pronunciation that matched those activated
during the inner speech condition. During the pronounced speech runs, each participant was
instructed to repeat aloud the word corresponding to each visual cue, as if giving a direct
order to the computer. No rhythm cue was provided, as was the case with the inner speech
runs.
This condition was proposed because the selected words have a high visual and
spatial component, and with the goal of finding any activity related to that being produced
during inner speech. Participants in the visualized condition runs were instructed to
concentrate on mentally moving the circle in the center of the screen in the direction
indicated by the visual cue.
As indicated in Figure 31, the majority of recent applications follow a consistent path
for EEG data processing. Raw EEG data are preprocessed primarily to remove artifacts and
noise. Then, pertinent characteristics of brain activity are retrieved, and these characteristics
are classified to define a mental state.
The preprocessing phase may include signal acquisition, artifact removal, averaging,
thresholding of the output, signal augmentation, and edge detection. The elimination of
artifacts is the most crucial phase in this stage and many other signal processing applications.
There are several sources of artifacts in raw EEG signal recordings. They are disruptions
that can arise during signal collection and affect the interpretation of the signals themselves.
If noise is not appropriately addressed, it might have a negative impact on the useful
characteristics of the original signal. Muscular activity, eye blinking during the signal
collecting operation, and power line electrical noise might be causes of artifacts. Thus, a
transformation procedure was created to restructure the continuous raw data into a more
compact dataset and to make their use easier. Such processing was performed in Python,
primarily with the MNE library. A function was created that allows for the rapid loading of
raw data corresponding to a specific participant and session.
The first step in the signal processing procedure was to ensure that the events in the
signals were correctly tagged. Missing tags were identified, and a method for correcting
them was proposed. Because the BioSemi acquisition system is "reference-free," the
Common-Mode (CM) voltage is recorded in all channels, necessitating a re-reference step.
This procedure was carried out using the MNE reference function and channels EXG1 and
EXG2. This step removes the CM voltage and aids in the reduction of line noise (50 Hz) and
body potential drifts. A zero-phase bandpass finite impulse response filter was used to filter
the data, with the lower and upper cutoff frequencies set to 0.5 Hz and 100 Hz,
respectively. A 50 Hz notch filter was also used. The data was decimated by a factor of
four, resulting in a final sampling rate of 256 Hz. The [channels × samples] matrices
corresponding to each trial were stacked into a final tensor of size
[trials × channels × samples].
The continuously recorded data were then extracted, retaining only the 4.5 s signals
corresponding to the time window between the start of the concentration interval and the end
of the relaxation interval.
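The restructuring described above can be sketched with NumPy as follows (the function name, event onsets, and fake data are illustrative; the actual processing was done with the MNE library):

```python
import numpy as np

def epoch(data, onsets, sfreq, t_len=4.5):
    """Cut a continuous [channels x samples] recording into fixed-length
    trials and stack them into a [trials x channels x samples] tensor."""
    n_samp = int(t_len * sfreq)
    trials = [data[:, s:s + n_samp] for s in onsets]
    return np.stack(trials)

rng = np.random.default_rng(0)
raw = rng.normal(size=(8, 10_000))              # 8 channels of fake EEG
tensor = epoch(raw, onsets=[0, 2000, 4000], sfreq=256)
print(tensor.shape)  # (3 trials, 8 channels, 1152 samples)
```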
The art of creating useful features from existing data is known as "feature
engineering." It entails transforming data into forms that are more closely related to the
underlying target to be learned. When done correctly, feature engineering can add value to
your existing data while also improving the performance of your machine learning models.
This phase can be broken down into two steps: first, extract more significant features
from the original input data; second, choose the best features from the new dataset.
There are three primary information sources that may be derived from EEG readings:
spatial information (for multichannel EEG), spectral information (power in frequency
bands), and temporal information (time windows-based analysis).
Fourier analysis is a common signal processing technique for transforming data from
the time domain to the frequency domain or vice versa. Both continuous and discrete
temporal signals are applicable to this technique. It is based on the premise that every signal
may be approximated or represented by the sum of trigonometric functions. FFT is a method
that calculates the Discrete Fourier Transform (DFT) or inverse of a sequence. It yields the
exact same result as evaluating the DFT definition directly, but considerably more quickly.
X_k = Σ_{n=0}^{N−1} x_n · e^{−2πikn/N},   k = 0, …, N − 1   (4)
Where:
X_k = the k-th coefficient of the DFT of x
x_n = the n-th sample of the input sequence
N = the number of samples in the sequence
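Equation (4) can be evaluated directly and checked against NumPy's FFT, which computes the same quantity far faster:

```python
import numpy as np

def dft(x):
    """Direct evaluation of the DFT definition:
    X_k = sum_n x_n * exp(-2*pi*i*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

x = np.array([1.0, 2.0, 1.0, -1.0])
# The FFT yields the exact same coefficients in O(N log N) time.
assert np.allclose(dft(x), np.fft.fft(x))
```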
In this work, the power spectral density (PSD) was employed as the feature extraction
approach. PSD is a typical signal processing approach that displays the signal's energy as a
function of frequency by distributing its power over frequency. We have employed the
Welch technique in accordance with the PSD.
The Welch technique is a modified segmentation scheme used to assess the average
periodogram. In general, the Welch technique of the PSD may be expressed by the following
equations. The periodogram of the i-th windowed segment of the signal is first defined:

P_i(f) = (1 / (M·U)) · | Σ_{n=0}^{M−1} x_i(n) · w(n) · e^{−j2πfn} |²   (5)

The Welch power spectrum is then the average of these periodograms over the L segments:

P_Welch(f) = (1 / L) · Σ_{i=1}^{L} P_i(f)   (6)

where M is the segment length, w(n) is the window function, and U is a normalization
factor for the window's power.
After this procedure, each signal instance of the trial will be converted into a feature
vector of size 1 × m where m is the number of features extracted. The final dataset is a matrix
of shape n × m where n is the number of trials for all the subjects.
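A simplified NumPy sketch of equations (5) and (6), using non-overlapping Hann-windowed segments (Welch's method as usually implemented also overlaps segments by 50%):

```python
import numpy as np

def welch_psd(x, fs, seg_len=256):
    """Welch PSD estimate: window each segment, take its periodogram
    (equation (5)), and average the periodograms (equation (6))."""
    w = np.hanning(seg_len)
    U = np.sum(w ** 2)                  # normalization for the window power
    segs = [x[i:i + seg_len]
            for i in range(0, len(x) - seg_len + 1, seg_len)]
    periodograms = [np.abs(np.fft.rfft(s * w)) ** 2 / (U * fs) for s in segs]
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)
    return freqs, np.mean(periodograms, axis=0)   # average over L segments

fs = 1024.0
t = np.arange(8192) / fs
x = np.sin(2 * np.pi * 48.0 * t)        # a pure 48 Hz tone
freqs, psd = welch_psd(x, fs)
peak = freqs[np.argmax(psd)]            # the PSD peaks at the tone frequency
```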
III.4.3.a: PCA
PCA is a straightforward method for reducing the number of variables
in a data collection while retaining as much information as feasible.
In our work, we used PCA for dimension reduction. The objective function of PCA
is max_u uᵀCu subject to uᵀu = 1, where C is the covariance matrix of the data and the
vector u ∈ Rᵈ is the projection direction.
Figure 32: A big picture of the idea of PCA algorithm. "Eigenstuffs" are eigenvalues and eigenvectors.
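A minimal NumPy sketch of this objective: the directions u that maximize uᵀCu under the unit-norm constraint are the leading eigenvectors of the covariance matrix C ("eigenstuffs" in the figure's terms):

```python
import numpy as np

def pca(X, n_components):
    """Project data onto the directions maximizing u^T C u with u^T u = 1,
    i.e. the top eigenvectors of the covariance matrix C."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending order for symmetric C
    order = np.argsort(eigvals)[::-1]        # sort eigen-pairs descending
    return Xc @ eigvecs[:, order[:n_components]]

rng = np.random.default_rng(0)
# 200 points stretched along the first axis: one direction dominates.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Z = pca(X, 1)
print(Z.shape)  # (200, 1)
```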
RFE requires the retention of a specific number of features, but it is frequently unknown in
advance how many features are valid. Cross-validation is used with RFE to score several
feature subsets and pick the highest-scoring collection of features in order to determine the
optimal number of features. The RFECV visualizer plots the number of features in the model
together with their cross-validated test score and variability, and visualizes the number of
features picked. In the end, we only choose features that have been ranked as essential by at
least two different models.
III.5: Modeling
After analysing our data, it is time to develop a classification model that can predict
to which class an EEG signal belongs. We examine the performance of six classification
models in this section: three deep neural networks and three machine learning models.
The goal of the support vector machine is to locate, in an N-dimensional space
(where N is the number of features), a hyperplane that distinctly classifies the data
points. Many different hyperplanes could be chosen to separate the two classes of data
points.
The goal is to locate a plane that has the greatest possible margin, which may be
understood as the greatest possible distance between data points of both classes. When the
margin distance is maximized, some reinforcement is provided, allowing for subsequent data
points to be categorized with a greater level of confidence.
Gradient Boosting is a specific member of the boosting family of algorithms. Boosting is
the process of combining a number of "weak learners" into a single "strong learner": a
number of algorithms with poor individual performance are combined into a single
algorithm that is significantly more effective. The transformation from "weak learners"
into "strong learners" is accomplished by repeatedly calling them to estimate a variable
of interest.
Within the context of a classification, each individual is assigned a weight that is
constant at the outset and that, if a model is incorrect, is increased prior to estimating the
next model (which will thus take these weights into account). The update of the weights will
be computed using the stochastic gradient descent method.
The graphic above shows that similar data points usually lie close to one
another. The KNN algorithm relies on this assumption being true often enough for the
algorithm to be useful. KNN combines the concept of similarity (also known as distance
or proximity) with some elementary mathematics, namely computing the distance between
points on a graph. For a given query, the algorithm proceeds as follows:
● For each observation in the data, compute the distance between the query and
the current observation.
● Add the distance and the index of the corresponding observation to an ordered
collection.
● Sort this collection of distances and indices from smallest to greatest
distance (in ascending order).
● Pick the first k entries from the sorted collection (the k nearest
neighbours).
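The steps above, plus the final majority vote over the k labels, can be sketched in a few lines of NumPy; the helper name `knn_predict` and the toy data are ours, for illustration only:

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # 1. Distance between the query and every training observation.
    dists = np.linalg.norm(X_train - query, axis=1)
    # 2-3. Pair each distance with its index and sort in ascending order.
    order = np.argsort(dists)
    # 4. Keep the k nearest neighbours and take a majority vote on their labels.
    k_labels = y_train[order[:k]]
    values, counts = np.unique(k_labels, return_counts=True)
    return values[np.argmax(counts)]

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.0])))
```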
Three deep learning models were used to decode the EEG signals:
a standard convolutional neural network, a deep ConvNet architecture inspired by
computer vision models, and a compact CNN designed specifically for EEG-based BCIs.
Deep learning (DL) has sparked interest in numerous fields because of its better
performance. DL is capable of dealing with nonlinear and nonstationary data and learning
underlying characteristics from signals. For the categorization of EEG signals, certain deep
learning approaches are used. Because of their capacity to learn characteristics from small
receptive fields, CNNs have been frequently employed in EEG categorization. CNNs are
appropriate for complex EEG recognition tasks because the trained detector may be utilized
to identify abstract characteristics via convolutional layer repetition. They have obtained
good results and are widely used by many researchers.
Owing to its structure, a CNN loosely imitates the complex cerebral cortex of the
human brain. Given a sufficiently large training dataset, it trains a complex model
that learns features using backpropagation and gradient-descent optimization, and
extracts them through a sequence of filtering, normalization, and nonlinear activation
operations.
Recent literature has indicated that there is promise in using Convolutional neural
networks (deep ConvNets) for EEG classification. Effective computer vision architectures
served as inspiration for our ConvNet model.
Our deep ConvNet featured four convolution-max-pooling blocks, as shown in the picture
below: a first block specifically designed to handle EEG input, followed by three
standard convolution-max-pooling blocks and a dense softmax classification layer.
Due to the high number of input channels, the first convolutional block was split into
two convolutional layers: a first convolution in time and a second convolution in
space, across electrodes; each filter of the spatial convolution has weights for all
electrodes and for all filters of the preceding temporal convolution.
The model uses exponential linear units (ELUs) as activation functions:
f(x) = x for x > 0 and f(x) = e^x − 1 for x ≤ 0.
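The ELU activation can be written directly in NumPy; this is only an illustration of the formula, not part of the trained network:

```python
import numpy as np

def elu(x):
    # f(x) = x for x > 0, f(x) = e^x - 1 for x <= 0
    return np.where(x > 0, x, np.exp(x) - 1.0)

print(elu(np.array([-2.0, 0.0, 3.0])))
```

Unlike ReLU, ELU saturates smoothly to −1 for large negative inputs, which keeps gradients flowing.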
III.5.2.c: EEGNET
Because EEG pre-processing steps are often very specific to the EEG feature of
interest, other potentially relevant EEG features may be excluded from the analysis.
An EEG-specific model that incorporates well-known EEG feature extraction ideas for
BCI is therefore required.
Here, we introduce EEGNet, a compact CNN for classifying and interpreting EEG-based
BCIs. We use depthwise and separable convolutions, previously used in computer vision,
to build an EEG-specific network that covers several well-known EEG feature extraction
techniques, such as optimal spatial filtering and filter-bank construction, while
simultaneously reducing the number of trainable parameters compared to existing
approaches.
Figure 40 and Table 4 show a visualization and a full description of the EEGNet model
for EEG trials collected at a 1024 Hz sampling rate, where C is the number of channels,
T the number of time samples, F1 the number of temporal filters, D the depth multiplier
(number of spatial filters per temporal filter), F2 the number of pointwise filters,
and N the number of classes.
The network first learns frequency filters using a temporal convolution, then learns
frequency-specific spatial filters using a depthwise convolution applied to each
feature map independently. The separable convolution consists of a depthwise
convolution that learns a temporal summary for each feature map independently,
followed by a pointwise convolution that learns how to optimally mix the feature maps.
Table 4 below contains further information on the model's architecture.
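To make the parameter-saving argument concrete, the sketch below counts the weights of EEGNet-style temporal, depthwise, and separable convolutions for illustrative sizes (F1 = 8, D = 2, C = 128 channels, kernel length 16 — assumed values, not necessarily those of our trained configuration):

```python
# Assumed illustrative sizes (not necessarily our trained configuration).
F1, D, C, kernel_len = 8, 2, 128, 16
F2 = F1 * D  # number of pointwise filters

# Temporal convolution: F1 kernels of length kernel_len over one input channel.
temporal = F1 * kernel_len
# Depthwise spatial convolution: D spatial kernels of size (C, 1) per temporal map.
depthwise = F1 * D * C
# Separable convolution: one depthwise temporal kernel per map + F2 pointwise mixers.
separable = F2 * kernel_len + F2 * F2

total = temporal + depthwise + separable
print(temporal, depthwise, separable, total)
```

Because the depthwise and pointwise steps never form full dense kernels, the convolutional part stays in the low thousands of weights, far below a comparable standard CNN.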
Table 4 (excerpt, layer output shapes): Input: C × T; Reshape: 1 × C × T;
BatchNorm: F1 × C × T; Flatten: F2 × (T // 32).
Models in machine learning are only as valuable as their predictive ability; hence,
our fundamental goal is to build high-quality models with potential predictive power. We
will now look at ways of evaluating the quality of models created by our machine learning
and deep learning algorithms. In order to improve our model's overall predictive capacity,
we should evaluate our model's performance using a variety of metrics before we deploy it
on real data.
III.5.3.a: Accuracy
Accuracy is the proportion of correctly classified samples among the total number of
samples in the test set. We compute it as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (7)
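Equation (7) can be checked on toy labels (the labels below are illustrative, not our EEG predictions):

```python
# Toy labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)  # 6 of 8 samples correct -> 0.75
```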
The results show that the EEGNET and ConvNet algorithms offer the best outcomes.
However, looking at the range of accuracy values, we can easily see that our models
are not effective at discriminating between classes, since it is more difficult to
classify more than two classes.
Another way to look at this is in terms of decision boundaries. More classes mean
more boundaries. With four classes, there will generally be a boundary between every
pair of classes (six boundaries). The number of boundaries grows with the number of
classes, and because our work is in a high-dimensional space, many classes can be
adjacent to several others. This increases the amount of space close to a boundary,
and the number of data points close to a boundary whose position may not have been
estimated perfectly. Also, with fewer data points per class, the estimate of where
each boundary lies is less accurate.
To address this problem, we decided to take advantage of the benefits of binary
classification for our multi-class classification task by splitting the related dataset into
numerous binary classification datasets and training a binary classification model for each.
We then have four expert binary classifiers, each very good at recognizing one word
against all the others.
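The one-vs-rest split can be sketched as follows, assuming integer labels 0-3 stand for the four words (a hypothetical mapping):

```python
import numpy as np

y = np.array([0, 1, 2, 3, 1, 0, 2, 3])  # multiclass labels, one per trial

# One binary target vector per class: 1 for "this word", 0 for "any other word".
binary_targets = {c: (y == c).astype(int) for c in range(4)}
print(binary_targets[2])
```

Each of the four target vectors then trains its own binary classifier on the same features.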
As shown in the graph above, EEGNET and ConvNet have the highest accuracy for
all class labels. The remainder of the analysis concentrates on a single binary
classification task, the one for the class "UP", as the discussion applies equally
to the other binary classification tasks.
Model      Accuracy (%)
SVM        72   72
XGBoost    84   80
KNN        80   80
1D-CNN     61   63
EEGNet     86   84
ConvNet    87   85
True Positive (TP) refers to a sample belonging to the positive class being classified
correctly.
True Negative (TN) refers to a sample belonging to the negative class being classified
correctly.
False Positive (FP) refers to a sample belonging to the negative class but being
classified wrongly as belonging to the positive class.
False Negative (FN) refers to a sample belonging to the positive class but being
classified wrongly as belonging to the negative class.
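With scikit-learn, the confusion matrix holding these counts can be built in one call (toy labels for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```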
Below are the confusion matrices created from our models.
Figure 42: SVM confusion matrix. Figure 43: XGBoost confusion matrix.
Figure 44: KNN confusion matrix. Figure 45: CNN confusion matrix.
Figure 46: EEGNET confusion matrix. Figure 47: ConvNet confusion matrix.
As can be seen from the color-bar on the side, the number increases as the color
becomes darker. As a result, we would intuitively conclude that darker colors on diagonal
parts and lighter colors on the others indicate that our model is performing well, and vice
versa. Despite their high accuracy, the machine learning models are poor at identifying
the class in question. The ConvNet and EEGNET models produced the best results, as seen
in the figures; the darker colour of the diagonal element (1,1) implies that these
models perform really well for the class "UP".
We can evaluate the model more closely using the four different numbers from the
matrix. In general, we can get the following quantitative evaluation metrics from this binary
class confusion matrix:
Precision: The ratio of correct positive predictions to total predicted positives.
The Precision formula is as follows:

Precision = TP / (TP + FP)    (8)
Recall: The ratio of correct positive predictions to total actual positives. The
Recall formula is as follows:

Recall = TP / (TP + FN)    (9)
F1 Score: The weighted harmonic mean of precision and recall. The closer it is to one,
the better the model. The F1 Score formula is as follows:

F1 = 2 × Precision × Recall / (Precision + Recall)    (10)
Fortunately, when building a classification model in Python, we can use the sklearn
library's classification_report() function to generate all three of these metrics. The
classification reports for our three machine learning models (SVM, XGBoost, and KNN)
are provided below.
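A minimal sketch of that call, on toy labels rather than our EEG predictions (the function is `sklearn.metrics.classification_report`):

```python
from sklearn.metrics import classification_report

# Toy labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision, recall, and F1 per class, in one call.
print(classification_report(y_true, y_pred))
```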
Figure 48: SVM classification report. Figure 49: XGBoost classification report.
Figure 50: KNN classification report.
Model     Precision (%)   Recall (%)   F1-score (%)
          1      0        1      0     1      0
1D-CNN    61     68       68     60    64     64
EEGNet    76     94       96     67    85     78
ConvNet   77     75       78     74    78     75
Here, we can see that the XGBoost classifier does a great job at identifying the
"UP" class, with a precision of 0.98, but it has a very low recall (0.59), which means that
the model has a 59% chance of predicting "1" when the actual value is 1. It would be far
less problematic to classify the word as "0" when its true value is "1" than to classify it as
"1" when it actually belongs to another class.
That said, we aim for the best precision value while taking the recall value into
account. For the time being, EEGNET will be the model chosen for the first binary
classification. Let us now examine the evolution of our model's performance over
200 epochs.
For algorithms that learn progressively, such as deep neural networks, learning curves
are a frequently used diagnostic tool in machine learning. We evaluate model
performance throughout training on both the hold-out validation dataset and the
training dataset, and we plot this performance for each epoch.
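A sketch of the quantity we read off such curves, the per-epoch gap between training and validation accuracy (the numbers below are illustrative, not our actual training history):

```python
# Illustrative per-epoch accuracies (not our actual training history).
train_acc = [0.55, 0.70, 0.80, 0.84, 0.86, 0.86]
val_acc   = [0.52, 0.66, 0.76, 0.80, 0.83, 0.84]

# The train-validation gap per epoch: a small, stable gap suggests mild overfitting.
gaps = [round(t - v, 2) for t, v in zip(train_acc, val_acc)]
print(gaps)
```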
As we can see in the right-hand diagram, the accuracy increases rapidly in the first
twenty epochs, indicating that the network is learning fast. Afterwards, the curve
flattens, indicating that further epochs bring little additional improvement.
Furthermore, the accuracy curves exhibit only a small amount of overfitting, as
evidenced by the small gap between training and validation accuracy.
The loss curve shows that we have minimized overfitting, yet some still exists: while
overfitting can usually be reduced, it can rarely be eliminated completely while still
minimizing the loss.
III.5.3.e: Results
In this project, the goal was to classify an EEG signal into four different classes (4
different Spanish words). However, traditional multiclass classification methods were not
providing satisfactory results due to the high dimensionality of the input data. Therefore,
the decision was made to divide the multiclass classification problem into four separate
binary classification problems, each focused on distinguishing between two classes.
After conducting a single binary classification study, we concluded that EEGNet is
the best classifier for the binary classification problem at hand. The study compared
the performance of several popular classification algorithms, including SVM, KNN,
XGBoost, 1D CNN, and ConvNet, against EEGNet.
The results showed that EEGNet consistently outperformed the other classifiers in terms of
accuracy, precision, and recall, making it the most reliable and efficient option for binary
classification tasks. The success of EEGNet can be attributed to its ability to effectively
capture the spatial and temporal features of EEG signals, making it a powerful tool for
analyzing brain activity and detecting abnormalities or patterns in the data.
Finally, the input signal was passed through each of the four binary classifiers, and the
final classification decision was made based on the class with the highest probability.
Each binary classifier was properly trained and assessed individually to ensure accurate
results, and the overall performance of the classification model was monitored and
evaluated for fine-tuning. By employing this approach, we aimed to improve the accuracy
and efficiency of our classification model for EEG signal classification.
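The final decision rule can be sketched as follows (the probability values and word scores are hypothetical, for illustration):

```python
# One probability per binary classifier: P(trial belongs to that word).
scores = {"UP": 0.81, "DOWN": 0.12, "LEFT": 0.34, "RIGHT": 0.27}

# Final decision: the class whose binary classifier is most confident.
predicted = max(scores, key=scores.get)
print(predicted)
```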
Conclusion
We addressed the final key goal of our internship: presenting the inner speech
recognition use case. The problem was stated and explained at the beginning of this
chapter. Following that, we described our dataset and its acquisition process. In the
last section, we demonstrated the machine learning pipeline we created to classify the
signals, starting with the feature pre-processing phase and concluding with a
discussion of the modelling task's outcomes.
CONCLUSIONS AND PROSPECTS
The present work is part of my final-year project, realized in the company ML Basel
Architectures and directed by Mr. Bassem Ben Hamed and Mr. Christian Bock as technical
managers and Mrs. Sinda Belhadj Daoud as academic manager. As indicated in the report,
we started our work with a thorough bibliographic research covering the fundamentals
and advanced topics of ML for time series and its applications. The usage of machine
learning in the biomedical field was discussed along with its advantages, together
with a presentation of two use cases related to the topic in question.
Next, we walked through the use case and discussed related work, then dove into our
machine learning pipeline. Throughout the pre-processing, we applied Welch's feature
extraction method to retrieve meaningful features. With the aim of reducing the dimension
of the data, we have applied PCA to identify the optimal number of features to select and
RFE to determine these features. We then presented the six models used during the
modelling phase and their evaluation.
Our data are sufficient to perform binary classification tasks on short recordings of
brain signals from imagined speech with an accuracy of up to 87%. These results
promise good performance on similar datasets with a larger number of classes.
Time series data is ever-present in healthcare and offers an exciting opportunity for
machine learning methods to extract actionable insights about human health. However, there
is a significant gap between the existing literature on time series and what is ultimately
needed to make machine learning systems practical and deployable for healthcare. Indeed,
learning from time series for healthcare is notoriously difficult: the data can be
very high dimensional, and the feature extraction step is highly complex, partly
because of the difficulty of choosing which method to apply; moreover, the output of
these methods can be of even higher dimension.
Webography
[1] Christian Bock, Michael Moor, Catherine R. Jutzeler & Karsten Borgwardt. Machine
Learning for Biomedical Time Series Classification: From Shapelets to Deep Learning.
[03/2022]
[2] Christian Bock. Motifs and Manifolds: Statistical and Topological Machine Learning
for Characterising and Classifying Biomedical Time Series. [03/2022]
[3] Thuy T. Pham (2019). Applying Machine Learning for Automated Classification of
Biomedical Data in Subject-Independent Settings. [03/2022]
[4] Tiago H. Falk and Ervin Sejdic (Editors) (2018). Signal Processing and Machine
Learning for Biomedical Big Data, CRC Press, Taylor & Francis. [04/2022]
[5] Xiang-tian Yu, Lu Wang, and Tao Zeng (2018). Revisit of Machine Learning Supported
Biological and Biomedical Studies. In Tao Huang (Editor), Computational Systems
Biology, Methods in Molecular Biology 1754, Humana Press, pp. 183-204. [04/2022]
[6] Blank, S. C., Scott, S. K., Murphy, K., Warburton, E. & Wise, R. J. Speech
production: Wernicke, Broca and beyond. Brain 125, 1829-1838 (2002). [05/2022]
[7] Timmers, I., Jansma, B. M. & Rubio-Gozalbo, M. E. From mind to mouth: event
related potentials of sentence production in classic galactosemia. PLoS One 7, e52826
(2012). [06/2022]
[8] Hongzu Li et al. A Survey of Heart Anomaly Detection Using Ambulatory
Electrocardiogram (ECG). [06/2022]
[9] Torres Garcia et al. EEG Sonification for Classifying Unspoken Words (invasive
EEG). [07/2022]
[10] Jonathan Clayton, Scott Wellington, Cassia Valentini-Botinhao, Oliver Watts (The
University of Edinburgh, SpeakUnique Limited). Decoding imagined, heard, and spoken
speech: classification and regression of EEG using a 14-channel dry-contact mobile
headset. [07/2022]
[11] Xiaotong Gu, Zehong Cao, Alireza Jolfaei, Peng Xu, Dongrui Wu, Tzyy-Ping Jung,
and Chin-Teng Lin. [07/2022]
[12] Wei Bin Ng, A. Saidatul, Chong Y. F. and Z. Ibrahim. PSD-Based Features
Extraction for EEG Signal During Typing Task. [07/2022]
[13] German A. Pressel Coretto, Iván E. Gareis, and H. Leonardo Rufiner. Open Access
database of EEG signals recorded during imagined speech. [07/2022]
Résumé
The present project, carried out at Digital Innovation Partner, is part of a final-year
project for obtaining the national engineering diploma.
It consists of a complete study of time series analysis, the state of the art of
machine learning approaches for temporal data, a presentation of some of the most
common biomedical time series use cases, and finally the application of modelling to
the inner speech recognition use case.
Abstract
Keywords: time series, EEG, brain-computer interfaces, imagined speech, neural
decoding, machine learning, deep learning, Welch, CNN, ConvNet, and EEGNet.