I hereby authorize the student Ms. INES BOUSSELMI to submit her internship report
Signature
Dedications
First and foremost, I thank God (Allah), the Almighty, for bestowing His immeasurable
blessings on me at every step of my journey toward the successful completion of my
studies.
To My beloved mother Sihem
To the one person who has supported and pushed me since day one and has instilled in me
a passion for learning, words fail to describe my gratitude and love.
I always knew deep down that you wanted to see me achieve what life has denied you.
Here I am today making our dream come true.
To my beloved Father Fathi
To my hero and role model who made everything possible, I will always be most grateful
to you since everything you have done in life is for us.
To my siblings Nacer and Ameni
You are the essence of my life; I would not be me without you. I wish you all the success in
the world.
To my uncle Yassine
Your support and help have been critical to my success.
To Ms. Lilia Atig
Your considerable efforts to help me have shed light on a very dark time in my life and for
that, I owe you my deepest gratitude.
To my friends Yosr and Bchira
I would like to extend my thanks to you for all that you have done to guide and encourage
me during my internship.
I would like to express my gratitude to my teachers and mentors who have guided me
through my studies especially Mr. Zakaria Jaraya, Mrs. Ikram Laataoui, Mrs. Jalila
Mouelhi, and Mr. Raouf Ben Kileni.
Last, I wish to express my sincere thanks to all my family members and friends for their
timely help, moral support and encouragement.
Thanks
I would like to express my gratitude to all those who contributed to the success of this project.
I would like to thank you warmly for the guidance, unwavering support, and encouragement
that you have given me throughout my project. Your constant availability, your judicious
advice, your comments and corrections were invaluable to the success of this project. I hope
that I have lived up to your expectations.
I would like to thank you very sincerely for your pertinent advice and suggestions that guided
me during the phases of my internship. Your scientific and professional qualities helped me
to succeed in this project.
To all the managers and colleagues of Digital Solutions: Ms. Chantal Ebel, Mr. Christian
Bock, Mr. Moez Hanin, and Ms. Yosr Halleb
I would like to express my sincere gratitude for the welcome and the pleasant integration.
Your availability, your remarks, and your commitment allowed me to follow the right path
for the realization of this project.
I would also like to thank and express my deep respect to the members of the jury for
agreeing to evaluate this work.
I would also like to thank the pedagogical team of ESPRIT as well as the professional
instructors in charge of the training in the IT department, who gave me the theoretical
foundations necessary for this project.
Table of contents
II.6.2: ARIMA models ..........................................................................................25
II.6.3: SARIMA models ........................................................................................26
II.6.4: Topical methods of time series forecasting ..................................................27
II.7: TIME SERIES CLUSTERING .........................................................................28
II.7.1: HIERARCHICAL METHODS ...................................................................29
II.7.2: Partitioning Methods...................................................................................30
II.7.3: Model Based Methods ................................................................................31
II.7.4: Density-Based Methods ..............................................................................32
II.7.5: Deep Clustering Methods for Time-Series Data ..........................................33
II.8: Machine learning for biomedical science ..........................................................34
II.8.1: Machine learning tasks in healthcare ...........................................................35
II.8.2: Benefits of Machine Learning in Healthcare ...............................................35
II.8.3: Time series for biomedicine ........................................................................36
Chapter III: INNER SPEECH RECOGNITION USE CASE ........................................................40
III.1: Contextualization .............................................................................................40
III.2: Related work ....................................................................................................41
III.3: Data description................................................................................................41
III.3.1: Data acquisition ......................................................................................41
III.3.2: Data acquisition process ..........................................................................42
III.3.3: BCI interactive conditions .......................................................................44
III.4: Data Processing and Analysis ...........................................................................45
III.4.1: Feature engineering .................................................................................46
III.4.2: Feature extraction ....................................................................................47
III.4.3: Feature selection .....................................................................................48
III.5: Modeling ..........................................................................................................50
III.5.1: Machine learning models ........................................................................50
III.5.2: Deep Learning models.............................................................................53
III.5.3: Models Evaluation ..................................................................................58
Conclusions and Prospects ...............................................................................................66
Webography ....................................................................................................................68
Résumé ............................................................................................................................70
Abstract ...........................................................................................................................70
List of figures
Figure 1: Logo of SWISS DIGITAL NETWORK ..............................................................2
Figure 2: Logo of ML ARCHITECTS BASEL ..................................................................2
Figure 3: Scrum life-cycle ..................................................................................................4
Figure 4: Time trend in a time series graph ........................................................................9
Figure 5: Seasonal time series .......................................................................................... 10
Figure 6: Female hormone levels during the menstrual cycle............................................ 10
Figure 7: Decomposition of a noisy signal........................................................................11
Figure 8: Example of stationary time series ...................................................................... 12
Figure 9: Example of non-stationary time series ...............................................................12
Figure 10: Autocorrelation ...............................................................................................12
Figure 11: Time series analysis pipeline ........................................................................... 14
Figure 12: Euclidean mapping..........................................................................................15
Figure 13: DTW mapping ................................................................................................ 15
Figure 14: Methods for univariate vs multivariate time series data ................................... 18
Figure 15: Example of time series classification ...............................................................20
Figure 16: KNN and DTW classifier ................................................................................21
Figure 17: Example of a single shapelet ........................................................................... 22
Figure 18: Example of some python time series libraries ..................................................23
Figure 19: Explanation of the ARIMA abbreviation ......................................................... 26
Figure 20: A graph of a seasonal time series..................................................................... 27
Figure 21: Hierarchical clustering dendrogram .................................................................30
Figure 22: K-Means clustering algorithm graph ............................................................... 31
Figure 23: Model based clustering method plot illustration .............................................. 31
Figure 24: Density based clustering method plot illustration ............................................ 32
Figure 25: from SOM to T-DPSOM ................................................................................. 34
Figure 26: Single cell RNA seq analysis steps .................................................................. 37
Figure 27: Benefits of single-cell RNA sequencing for biological discoveries .................. 37
Figure 28: cell clustering analysis pipeline ....................................................................... 38
Figure 29: organization of the recording day for each participant ..................................... 43
Figure 30: trial workflow ................................................................................................. 44
Figure 31: EEG data processing pipeline .......................................................................... 45
Figure 32: A big picture of the idea of PCA algorithm. "Eigenstuffs" are eigenvalues and
eigenvectors. .................................................................................................................... 49
Figure 33: SVM hyperplane .............................................................................. 51
Figure 34: XGBoost illustration .......................................................................................51
Figure 35: KNN illustration .............................................................................................52
Figure 36: CNN architecture ............................................................................................ 54
Figure 37: Deep Convnet architecture ..............................................................................55
Figure 38: Overall visualization of the EEGNet architecture ............................................ 56
Figure 39: Model scores .................................................................................................. 58
Figure 40: Accuracy of binary classification models ........................................................59
Figure 41: Confusion matrix with two class labels ...........................................................60
Figure 42: SVM confusion matrix ....................................................................................61
Figure 43: XGBoost confusion matrix ..............................................................................61
Figure 44: KNN confusion matrix ....................................................................................61
Figure 45: CNN confusion matrix ....................................................................................61
Figure 46: EEGNET confusion matrix .............................................................................61
Figure 47: ConvNet confusion matrix .............................................................................. 61
Figure 48: SVM classification report................................................................................62
Figure 49: XGBoost classification report .........................................................................62
Figure 50: KNN classification report................................................................................62
Figure 51: Loss and accuracy curves of the EEGNET model ............................................64
List of tables
Table 1: Differences between stationary and non-stationary data ...................................... 11
Table 2: time series univariate and multivariate model examples...................................... 19
Table 3: methods for time domain vs frequency domain analysis ..................................... 19
Table 4: EEGNet architecture .......................................................................................... 57
Table 5: Train and validation accuracies of our models .................................................... 60
Table 6: Evaluation metrics for the DL models ................................................................ 63
Abbreviations and Acronyms
AI Artificial Intelligence
DL Deep Learning
EEG Electroencephalography
General introduction
Machine learning research has progressed to the point where specially designed
computers can outperform humans on challenging cognitive tasks. This has lately been
proven in a number of difficult fields, including self-driving vehicles, automatic language
translation, and strategic games.
Health care is one of the most pressing concerns: existing systems, equipment, and
techniques are straining to keep up with rising demand, and machine learning holds
significant promise here. Several converging factors are driving the creation of large-scale, complex
electronic data repositories in healthcare today.
Many biomedical data sets are available as time series, especially in the field of
public health and epidemiology, where indicators are usually collected over time. Clinical
studies with long follow-ups are also sometimes best analysed with time series methods. The
analysis of administrative health care data often gives rise to time series problems too, as
events are frequently converted to counts over a given interval. Finally, some biomedical
measurements may also be viewed as time series, such as EEG recordings.
In this context, our end-of-study project falls within the scope of acquiring the
national engineering diploma. The project's goal is to conduct a study on machine learning
techniques for time series data in the biomedical field.
This report wraps up the project's various tasks and stages. It is divided into three
chapters:
The first chapter will introduce the hosting organization and the project's setting.
We will also present the appropriate methodology for developing the project.
The second chapter will go through a comprehensive study of time series analysis
along with the state of the art of machine learning for biomedical time series.
The final chapter will describe the project's development environment and
execution steps. In this stage, we will present the EEG data processing pipeline.
Finally, we will wrap up our report with a general conclusion that summarizes the work
while also highlighting the project's contribution and potential extensions.
Chapter I: GENERAL PRESENTATION
Introduction
In this first chapter, we shall introduce the general context of the project in addition
to its main objectives. Furthermore, we will present the host company and its main areas of
activity as well as the approaches adopted in this work.
Swiss Digital Network (SDN) is the first independent and open advisory network that
cooperates with innovative IT providers to offer clients an efficient digital cloud
transformation journey. The network was created by senior consulting architects who
combine IT and emerging technologies to assist clients and customers in capitalizing on
innovative projects. The Swiss Digital Network is composed of four cells, each specialized
in key areas related to digital transformation. This project was carried out in the ML
Architects Basel unit.
For speech recognition, we use an open-access EEG dataset from a study conducted by
Nicolás Nieto et al. at Torcuato Di Tella University's Neuroscience Laboratory in
Argentina.
2. To deliver the state of the art in machine learning for biomedical time series data.
3. To test several machine learning models for inner speech classification using EEG.
methodology for the project at hand, we can increase our chances of success and deliver
valuable results.
In our situation, we opted to combine two project management methods: Scrum and CRISP-DM.
I.2.2.a: AGILE/SCRUM
This method is becoming more and more necessary because of the permanent
evaluations it allows, which are considered very useful and effective. Indeed, the SCRUM
method has several advantages: it improves productivity and communication within the
project; it is based on a fixed set of roles, responsibilities, and meetings that never change,
while ensuring flexible and adaptive project management.
SCRUM roles
The Agile philosophy is supported by a set of values, principles, and practices that are
the foundation of SCRUM artifacts.
● The Sprint Backlog is a real-time, highly visible view of the work that the Team plans
to accomplish during the Sprint.
● The Product Backlog is a kind of warehouse that contains all the features of the
product. The tasks must be ordered with discretion according to the priority in which
they must be carried out.
● The product increment is one of the most important SCRUM artifacts of the Agile
culture. During each Sprint, the development team makes a product increment.
During my internship, I was given biweekly goals to achieve. To ensure that I was on track, I
was asked to break down those objectives into daily objectives and send a calendar of my
daily objectives to my supervisor. Additionally, I had daily meetings with my supervisor to
review my progress towards these objectives. At the end of each sprint, there was a review
meeting with the entire team to discuss progress and identify areas for improvement. Finally,
there was a monthly meeting to review overall progress and discuss any challenges or
opportunities for growth. These processes helped to ensure that I was accountable and working
towards achieving my goals throughout my internship.
The combination of these two techniques allowed us to ensure that our project is successful,
with a focus on accomplishing the business objectives.
Conclusion
In this chapter, we presented the context of the project, the host organization, and the
objectives of the project.
We finished by deciding on the methodology that would be used throughout this
project. The following chapter is devoted to the completion of a comprehensive study of
time series analysis.
Chapter II: REVIEW OF LITERATURE
Introduction
In this chapter, we will tackle the different concepts and definitions in order to gain
an understanding of time series, its characteristics, and its main tasks.
Furthermore, we will get an overview of the current state of the art in machine learning for
time series analysis, particularly in the biomedical industry.
Time series data is a set of values gathered and ordered chronologically at regular
intervals. The interval at which data is gathered is known as the time series frequency.
What distinguishes time series data from other types of data is that the analysis may
illustrate how variables change over time. In other words, time is a critical variable since it
indicates how the data evolves over time and shapes the final results. It provides a
supplementary source of information as well as a predetermined sequence of data
dependencies.
In order to ensure consistency and reliability, time series analyses typically require a
large number of data points. A large data set guarantees that your sample size is
representative, and your analysis will be able to sift through any ambiguous data. It also
ensures that any trends or patterns discovered are not outliers and can account for seasonal
variance.
Definition of a time series data set: a data set D of n univariate, variable-length
time series is defined as follows:

D = { [(t_1, v_1), …, (t_{L_1}, v_{L_1})], …, [(t_1, v_1), …, (t_{L_n}, v_{L_n})] } = { T_1, …, T_n }   (1)

where each series T_i is a sequence of L_i (timestamp, value) pairs.
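In code, such a data set can be represented directly as a list of (timestamp, value) sequences; a minimal sketch (the values below are illustrative):

```python
# A data set D of n univariate, variable-length time series: each series
# T_i is a sequence of (timestamp, value) pairs, so lengths may differ.
T1 = [(0, 1.0), (1, 1.5), (2, 0.9)]
T2 = [(0, 2.1), (1, 2.4)]          # a shorter series
D = [T1, T2]

lengths = [len(T) for T in D]      # variable lengths: [3, 2]
```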
In this section, we'll go over time series characteristics and illustrate them with plots.
Observed values are plotted on the y-axis of a time series graph against a time increment on
the x-axis. These graphs can serve as the basis for creating a solid model by visually
highlighting the behavior and patterns of the data.
II.2.1: Trends
Time trends in time series data also have testing and modeling implications. A time
series model's reliability is dependent on properly identifying and accounting for time trends.
II.2.2: Seasonality
Seasonality is another time-series data feature that can be seen visually in time-series
plots. When time series data exhibits regular and predictable patterns at time intervals less
than a year, this is referred to as "seasonality."
Retail sales are an example of a time series with seasonality because they typically
increase from September to December and decrease from January to February.
II.2.5: Stationarity
When all statistical characteristics of a time series remain unchanged by time shifts,
the series is said to be stationary. In technical terms, strict stationarity implies that the joint
distribution of (y_t, …, y_{t+h}) depends only on the lag h and not on the time period t.
Strict stationarity is rarely required in time series analysis. This is not to imply
that stationarity does not play a role in time series analysis. Many time series models are
valid only under the assumption of weak stationarity (also known as covariance stationarity).
Weak stationarity, henceforth stationarity, requires only that:
● A series has the same finite unconditional mean and finite unconditional variance
over all time periods.
● The series autocovariance is time-independent.
Nonstationary time series are any data series that do not satisfy the weakly stationary
time series conditions.
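A crude way to probe these two conditions is to compare the mean and variance of the two halves of a series. The sketch below is only a heuristic with an illustrative tolerance; a formal unit-root test would be preferred in practice:

```python
import numpy as np

def looks_stationary(y, tol=0.5):
    """Crude weak-stationarity heuristic: compare the mean and variance
    of the first and second halves of the series. A formal test (e.g. an
    augmented Dickey-Fuller test) should be preferred in practice."""
    y = np.asarray(y, dtype=float)
    half = len(y) // 2
    a, b = y[:half], y[half:]
    mean_shift = abs(a.mean() - b.mean())
    var_ratio = max(a.var(), b.var()) / max(min(a.var(), b.var()), 1e-12)
    return bool(mean_shift < tol and var_ratio < 1 + tol)

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 500)            # stationary: white noise
trend = noise + np.linspace(0, 10, 500)  # non-stationary: the mean drifts
```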
Figure 8: Example of stationary time series
Figure 9: Example of non-stationary time series
II.2.6: Autocorrelation
The degree of similarity between a given time series and a lagged version of itself
over successive time intervals is referred to as "autocorrelation." In other words,
autocorrelation is used to measure the relationship between a variable's current value and
any previous values to which you have access.
For the sake of comparison, autocorrelation is essentially the same process that you would
go through when calculating the correlation between two different sets of time series values
on your own. The main distinction here is that autocorrelation employs the same time series
twice: once in its original values and again after a few different time periods have passed.
Serial correlation, time series correlation, and lagged correlation are all terms for
autocorrelation. Autocorrelation, in whatever form it is used, is an excellent method for
discovering trends and patterns in time series data that would otherwise go unnoticed.
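The lagged self-comparison described above can be sketched in a few lines of NumPy; the seasonal series below is illustrative:

```python
import numpy as np

def autocorr(y, lag):
    """Sample autocorrelation at a given lag: the correlation of the
    series with a copy of itself shifted by `lag` steps."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return float(np.dot(y[:-lag], y[lag:]) / np.dot(y, y))

t = np.arange(200)
seasonal = np.sin(2 * np.pi * t / 20)   # repeating pattern, period 20

r_full = autocorr(seasonal, 20)   # one full period apart: strong positive
r_half = autocorr(seasonal, 10)   # half a period apart: strong negative
```

The strong peak at the lag matching the period is exactly how autocorrelation exposes hidden repeating patterns.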
There are three major difficulties with time-series analysis. First, many techniques can
only accept input data in the form of a vector of features. Sequence data, regrettably, lack
explicit features. Second, selecting features can be challenging due to the high
dimensionality and expensive computation of the feature space. Third, creating a partitioning
task can be challenging in some applications because the raw data lacks explicit features.
To reduce dimensionality and provide representative features, feature extraction and
similarity measures must be used to handle raw time series data efficiently.
These difficulties prompted the development of the traditional time-series analysis
pipeline, which consists of three different viewpoints: time-series data, similarity metrics,
and feature extraction.
ED is a commonly used metric for time series. It is defined between two time series
X and Y of length L as the square root of the sum of the squared differences between each
pair of corresponding points. As a result, the two time series under comparison must be of
equal length, and the computational cost scales linearly with the length of the temporal
sequence. The distance
between the two time series is determined along the horizontal axis by matching the
corresponding points. The Euclidean distance metric is extremely susceptible to noise and
distortion and cannot cope with one of the series being compressed or stretched. This
method is therefore unreliable, particularly when comparing time series of different
durations.
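A minimal NumPy sketch of the ED computation (the two short series are illustrative):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length time series: the
    square root of the sum of squared point-wise differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    assert len(x) == len(y), "ED requires equal-length series"
    return float(np.sqrt(np.sum((x - y) ** 2)))

d = euclidean_distance([1, 2, 3], [1, 2, 5])   # sqrt(0 + 0 + 4) = 2.0
```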
For instance, we have two distinct, varying-length curves: red and blue.
The two curves follow the same pattern; however, the blue curve is longer than the red. If
we apply the one-to-one Euclidean match, the mapping is not perfectly synced up, and the
tail of the blue curve is left out.
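This mismatch is what DTW avoids: each point may map to several points of the other series. A minimal dynamic-programming sketch follows; the `red`/`blue` series are illustrative stand-ins for the two curves:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW. Unlike the Euclidean distance,
    the two series may have different lengths, since one point can be
    matched against several points of the other series."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch x
                                 cost[i, j - 1],      # stretch y
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# The blue curve repeats points of the red one; DTW aligns them exactly.
red = [1.0, 2.0, 3.0, 2.0, 1.0]
blue = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
d = dtw_distance(red, blue)   # 0.0: same shape, different lengths
```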
II.4.1.c: Correlation
II.4.1.d: Cross-correlation
Cross-correlating two signals produces a new signal whose peaks indicate the
similarity between the original signals; it is used as a distance
metric. However, cross-correlation can be carried out more efficiently in the frequency
domain. Autocorrelation occurs when the signal is correlated with itself, which is useful for
finding repeating patterns. Cross-correlation might be a slow operation in time-series space,
but it corresponds to point-wise multiplication in frequency space. It is also considered the
best distance measure to detect a known waveform in random noise. When processing the
signal, the correlation has a linear complexity in frequency space implementation, which
cannot be achieved by DTW.
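This frequency-domain shortcut can be sketched with NumPy: hide a known waveform in noise, then locate it both by direct cross-correlation and by point-wise multiplication of spectra (the signal and offset below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
waveform = np.array([0.0, 1.0, 2.0, 1.0, 0.0])   # known waveform
signal = rng.normal(0, 0.1, 100)                 # random noise
signal[40:45] += waveform                        # hide the waveform at offset 40

# Time-domain cross-correlation: the peak location reveals the offset.
cc = np.correlate(signal, waveform, mode="valid")
offset = int(np.argmax(cc))

# The same result via point-wise multiplication in frequency space
# (circular cross-correlation through the FFT).
n = len(signal)
spec = np.fft.rfft(signal) * np.conj(np.fft.rfft(waveform, n))
offset_fft = int(np.argmax(np.fft.irfft(spec, n)))
```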
II.4.2.c: K-grams
Transforming time-series data into a set of features does not fully reflect the series'
sequential nature. K-gram is an example of a feature-based approach that uses short sequence
segments of k consecutive symbols to retain the order of components in a series. For
time-series data, the k-gram approach represents a symbolic sequence as a feature vector
expressing the frequency of each k-gram from a given set.
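A k-gram feature vector can be sketched in plain Python; the symbolic sequence below is an illustrative stand-in for a discretised series:

```python
from collections import Counter

def kgram_features(symbols, k):
    """Frequency vector of all k-grams (length-k subsequences) in a
    symbolic sequence, preserving local ordering information."""
    grams = [tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
    return Counter(grams)

# e.g. a series discretised into the symbols 'a' and 'b'
feats = kgram_features("abab", 2)   # {('a','b'): 2, ('b','a'): 1}
```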
DFT is one of the most common transformation methods. It has been used to convert
original time series data into low-dimensionality time-frequency characteristics and index
them in order to perform an effective similarity search.
DFT is used to reduce dimensionality and extract features into an index that can be used for
similarity searching. This technique is constantly being improved, and some of its
shortcomings have been overcome.
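The coefficient-truncation idea behind DFT-based indexing can be sketched with NumPy's FFT (the signal below is illustrative):

```python
import numpy as np

def dft_features(y, k):
    """Keep only the first k DFT coefficients as a low-dimensional
    representation of a series: for smooth signals, most of the energy
    is concentrated in the low frequencies."""
    return np.fft.rfft(np.asarray(y, dtype=float))[:k]

t = np.linspace(0.0, 1.0, 256, endpoint=False)
y = np.sin(2 * np.pi * 3 * t)        # a single low-frequency component

coeffs = dft_features(y, 8)          # 8 complex features instead of 256 samples
dominant = int(np.argmax(np.abs(coeffs)))   # nearly all energy in bin 3
```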
This technique has also been used to transform the original time series and obtain
low-dimensional features that efficiently represent the original time series data.
Analysis tasks with a large set of time-series data face certain challenges in defining
matching features; thus, using wavelet decomposition to reduce data dimensionality is
beneficial. The discrete wavelet transform technique can be used to accurately perform the
analysis task.
II.4.2.f: Shapelets
There are two types of time series models: univariate and multivariate. Univariate
time series models are used when the dependent variable is a single time series. For
example, a model that predicts an individual's heart rate per minute using only past
observations of heart rate is univariate.
Multivariate time series models are used when there are multiple dependent
variables. Each series may rely on the past and present values of the other series in addition
to their own past and present values.
For modeling time series data, two broad approaches have emerged: the time-domain
approach and the frequency-domain approach.
The time-domain approach predicts future values based on past and present values.
The time series regression of a time series' present values on its past values and the past
values of other variables forms the basis of this approach. These regression estimates are
frequently used for forecasting, and this method is popular in time series econometrics.
The idea behind frequency domain models is that time series can be represented as a
function of time using sines and cosines. These are referred to as Fourier representations.
To model the behavior of the data, frequency domain models use regressions on sines and
cosines rather than past and present values.
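A frequency-domain regression of this kind can be sketched with NumPy: regress a synthetic, illustrative seasonal series on a sine and cosine at its seasonal period instead of on its past values:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(120)
# True seasonal signal (period 12, amplitude 3) plus noise.
y = 3.0 * np.cos(2 * np.pi * t / 12) + rng.normal(0, 0.3, 120)

# Fourier representation: regress y on cos and sin at the seasonal
# frequency rather than on its own past values.
X = np.column_stack([np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] should recover the amplitude 3.0; beta[1] should be near 0.
```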
There are numerous algorithms dedicated to time series classification; in this section, we
will introduce four types of time series classification algorithms.
1. Divide the series into random intervals with varying lengths and start positions.
2. Extract summary features (mean, standard deviation, and slope) from each interval
and concatenate them into a single feature vector.
3. Train a decision tree on the extracted features.
4. Repeat steps 1-3 until the required number of trees has been built or time runs
out.
A majority vote of all the trees in the forest is used to classify new series. (In a majority
vote, the class predicted by the most trees is the forest's prediction.)
Experiments have shown that time series forests outperform baseline competitors such as
nearest neighbours with dynamic time warping, while also being computationally efficient.
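Steps 1 and 2 of the recipe above can be sketched with NumPy; step 3 would then fit a standard decision tree (e.g. from a library such as scikit-learn) on the resulting vectors. The interval count and series below are illustrative:

```python
import numpy as np

def interval_features(y, n_intervals, rng):
    """Steps 1-2 of the time series forest recipe: draw random intervals
    of varying length and start position, and summarise each by its
    mean, standard deviation and slope."""
    y = np.asarray(y, dtype=float)
    feats = []
    for _ in range(n_intervals):
        start = int(rng.integers(0, len(y) - 3))
        length = int(rng.integers(3, len(y) - start + 1))
        seg = y[start:start + length]
        slope = np.polyfit(np.arange(length), seg, 1)[0]  # least-squares slope
        feats.extend([seg.mean(), seg.std(), slope])
    return np.array(feats)

rng = np.random.default_rng(3)
series = np.sin(np.linspace(0.0, 6.0, 100))
x = interval_features(series, n_intervals=4, rng=rng)  # 4 intervals x 3 features
```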
Frequency-based classifiers use frequency data extracted from series to train their
models. Random Interval Spectral Ensemble (RISE) is a popular variant of time series forest.
In two ways, RISE differs from time series forest. First, it employs a single time series
interval per tree. Second, rather than summary statistics, it is trained using spectral features
extracted from the series.
5. Ensemble 1–4
Class probabilities are calculated as a proportion of base classifier votes. RISE manages
the run time by constructing an adaptive model of the time required to build a single tree.
This is critical for long series (such as audio), where very large intervals can result in a small
number of trees.
A single "shapelet" is an interval in a time series. The intervals in any series can be
enumerated. For example, [1,2,3,4] has 5 intervals: [1,2], [2,3], [3,4], [1,2,3], and [2,3,4].
Shapelet-based classifiers search for shapelets with discriminatory power. These shapelet
features can then be used to interpret a shapelet-based classifier. The presence of certain
shapelets increases the likelihood of one class over another. The Shapelet Transform
Classifier begins by identifying the top k shapelets in the dataset. The new dataset's k features
are then computed. Each feature is calculated as the series' distance from each of the k
shapelets, with one column for each shapelet. Finally, the shapelet-transformed dataset can
be subjected to any vector-based classification algorithm.
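The per-shapelet feature described above (the distance from a series to one shapelet) can be sketched with NumPy; the shapelet and series below are illustrative:

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Distance from a series to a shapelet: the minimum Euclidean
    distance over all windows of the series with the shapelet's length."""
    series = np.asarray(series, dtype=float)
    s = np.asarray(shapelet, dtype=float)
    m = len(s)
    return float(min(np.linalg.norm(series[i:i + m] - s)
                     for i in range(len(series) - m + 1)))

# One column of the shapelet-transformed dataset: the distance of each
# series to one discriminative shapelet.
shapelet = [0.0, 1.0, 0.0]                      # a "spike" shape
has_spike = [0.0, 0.0, 1.0, 0.0, 0.0]
flat = [0.0, 0.0, 0.0, 0.0, 0.0]

d_spike = shapelet_distance(has_spike, shapelet)  # 0.0: shapelet present
d_flat = shapelet_distance(flat, shapelet)        # 1.0: no window contains it
```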
There are more great Python time series libraries to discover, such as sktime,
tslearn, tsfresh, Prophet, and pyts. Each of these libraries has a unique approach to
dealing with time series learning problems such as regression, classification, and forecasting.
Their differences lie in the methodologies they employ, which range from standard statistical
approaches (e.g., ARIMA) to dynamic time series warping, symbolic time series
approximations, and others.
In this section, I will discuss two popular machine learning approaches for time series
classification: ROCKET and deep learning.
Deep learning approaches, on the other hand, tend to borrow or modify structures
often related to computer vision and natural language processing (NLP). People are
experimenting with ResNets, Transformers, LSTMs, CNNs, Temporal Convolutional
Networks, Wavelet-based approaches, and other combinations of these methods. These can
take a lot longer to train, not to mention the time spent on hyperparameter tuning. Still, it
may be worthwhile to investigate how deep learning models perform on your data.
Tsai is an open-source deep learning package built on PyTorch and Fastai that
focuses on state-of-the-art approaches for time series problems such as classification,
regression, forecasting, and imputation. So, what does Tsai have to offer? It includes a
variety of deep learning architectures built with the PyTorch and Fastai libraries, as well as
ROCKET and MiniROCKET classification and regression models.
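To give a flavour of what ROCKET does, here is a rough NumPy sketch of its core idea: convolve the series with many random kernels and pool each convolution into a couple of features. The real algorithm also randomizes dilation, padding, and bias, so this is only an illustration:

```python
import numpy as np

def rocket_features(series, n_kernels=100, seed=0):
    """ROCKET-style transform (simplified): random convolutional kernels,
    each pooled into two features, the maximum and the proportion of
    positive values (PPV)."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.normal(0.0, 1.0, length)
        conv = np.convolve(series, weights, mode="valid")
        feats.append(conv.max())           # max pooling
        feats.append((conv > 0).mean())    # PPV pooling
    return np.array(feats)

x = np.sin(np.linspace(0, 10, 200))
f = rocket_features(x)
print(f.shape)  # two features per kernel -> (200,)
```

A fast linear classifier (e.g. ridge regression) is then trained on these features, which is what makes ROCKET so cheap compared with deep networks.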
While time series analysis focuses on understanding the dataset, forecasting focuses
on predicting it. Time series analysis comprises methods for analyzing time series data in
order to extract meaningful statistics and other characteristics of the data. Time series
forecasting is the use of a model to predict future values based on previously observed
values.
In other words, time series forecasting is a technique for predicting events over a
period of time. It forecasts future events by analyzing historical trends, with the assumption
that future trends will be similar to historical trends.
Autoregression is a time series model that predicts the value at the next time step by
using observations from previous time steps as input to a regression equation. In
autoregressive models, we assume a linear relationship between the value of a variable at
time t and the values of the same variable at the past times t − 1, t − 2, ..., t − p:

y_t = c + β_1·y_(t−1) + β_2·y_(t−2) + ⋯ + β_p·y_(t−p) + ε_t   (3)

Here, p denotes the autoregressive model's lag order, i.e. the number of lagged
observations included in the regression.
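The coefficients of equation (3) can be estimated by ordinary least squares on lagged copies of the series; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model y_t = c + sum_i beta_i * y_(t-i) + eps_t
    by ordinary least squares."""
    y = np.asarray(y, dtype=float)
    # design matrix: a constant column plus the p lagged columns
    X = np.column_stack([np.ones(len(y) - p)] +
                        [y[p - i:len(y) - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, beta_1, ..., beta_p]

# A noiseless AR(1) process y_t = -0.8 * y_(t-1) is recovered exactly.
y = [1.0]
for _ in range(40):
    y.append(-0.8 * y[-1])
c, b1 = fit_ar(y, 1)
```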
For an ARIMA model,
● The integrated component refers to the use of data transformation to make the data
stationary by subtracting past values of a variable from current values of a variable.
● The moving average component denotes the relationship between the dependent
variable and previous values of a stochastic term.
The order of these components is used to describe an ARIMA model, denoted by the
notation ARIMA(p, d, q), where:
● p is the order of the autoregressive component (the number of lag observations
included in the model);
● d is the degree of differencing (the number of times the raw observations are
differenced);
● q is the order of the moving average component (the size of the moving average
window).
The Box-Jenkins method for estimating ARIMA models is made up of several steps:
model identification (choosing p, d, and q), parameter estimation, and diagnostic
checking of the residuals.
How do we know we should use the seasonal ARIMA (SARIMA) model? The series
plotted above shows a very clear W-shaped pattern repeating, so we clearly have
seasonality.
In SARIMA(P, D, Q)m, m is the seasonal factor: it denotes the number of time steps
in a single seasonal period. If, as in the graph above, each year is divided into four
quarters, m would be 4. Apart from applying to the seasonal components, (P, D, Q) are
the analogues of (p, d, q) in the ARIMA model.
Although the analysis of image datasets is considered their main field of application,
convolutional neural networks can show even better results than RNNs in time series
prediction cases involving other types of spatial data. For one thing, they learn faster,
boosting the overall data processing performance. However, CNNs can also be combined
with RNNs to get the best of both worlds: the CNN recognizes spatial patterns and passes
them to an RNN that models the temporal dependencies.
II.6.4.c: LightGBM
This is a widely used ML algorithm that is mostly focused on capturing complex patterns
within tabular datasets. In some cases, LightGBM outperforms the traditional ARIMA
approach when it comes to making tabular-based predictions.
ML-based decision trees are used to classify items in the database. Generated classes get
dedicated multivariate time series models that help predict the future price of a certain item.
II.6.4.e: XGBoost
This is a machine learning algorithm that works with tabular and structured data. At its
core lie gradient-boosted decision trees.
II.6.4.f: AdaBoost
Several algorithms have been improved to deal with time-series data. Most works involving
time series clustering fall into one of three categories.
The first is whole time-series clustering, in which a set of individual time series is
given and the goal is to group similar time series into clusters based on their similarity.
The third category is a grouping of time points based on their temporal proximity and
similarity of corresponding values. Some points may not be assigned to any clusters and are
thus classified as noise.
The choice of distance measure is critical for most time-series analysis techniques,
including clustering. The choice of distance measure is widely regarded as more important
than the clustering algorithm itself. The choice of feature extraction technique also has a
significant impact on the quality of clustering methods.
As a result, time-series clustering relies primarily on traditional clustering methods,
either by replacing the default distance measure with one more appropriate for time series or
by transforming time series into "flat" data so that existing clustering algorithms can be used
directly.
Next, we will be presenting various types of methods and clustering algorithms used
for time-series data.
Hierarchical clustering defines a tree structure for unlabeled data by aggregating data
samples into a tree of clusters. Unlike k-means, this method does not assume a value for K.
Hierarchical clustering methods are classified into two types: agglomerative (bottom-up) and
divisive (top-down).
Hierarchical clustering is typically accomplished by sequentially merging similar
clusters, as illustrated in the figure below. This is referred to as agglomerative hierarchical
clustering. In theory, it is also possible to accomplish this by first grouping all of the
observations into a single cluster and then successively splitting these clusters. This is
referred to as divisive hierarchical clustering. In practice, divisive clustering is rarely used.
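The agglomerative procedure can be sketched in plain Python, here with single-linkage merging on one-dimensional points (an illustration, not a production implementation):

```python
def agglomerative(points, k):
    """Bottom-up hierarchical clustering: start with one cluster per point
    and repeatedly merge the two closest clusters (single linkage) until
    only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance of the closest pair across clusters
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

print(agglomerative([1.0, 1.2, 5.0, 5.1, 5.3], k=2))
```

Recording the sequence of merges instead of stopping at k clusters yields the full tree (dendrogram).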
Density-based clustering connects areas of high density into clusters. This permits
clusters of any shape, as long as their dense regions can be connected. High-dimensional
data and data with varying densities present challenges for these algorithms. Furthermore,
these algorithms are not intended to assign outliers to clusters; objects in sparse areas are
usually considered to be noise or border points.
Density-based clustering for time-series data has some advantages; it is a fast
algorithm that does not require pre-setting the number of clusters, is able to detect arbitrary
shaped clusters as well as outliers, and uses easily comprehensible parameters such as spatial
closeness.
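These properties can be illustrated with a minimal sketch in the spirit of DBSCAN (one-dimensional points and simplified border-point handling; eps and min_pts are the usual spatial-closeness parameters):

```python
def dbscan(points, eps, min_pts):
    """Density-based clustering sketch: core points (at least min_pts
    neighbours within eps, counting themselves) are expanded into clusters;
    points reachable from no core point are labelled noise (-1)."""
    n = len(points)
    labels = [None] * n
    neighbours = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
                  for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] is not None or len(neighbours[i]) < min_pts:
            continue
        labels[i] = cluster               # grow a new cluster from this core
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] is None:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:   # j is itself a core point
                    frontier.extend(neighbours[j])
        cluster += 1
    return [-1 if lab is None else lab for lab in labels]

labels = dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 99.0], eps=0.3, min_pts=2)
print(labels)  # two dense groups; the isolated point is noise
```

Note that the number of clusters is discovered from the data, not pre-set, and the isolated point is reported as noise rather than forced into a cluster.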
Last but not least, there is the DENCLUE algorithm, which uses kernel density
estimation to estimate the unknown probability density function of the random variable
that generated the data sample.
The Self-Organizing Map (SOM), on the other hand, is a clustering method that
provides such an interpretable representation. By inducing a flexible neighborhood structure
over the clusters, it generates a low-dimensional (typically 2-dimensional) discretized
representation of the input space. Unfortunately, its performance is strongly dependent on
the complexity of the data sets used, and, like other classical clustering methods, it typically
performs poorly on complex, high-dimensional data. While the SOM is extremely useful for
data visualization, only a few methods have attempted to combine it with DNNs. To address
the aforementioned issues, we propose Probabilistic SOM (PSOM), a novel method of fitting
SOMs with probabilistic cluster assignments. We also extend this PSOM to a deep
architecture, the Deep Probabilistic SOM (DPSOM), which trains a VAE and a PSOM
simultaneously to achieve an interpretable discrete representation while exhibiting cutting-
edge clustering performance.
Instead of assigning data points to specific clusters, our model employs centroid-
based probability distributions. It reduces their Kullback-Leibler divergence with respect to
auxiliary target distributions while enforcing a SOM-friendly space.
Deep clustering:
Recent clustering analysis work has shown that using deep neural networks (DNNs)
in conjunction with clustering algorithms significantly improves clustering performance. In
that case, DNNs are used to embed the data set into a space that is better suited for clustering.
Similarly, DCN combines a k-means clustering loss with the reconstruction loss of
SAE to produce an end-to-end architecture that trains representations and clusters
concurrently. These models achieve state-of-the-art clustering performance, but they do not
investigate the relationship among clusters.
Biomedical science is also an industry that evolves with the times. With the amount
of data generated for each patient, machine learning algorithms in biomedicine have great
potential. It is no surprise, then, that there are numerous successful machine learning
applications in healthcare right now.
From the large-scale analysis of genomic data advancing personalized medicine to
the solving of a 50-year-old challenge in biology by predicting protein folding from amino
acid sequences, there’s no doubt machine learning is enabling breakthroughs that are shaping
the future of biomedical research.
Machine learning techniques can be applied to solve a wide variety of tasks. When it
comes to applications of machine learning in biomedicine, these tasks include:
● Classification: can help to determine and label the kind of disease or medical case
you’re dealing with;
● Recommendations: can offer necessary medical information without the need to
actively search for it;
● Prediction: using current data and common trends, machine learning can make a
prognosis on how the future events will unfold;
● Clustering: can help to group together similar medical cases to analyse the patterns
and conduct research in the future;
● Anomaly detection: Using machine learning in healthcare, you can identify things
that deviate from common patterns and determine whether any actions are required.
● Automation: machine learning can handle standard repetitive tasks that take too
much time and effort from doctors and patients, like data entry, appointment
scheduling, inventory management, etc.;
● Ranking: machine learning can put the relevant information first, making the search
for it easier.
The increase in diagnostic accuracy is the second important role of machine learning
in healthcare. Machine learning, for example, has been shown to be 92% accurate in
predicting the mortality of COVID-19 patients.
Third, applying machine learning to medicine can aid in the development of a more
precise treatment plan. A lot of medical cases are unique and require a special approach for
effective care and side-effect reduction. Machine learning algorithms can simplify the search
for such solutions.
Machine learning was designed to deal with large data sets, and patient files are
exactly that: a large number of data points that require careful analysis and organization.
Another reason for using machine learning techniques in healthcare is that they
eliminate human involvement to some degree, which reduces the possibility of human error.
This especially concerns process automation tasks, as tedious routine work is where humans
err the most.
We will now concentrate on time series in medicine, because medicine is essentially a
time series problem, one that clinical professionals frequently deal with.
In this section, I will discuss a time series clustering use case, cell clustering analysis for
single cell sequencing, and Zhuo Wang et al.'s study.
Single-cell RNA sequencing is a technique that extracts RNA from all cells and
quantifies the sequenced RNA as well as the expression for each cell, providing us with
granular resolution expression profiles at the cellular level and allowing us to compare
expression between cells.
The development of single-cell RNA sequencing has allowed for profound biological
discoveries ranging from the dissection of complex tissue composition to the identification
of novel cell types and dynamics in some specialized cellular environments.
The study presents an algorithm based on the Dynamic Time Warping score
(DTWscore) combined with time-series data that enables the detection of gene expression
changes across scRNA-seq samples and the recovery of potential cell types from complex
mixtures of multiple cell types.
The method pipeline is described and illustrated in the figure below. To begin,
perform a traditional filter step to remove low-quality cells. Second, calculate the mean
DTW distance between all pairs of cells as an index for detecting a specific set of genes for
heterogeneity analysis. To reduce the bias toward extreme values, we must normalize the
DTW distance index values. Following normalization, the genes with the highest
DTWscores are selected for further study and are referred to as the most significantly
highly variable genes. The output can then be used to categorize the various types of cells.
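The DTW distance that underlies the DTWscore can be computed with the classic dynamic-programming recurrence; a minimal sketch:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences, using the
    classic dynamic-programming recurrence with absolute-difference cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # insertion
                                 D[i][j - 1],       # deletion
                                 D[i - 1][j - 1])   # match
    return D[n][m]

# A time-shifted copy of an expression profile stays close under DTW,
# even though it is far away pointwise.
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1]))
```

This tolerance to temporal misalignment is exactly why DTW suits expression profiles whose dynamics are similar but not synchronized.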
In this section, I will present a time series anomaly detection use case, ECG anomaly
detection use case, and a study that was conducted by Hongzu Li et al. using heartbeats,
which is one of the important vital signs that is typically collected in such a medical sensor
monitoring system.
Automatic collection of vital sign data enables remote medical monitoring and
diagnosis for improved energy efficiency. These real-time streams can be processed by
intermediate storage nodes to detect any anomalies. Once identified, only the abnormal data
must be sent to the physician for further diagnosis, while the rest of the normal data can be
archived at the local storage nodes. An anomaly detection scheme is required to determine
whether a real-time sensor data stream contains abnormal data.
Conclusion
This chapter provided an overview of the characteristics of time series and their
various tasks. We focused on three fundamental tasks, namely time series classification,
forecasting and clustering. We conclude by discussing two biomedical time series
applications.
Chapter III: INNER SPEECH RECOGNITION
USE CASE
Introduction
In the previous part, we have provided an overview of time series analysis and ML
with biomedical time series. In this part, we address the last key goal of our internship,
which is to present the inner speech recognition use case. We begin in this chapter by stating
and further clarifying the problem we are trying to solve. We then examine the relevant
literature and the available remedies, focusing on their flaws and limitations. Next, in this
chapter, we begin to describe our dataset and its acquisition process. Finally, we will present
the ML pipeline we followed in order to create a model that is able to classify EEG signals.
III.1: Contextualization
Neural engineering research has made tremendous strides in decoding motor or visual
neural signals in order to assist and restore lost function in patients with disabling
neurological diseases. The development of assistive devices that restore natural
communication in patients with intact language systems but limited verbal communication
due to a neurological disorder is an important extension of these approaches. Several brain-
computer interfaces have enabled relevant communication applications, such as moving a
cursor on the screen and spelling letters.
Although this kind of interface has been found to be helpful, patients have had to learn
to modulate their brain activity in an unnatural and counterintuitive manner, i.e., performing
mental tasks like spinning a cube, making calculations in their heads, moving in order to
operate an interface, or identifying letters presented quickly on a screen, as in the P300-
speller.
A communication system that can directly infer inner speech from brain signals would
be advantageous for people with speech impairments as a substitute, enabling them to
communicate with the outside world in a more natural way. Inner speech, also known as
imagined speech, hidden speech, silent speech, speech imagery, or verbal thought, refers
to the capacity to produce inner speech representations in the absence of external speech
stimulation or of internally produced overt speech.
People with paralysis may be able to type sentences letter by letter at up to 10 words
per minute with the aid of assistive devices, but that is a far cry from the 150 words per
minute average of everyday conversation.
The data we will be working with comes from a study conducted by Nicolás Nieto,
Hugo Leonardo Rufiner, Victoria Peterson, and Ruben Spies at Torcuato Di Tella
University's Neuroscience Laboratory in Argentina. The data is a multi-speech-related BCI
dataset consisting of EEG recordings with 128 active EEG channels and 8 external active
EOG/EMG channels having a 24 bits resolution and a sampling rate of 1024 Hz, from ten
naïve BCI users, performing four mental tasks in three different conditions: inner speech,
pronounced speech, and visualized condition. In a single day of recording, each participant
completed between 475 and 570 trials, yielding a dataset with more than 9 hours of nonstop
EEG data collection and more than 5600 trials.
The participants are ten healthy right-handed individuals with a mean age of 34
(standard deviation: 10 years), four females and six males. None of the participants have any
speech or hearing impairments, and none have any neurological, motor, or psychiatric
disorders.
As depicted in Fig., each subject took part in a single recording day that included
three separate sessions. To avoid boredom and fatigue between sessions, a self-selected
break period (inter-session break) was provided. Each session began with a baseline of
fifteen seconds, during which the participant was instructed to unwind and maintain as much
stillness as possible.
Within each session, five stimulation runs were presented. Those runs match the
various conditions that have been put forth, including the pronounced speech, inner speech,
and visualized conditions. At the beginning of each run, the condition was announced on the
computer screen for a period of 3 seconds. In all cases, the order of the runs was: one
pronounced speech, two inner speeches, and two visualized conditions. Runs were separated
by an inter-run break of one minute.
The classes were specifically selected taking into account a natural BCI control
application, using the Spanish words "arriba", "abajo", "derecha", and "izquierda" (i.e.
"up", "down", "right", and "left", respectively). The trial's class (word) was chosen at
random. In the first and second sessions, each participant had 200 trials. Nonetheless,
depending on their willingness and fatigue, not all participants completed the same number
of trials in the third session.
Figure 29 depicts the trial composition as well as the relative and cumulative times.
Each trial began at time t = 0 seconds and had a concentration interval of 0.5 seconds. A new
visual cue would be presented to the participant shortly. The participant was instructed to
maintain a fixed gaze on a white circle that appeared in the center of the screen and not blink
until the trial's conclusion. The cue interval started at time t = 0.5 seconds. A white triangle
with an arrow pointing in one of four directions was displayed. The direction of the cue
pointing corresponded to each class. After 0.5 seconds, or at t = 1 second, the triangle
disappeared from the screen, at which point the action interval began. As soon as the visual
cues vanished and the white circle appeared on the screen, participants were instructed to
start completing the indicated task. The white circle turned blue and the relaxation interval
started after 2.5 seconds of the action interval, or at t = 3.5 seconds. The participant was told
in advance to stop the activity at this point but not to blink until the blue circle vanished. At
t = 4.5 seconds, the blue circle disappeared, signifying that the trial was over. A rest interval,
varying in length from 1.5 seconds to 2 seconds, was allowed between trials.
The dataset was designed with the primary goals of decoding and understanding the
processes involved in the generation of inner speech, as well as analyzing its potential use
in BCI applications, in mind. As described in the “Background & Summary” Section, the
generation of inner speech involves several complex neural network interactions. To localize
the main activation sources and analyse their connections, we asked the participants to
perform the experiment under three different conditions: inner speech, pronounced speech,
and visualized condition.
Inner speech condition is the primary condition of the dataset, and it seeks to identify
the electrical activity in a participant's brain associated with their thought about a specific
word. During the inner speech runs, participants were instructed to imagine their voice as if
they were giving a direct order to the computer, repeating the corresponding word until the
white circle turned blue. Each participant was explicitly instructed not to concentrate on the
articulation gestures. In addition, each participant was instructed to remain as still as
possible, with no movement of the mouth or tongue. For the sake of natural imagination, no
rhythm cue was provided.
Although motor activity is mainly related to the imagined speech paradigm, inner
speech may also show activity in the motor regions. The pronounced speech condition was
proposed to identify motor regions involved in pronunciation that matched those activated
during the inner speech condition. During the pronounced speech runs, each participant was
instructed to repeat aloud the word corresponding to each visual cue, as if giving a direct
order to the computer. No rhythm cue was provided, as was the case with the inner speech
runs.
This condition was proposed because the selected words have a high visual and
spatial component, and with the goal of finding any activity related to that being produced
during inner speech. Participants in the visualized condition runs were instructed to
concentrate on mentally moving the circle in the center of the screen in the direction
indicated by the visual cue.
As indicated in Figure 31, the majority of recent applications follow a consistent path
for EEG data processing. Raw EEG data are preprocessed primarily to remove artifacts and
noise. Then, pertinent characteristics of brain activity are retrieved, and these characteristics
are classified to define a mental state.
The preprocessing phase may include signal acquisition, artifact removal, averaging,
thresholding of the output, signal augmentation, and edge detection. The elimination of
artifacts is the most crucial phase in this stage and many other signal processing applications.
There are several sources of artifacts in raw EEG signal recordings. They are disruptions
that can arise during signal collection and affect the interpretation of the signals themselves.
If noise is not appropriately addressed, it might have a negative impact on the useful
characteristics of the original signal. Muscular activity, eye blinking during the signal
collecting operation, and power line electrical noise might be causes of artifacts. Thus, a
transformation procedure was created to restructure the continuous raw data into a more
compact dataset and to make their use easier. Such processing was performed in Python,
primarily with the MNE library. A function was created that allows for the rapid loading of
raw data corresponding to a specific participant and session.
The first step in the signal processing procedure was to ensure that the events in the
signals were correctly tagged. Missing tags were identified, and a method for correcting
them was proposed. Because the BioSemi acquisition system is "reference-free," the
Common-Mode (CM) voltage is recorded in all channels, necessitating a re-reference step.
This procedure was carried out using the MNE reference function and channels EXG1 and
EXG2. This step removes the CM voltage and aids in the reduction of line noise (50 Hz) and
body potential drifts. A zero-phase bandpass finite impulse response filter was used to filter
the data, with the lower and upper cutoff frequencies set to 0.5 Hz and 100 Hz,
respectively. A 50 Hz notch filter was also used. The data was decimated by a factor of
four, resulting in a final sampling rate of 256 Hz. The [channels × samples] matrices
corresponding to each trial were stacked into a final tensor of size
[trials × channels × samples].
The continuously recorded data were then extracted, retaining only the 4.5 s signals
corresponding to the time window between the start of the concentration interval and the end
of the relaxation interval.
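The restructuring described above can be sketched with NumPy as follows (the function name, event onsets, and fake data are illustrative; the actual processing was done with the MNE library):

```python
import numpy as np

def epoch(data, onsets, sfreq, t_len=4.5):
    """Cut a continuous [channels x samples] recording into fixed-length
    trials and stack them into a [trials x channels x samples] tensor."""
    n_samp = int(t_len * sfreq)
    trials = [data[:, s:s + n_samp] for s in onsets]
    return np.stack(trials)

rng = np.random.default_rng(0)
raw = rng.normal(size=(8, 10_000))              # 8 channels of fake EEG
tensor = epoch(raw, onsets=[0, 2000, 4000], sfreq=256)
print(tensor.shape)  # (3 trials, 8 channels, 1152 samples)
```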
The art of creating useful features from existing data is known as "feature
engineering." It entails transforming data into forms that are more closely related to the
underlying target to be learned. When done correctly, feature engineering can add value to
your existing data while also improving the performance of your machine learning models.
This phase can be broken down into two steps: first, extract more significant features
from the original input data; second, choose the best features from the new dataset.
There are three primary information sources that may be derived from EEG readings:
spatial information (for multichannel EEG), spectral information (power in frequency
bands), and temporal information (time windows-based analysis).
Fourier analysis is a common signal processing technique for transforming data from
the time domain to the frequency domain or vice versa. Both continuous and discrete
temporal signals are applicable to this technique. It is based on the premise that every signal
may be approximated or represented by the sum of trigonometric functions. FFT is a method
that calculates the Discrete Fourier Transform (DFT) or inverse of a sequence. It yields the
exact same result as evaluating the DFT definition directly, but considerably more quickly.
X_k = Σ_{n=0}^{N−1} x_n · e^{−2πikn/N},   k = 0, …, N − 1   (4)
Where:
X_k = the k-th coefficient of the DFT of x
x_n = the n-th sample of the input sequence
N = the number of samples in the sequence
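Equation (4) can be evaluated directly and checked against NumPy's FFT, which computes the same quantity far faster:

```python
import numpy as np

def dft(x):
    """Direct evaluation of the DFT definition:
    X_k = sum_n x_n * exp(-2*pi*i*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

x = np.array([1.0, 2.0, 1.0, -1.0])
# The FFT yields the exact same coefficients in O(N log N) time.
assert np.allclose(dft(x), np.fft.fft(x))
```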
In this work, the power spectral density (PSD) was employed as the feature extraction
approach. PSD is a typical signal processing approach that displays the signal's energy as a
function of frequency by distributing its power over frequency. We have employed the
Welch technique in accordance with the PSD.
The Welch technique is a modified segmentation scheme used to assess the average
periodogram. In general, the Welch technique of the PSD may be expressed by the following
equations. The periodogram of the i-th windowed segment of the signal is first defined:

P_i(f) = (1 / (M·U)) · | Σ_{n=0}^{M−1} x_i(n) · w(n) · e^{−j2πfn} |²   (5)

The Welch power spectrum is then the average of these periodograms over the L segments:

P_Welch(f) = (1 / L) · Σ_{i=1}^{L} P_i(f)   (6)

where M is the segment length, w(n) is the window function, and U is a normalization
factor for the window's power.
After this procedure, each signal instance of the trial will be converted into a feature
vector of size 1 × m where m is the number of features extracted. The final dataset is a matrix
of shape n × m where n is the number of trials for all the subjects.
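A simplified NumPy sketch of equations (5) and (6), using non-overlapping Hann-windowed segments (Welch's method as usually implemented also overlaps segments by 50%):

```python
import numpy as np

def welch_psd(x, fs, seg_len=256):
    """Welch PSD estimate: window each segment, take its periodogram
    (equation (5)), and average the periodograms (equation (6))."""
    w = np.hanning(seg_len)
    U = np.sum(w ** 2)                  # normalization for the window power
    segs = [x[i:i + seg_len]
            for i in range(0, len(x) - seg_len + 1, seg_len)]
    periodograms = [np.abs(np.fft.rfft(s * w)) ** 2 / (U * fs) for s in segs]
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)
    return freqs, np.mean(periodograms, axis=0)   # average over L segments

fs = 1024.0
t = np.arange(8192) / fs
x = np.sin(2 * np.pi * 48.0 * t)        # a pure 48 Hz tone
freqs, psd = welch_psd(x, fs)
peak = freqs[np.argmax(psd)]            # the PSD peaks at the tone frequency
```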
III.4.3.a: PCA
PCA is a straightforward method for reducing the number of variables
in a data collection while retaining as much information as feasible.
In our work, we used PCA for dimension reduction. The objective function of PCA
is max_u uᵀCu subject to uᵀu = 1, where C is the covariance matrix of the data and the
vector u ∈ Rᵈ is the projection direction.
Figure 32: A big picture of the idea of PCA algorithm. "Eigenstuffs" are eigenvalues and eigenvectors.
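A minimal NumPy sketch of this objective: the directions u that maximize uᵀCu under the unit-norm constraint are the leading eigenvectors of the covariance matrix C ("eigenstuffs" in the figure's terms):

```python
import numpy as np

def pca(X, n_components):
    """Project data onto the directions maximizing u^T C u with u^T u = 1,
    i.e. the top eigenvectors of the covariance matrix C."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)     # ascending order for symmetric C
    order = np.argsort(eigvals)[::-1]        # sort eigen-pairs descending
    return Xc @ eigvecs[:, order[:n_components]]

rng = np.random.default_rng(0)
# 200 points stretched along the first axis: one direction dominates.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Z = pca(X, 1)
print(Z.shape)  # (200, 1)
```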
RFE requires the retention of a specific number of features, but it is frequently unknown in
advance how many features are valid. Cross-validation is used with RFE to score several
feature subsets and pick the highest-scoring collection of features in order to determine the
optimal number of features. The RFECV visualizer plots the number of features in the model
together with their cross-validated test score and variability, and visualizes the number of
features picked. In the end, we only choose features that have been ranked as essential by at
least two different models.
III.5: Modeling
After analysing our data, it is time to develop a classification model that can predict
to which class an EEG signal belongs. We examine the performance of six classification
models in this section: three deep neural networks and three machine learning models.
The goal of the support vector machine is to locate, in an N-dimensional space
(where N is the number of features), a hyperplane that distinctly classifies the data
points. Many different hyperplanes could be chosen to separate the two classes of data
points.
The goal is to locate a plane that has the greatest possible margin, which may be
understood as the greatest possible distance between data points of both classes. When the
margin distance is maximized, some reinforcement is provided, allowing for subsequent data
points to be categorized with a greater level of confidence.
Gradient Boosting is a specific member of the boosting family of algorithms. Boosting is
the process of combining a number of "weak learners" into a single "strong learner": a
number of algorithms with poor individual performance are combined into a single
algorithm that is significantly more effective. The transformation from "weak learners"
into "strong learners" is accomplished by repeatedly calling them to estimate a variable
of interest.
Within the context of a classification, each individual is assigned a weight that is
constant at the outset and that, if a model is incorrect, is increased prior to estimating the
next model (which will thus take these weights into account). The update of the weights will
be computed using the stochastic gradient descent method.
The graphic above shows that similar data points usually lie close to one
another. The KNN algorithm relies on this assumption being true often enough for the
algorithm to be useful. KNN combines the concept of similarity (also known as distance
or proximity) with some elementary mathematics, namely computing the distance between
points on a graph. For a given query, the algorithm proceeds as follows:
● For each observation in the data, compute the distance between the query and
the current observation.
● Add the distance and the index of the corresponding observation to an ordered
collection.
● Sort this collection of distances and indices from smallest to greatest
distance (in ascending order).
● Pick the first k entries from the sorted collection (the k nearest
neighbours).
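The steps above, plus the final majority vote over the k labels, can be sketched in a few lines of NumPy; the helper name `knn_predict` and the toy data are ours, for illustration only:

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # 1. Distance between the query and every training observation.
    dists = np.linalg.norm(X_train - query, axis=1)
    # 2-3. Pair each distance with its index and sort in ascending order.
    order = np.argsort(dists)
    # 4. Keep the k nearest neighbours and take a majority vote on their labels.
    k_labels = y_train[order[:k]]
    values, counts = np.unique(k_labels, return_counts=True)
    return values[np.argmax(counts)]

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.0])))
```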
Three deep learning models were used to decode the EEG signals:
a standard convolutional neural network, a deep ConvNet architecture inspired by
computer vision models, and a compact CNN designed specifically for EEG-based BCIs.
Deep learning (DL) has sparked interest in numerous fields because of its better
performance. DL is capable of dealing with nonlinear and nonstationary data and learning
underlying characteristics from signals. For the categorization of EEG signals, certain deep
learning approaches are used. Because of their capacity to learn characteristics from small
receptive fields, CNNs have been frequently employed in EEG categorization. CNNs are
appropriate for complex EEG recognition tasks because the trained detector may be utilized
to identify abstract characteristics via convolutional layer repetition. They have obtained
good results and are widely used by many researchers.
Owing to its structure, a CNN loosely imitates the complex cerebral cortex of the
human brain. Given a sufficiently large training dataset, it trains a complex model
that learns features using backpropagation and gradient-descent optimization, and
extracts them through a sequence of filtering, normalization, and nonlinear activation
operations.
Recent literature has indicated that there is promise in using Convolutional neural
networks (deep ConvNets) for EEG classification. Effective computer vision architectures
served as inspiration for our ConvNet model.
Our deep ConvNet featured four convolution-max-pooling blocks, as shown in the picture
below: a first block specifically designed to handle EEG input, followed by three
standard convolution-max-pooling blocks and a dense softmax classification layer.
Due to the high number of input channels, the first convolutional block was split into
two convolutional layers: a first convolution in time and a second convolution in
space, across electrodes; each filter of the spatial convolution has weights for all
electrodes and for all filters of the preceding temporal convolution.
The model uses exponential linear units (ELUs) as activation functions:
f(x) = x for x > 0 and f(x) = e^x − 1 for x ≤ 0.
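The ELU activation can be written directly in NumPy; this is only an illustration of the formula, not part of the trained network:

```python
import numpy as np

def elu(x):
    # f(x) = x for x > 0, f(x) = e^x - 1 for x <= 0
    return np.where(x > 0, x, np.exp(x) - 1.0)

print(elu(np.array([-2.0, 0.0, 3.0])))
```

Unlike ReLU, ELU saturates smoothly to −1 for large negative inputs, which keeps gradients flowing.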
III.5.2.c: EEGNET
Because EEG pre-processing steps are often very specific to the EEG feature of
interest, other potentially relevant EEG features may be excluded from the analysis.
An EEG-specific model that incorporates well-known EEG feature extraction ideas for
BCI is therefore required.
Here, we introduce EEGNet, a compact CNN for classifying and interpreting EEG-based
BCIs. We use depthwise and separable convolutions, previously used in computer vision,
to build an EEG-specific network that covers several well-known EEG feature extraction
techniques, such as optimal spatial filtering and filter-bank construction, while
simultaneously reducing the number of trainable parameters compared to existing
approaches.
Figure 40 and Table 4 show a visualization and a full description of the EEGNet model
for EEG trials collected at a 1024 Hz sampling rate, where C is the number of channels,
T the number of time samples, F1 the number of temporal filters, D the depth multiplier
(number of spatial filters per temporal filter), F2 the number of pointwise filters,
and N the number of classes.
The network first learns frequency filters using a temporal convolution, then learns
frequency-specific spatial filters using a depthwise convolution applied to each
feature map independently. The separable convolution consists of a depthwise
convolution that learns a temporal summary for each feature map independently,
followed by a pointwise convolution that learns how to optimally mix the feature maps.
Table 4 below contains further information on the model's architecture.
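To make the parameter-saving argument concrete, the sketch below counts the weights of EEGNet-style temporal, depthwise, and separable convolutions for illustrative sizes (F1 = 8, D = 2, C = 128 channels, kernel length 16 — assumed values, not necessarily those of our trained configuration):

```python
# Assumed illustrative sizes (not necessarily our trained configuration).
F1, D, C, kernel_len = 8, 2, 128, 16
F2 = F1 * D  # number of pointwise filters

# Temporal convolution: F1 kernels of length kernel_len over one input channel.
temporal = F1 * kernel_len
# Depthwise spatial convolution: D spatial kernels of size (C, 1) per temporal map.
depthwise = F1 * D * C
# Separable convolution: one depthwise temporal kernel per map + F2 pointwise mixers.
separable = F2 * kernel_len + F2 * F2

total = temporal + depthwise + separable
print(temporal, depthwise, separable, total)
```

Because the depthwise and pointwise steps never form full dense kernels, the convolutional part stays in the low thousands of weights, far below a comparable standard CNN.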
Table 4 (excerpt, layer output shapes): Input: C × T; Reshape: 1 × C × T;
BatchNorm: F1 × C × T; Flatten: F2 × (T // 32).
Models in machine learning are only as valuable as their predictive ability; hence,
our fundamental goal is to build high-quality models with potential predictive power. We
will now look at ways of evaluating the quality of models created by our machine learning
and deep learning algorithms. In order to improve our model's overall predictive capacity,
we should evaluate our model's performance using a variety of metrics before we deploy it
on real data.
III.5.3.a: Accuracy
Accuracy is the proportion of correctly classified samples among the total number of
samples in the test set. We compute it as follows:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (7)
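Equation (7) can be checked on toy labels (the labels below are illustrative, not our EEG predictions):

```python
# Toy labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)  # 6 of 8 samples correct -> 0.75
```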
The results show that the EEGNET and ConvNet algorithms offer the best outcomes.
However, looking at the range of accuracy values, we can easily see that our models
are not effective at discriminating between classes, since it is more difficult to
classify more than two classes.
Another way to look at this is in terms of decision boundaries. More classes mean
more boundaries. With four classes, there will generally be a boundary between every
pair of classes (six boundaries). The number of boundaries grows with the number of
classes, and because our work is in a high-dimensional space, many classes can be
adjacent to several others. This increases the amount of space close to a boundary,
and the number of data points close to a boundary whose position may not have been
estimated perfectly. Also, with fewer data points per class, the estimate of where
each boundary lies is less accurate.
To address this problem, we decided to take advantage of the benefits of binary
classification for our multi-class classification task by splitting the related dataset into
numerous binary classification datasets and training a binary classification model for each.
We then have four expert binary classifiers, each very good at recognizing one word
against all the others.
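The one-vs-rest split can be sketched as follows, assuming integer labels 0-3 stand for the four words (a hypothetical mapping):

```python
import numpy as np

y = np.array([0, 1, 2, 3, 1, 0, 2, 3])  # multiclass labels, one per trial

# One binary target vector per class: 1 for "this word", 0 for "any other word".
binary_targets = {c: (y == c).astype(int) for c in range(4)}
print(binary_targets[2])
```

Each of the four target vectors then trains its own binary classifier on the same features.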
As shown in the graph above, EEGNET and ConvNet have the highest accuracy for
all class labels. The remainder of the analysis concentrates on a single binary
classification task, the one for the class "UP", as the discussion applies equally
to the other binary classification tasks.
Model      Accuracy (%)
SVM        72   72
XGBoost    84   80
KNN        80   80
1D-CNN     61   63
EEGNet     86   84
ConvNet    87   85
True Positive (TP) refers to a sample belonging to the positive class being classified
correctly.
True Negative (TN) refers to a sample belonging to the negative class being classified
correctly.
False Positive (FP) refers to a sample belonging to the negative class but being
classified wrongly as belonging to the positive class.
False Negative (FN) refers to a sample belonging to the positive class but being
classified wrongly as belonging to the negative class.
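With scikit-learn, the confusion matrix holding these counts can be built in one call (toy labels for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```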
Below are the confusion matrices created from our models.
Figure 42: SVM confusion matrix. Figure 43: XGBoost confusion matrix.
Figure 44: KNN confusion matrix. Figure 45: CNN confusion matrix.
Figure 46: EEGNET confusion matrix. Figure 47: ConvNet confusion matrix.
As can be seen from the color-bar on the side, the number increases as the color
becomes darker. As a result, we would intuitively conclude that darker colors on diagonal
parts and lighter colors on the others indicate that our model is performing well, and vice
versa. Despite their high accuracy, the machine learning models are poor at identifying
the class in question. The ConvNet and EEGNET models produced the best results, as seen
in the figures; the darker colour of the diagonal element (1,1) implies that these
models perform really well for the class "UP".
We can evaluate the model more closely using the four different numbers from the
matrix. In general, we can get the following quantitative evaluation metrics from this binary
class confusion matrix:
Precision: The ratio of correct positive predictions to total predicted positives.
The Precision formula is as follows:

Precision = TP / (TP + FP)    (8)
Recall: The ratio of correct positive predictions to total actual positives. The
Recall formula is as follows:

Recall = TP / (TP + FN)    (9)
F1 Score: The weighted harmonic mean of precision and recall. The closer it is to one,
the better the model. The F1 Score formula is as follows:

F1 = 2 × Precision × Recall / (Precision + Recall)    (10)
Fortunately, when building a classification model in Python, we can use the sklearn
library's classification_report() function to generate all three of these metrics. The
classification reports for our three machine learning models (SVM, XGBoost, and KNN)
are provided below.
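A minimal sketch of that call, on toy labels rather than our EEG predictions (the function is `sklearn.metrics.classification_report`):

```python
from sklearn.metrics import classification_report

# Toy labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision, recall, and F1 per class, in one call.
print(classification_report(y_true, y_pred))
```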
Figure 48: SVM classification report. Figure 49: XGBoost classification report.
Figure 50: KNN classification report.
Model     Precision (%)   Recall (%)   F1-score (%)
          1      0        1      0     1      0
1D-CNN    61     68       68     60    64     64
EEGNet    76     94       96     67    85     78
ConvNet   77     75       78     74    78     75
Here, we can see that the XGBoost classifier does a great job at identifying the
"UP" class, with a precision of 0.98, but it has a very low recall (0.59), which means that
the model has a 59% chance of predicting "1" when the actual value is 1. It would be far
less problematic to classify the word as "0" when its true value is "1" than to classify it as
"1" when it actually belongs to another class.
That said, we aim for the best precision value while taking the recall value into
account. For the time being, EEGNET will be the model chosen for the first binary
classification. Let us now examine the evolution of our model's performance over
200 epochs.
For algorithms that learn progressively, such as deep neural networks, learning curves
are a frequently used diagnostic tool in machine learning. We evaluate model
performance throughout training on both the hold-out validation dataset and the
training dataset, and we plot this performance for each epoch.
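A sketch of the quantity we read off such curves, the per-epoch gap between training and validation accuracy (the numbers below are illustrative, not our actual training history):

```python
# Illustrative per-epoch accuracies (not our actual training history).
train_acc = [0.55, 0.70, 0.80, 0.84, 0.86, 0.86]
val_acc   = [0.52, 0.66, 0.76, 0.80, 0.83, 0.84]

# The train-validation gap per epoch: a small, stable gap suggests mild overfitting.
gaps = [round(t - v, 2) for t, v in zip(train_acc, val_acc)]
print(gaps)
```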
As we can see in the right-hand diagram, the accuracy increases rapidly in the first
twenty epochs, indicating that the network is learning fast. Afterwards, the curve
flattens, indicating that further epochs bring little additional improvement.
Furthermore, the accuracy curves exhibit only a small amount of overfitting, as
evidenced by the small gap between training and validation accuracy.
The loss curve shows that we have minimized overfitting, yet some still exists: while
overfitting can usually be reduced, it can rarely be eliminated completely while still
minimizing the loss.
III.5.3.e: Results
In this project, the goal was to classify an EEG signal into four different classes (4
different Spanish words). However, traditional multiclass classification methods were not
providing satisfactory results due to the high dimensionality of the input data. Therefore,
the decision was made to divide the multiclass classification problem into four separate
binary classification problems, each focused on distinguishing between two classes.
After conducting a single binary classification study, we concluded that EEGNet is
the best classifier for the binary classification problem at hand. The study compared
the performance of several popular classification algorithms, including SVM, KNN,
XGBoost, 1D CNN, and ConvNet, against EEGNet.
The results showed that EEGNet consistently outperformed the other classifiers in terms of
accuracy, precision, and recall, making it the most reliable and efficient option for binary
classification tasks. The success of EEGNet can be attributed to its ability to effectively
capture the spatial and temporal features of EEG signals, making it a powerful tool for
analyzing brain activity and detecting abnormalities or patterns in the data.
Finally, the input signal was passed through each of the four binary classifiers, and the
final classification decision was made based on the class with the highest probability.
Each binary classifier was properly trained and assessed individually to ensure accurate
results, and the overall performance of the classification model was monitored and
evaluated for fine-tuning. By employing this approach, we aimed to improve the accuracy
and efficiency of our classification model for EEG signal classification.
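The final decision rule can be sketched as follows (the probability values and word scores are hypothetical, for illustration):

```python
# One probability per binary classifier: P(trial belongs to that word).
scores = {"UP": 0.81, "DOWN": 0.12, "LEFT": 0.34, "RIGHT": 0.27}

# Final decision: the class whose binary classifier is most confident.
predicted = max(scores, key=scores.get)
print(predicted)
```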
Conclusion
We addressed the final key goal of our internship: presenting the inner speech
recognition use case. The problem was stated and explained at the beginning of this
chapter. Following that, we described our dataset and its acquisition process. In the
last section, we demonstrated the machine learning pipeline we created to classify the
signals, starting with the feature pre-processing phase and concluding with a
discussion of the modelling task's outcomes.
CONCLUSIONS AND PROSPECTS
The present work is part of my final-year project, realized in the company ML Basel
Architectures and directed by Mr. Bassem Ben Hamed and Mr. Christian Bock as technical
managers and Mrs. Sinda Belhadj Daoud as academic manager. As indicated in the report,
we started our work with a thorough bibliographic research covering the fundamentals
and advanced topics of ML for time series and its applications. The usage of machine
learning in the biomedical field was discussed along with its advantages, together
with a presentation of two use cases related to the topic in question.
Next, we walked through the use case and discussed related work, then dove into our
machine learning pipeline. Throughout the pre-processing, we applied Welch's feature
extraction method to retrieve meaningful features. With the aim of reducing the dimension
of the data, we have applied PCA to identify the optimal number of features to select and
RFE to determine these features. We then presented the six models used during the
modelling phase and their evaluation.
Our data are sufficient to perform binary classification tasks on short recordings of
brain signals from imagined speech with an accuracy of up to 87%. These results
promise good performance on similar datasets with a larger number of classes.
Time series data is ever-present in healthcare and offers an exciting opportunity for
machine learning methods to extract actionable insights about human health. However, there
is a significant gap between the existing literature on time series and what is ultimately
needed to make machine learning systems practical and deployable for healthcare. Indeed,
learning from time series for healthcare is notoriously difficult: the data can be
very high dimensional, and the feature extraction step is highly complex, partly
because of the difficulty of choosing which method to apply; moreover, the output of
these methods can be of even higher dimension.
Webography
[1] Christian Bock, Michael Moor, Catherine R. Jutzeler & Karsten Borgwardt. Machine
Learning for Biomedical Time Series Classification: From Shapelets to Deep Learning.
[03/2022]
[2] Christian Bock. Motifs and Manifolds: Statistical and Topological Machine Learning
for Characterising and Classifying Biomedical Time Series. [03/2022]
[3] Thuy T. Pham (2019). Applying Machine Learning for Automated Classification of
Biomedical Data in Subject-Independent Settings. [03/2022]
[4] Tiago H. Falk and Ervin Sejdic (Editors) (2018). Signal Processing and Machine
Learning for Biomedical Big Data, CRC Press, Taylor & Francis. [04/2022]
[5] Xiang-tian Yu, Lu Wang, and Tao Zeng (2018). Revisit of Machine Learning Supported
Biological and Biomedical Studies. In Tao Huang (Editor), Computational Systems
Biology, Methods in Molecular Biology 1754, Humana Press, pp. 183-204. [04/2022]
[6] Blank, S. C., Scott, S. K., Murphy, K., Warburton, E. & Wise, R. J. Speech
production: Wernicke, Broca and beyond. Brain 125, 1829-1838 (2002). [05/2022]
[7] Timmers, I., Jansma, B. M. & Rubio-Gozalbo, M. E. From mind to mouth: event
related potentials of sentence production in classic galactosemia. PLoS One 7, e52826
(2012). [06/2022]
[8] Hongzu Li et al. A Survey of Heart Anomaly Detection Using Ambulatory
Electrocardiogram (ECG). [06/2022]
[9] Torres Garcia et al. EEG Sonification for Classifying Unspoken Words (invasive
EEG). [07/2022]
[10] Jonathan Clayton, Scott Wellington, Cassia Valentini-Botinhao, Oliver Watts (The
University of Edinburgh, SpeakUnique Limited). Decoding imagined, heard, and spoken
speech: classification and regression of EEG using a 14-channel dry-contact mobile
headset. [07/2022]
[11] Xiaotong Gu, Zehong Cao, Alireza Jolfaei, Peng Xu, Dongrui Wu, Tzyy-Ping Jung,
and Chin-Teng Lin. [07/2022]
[12] Wei Bin Ng, A. Saidatul, Chong Y. F. and Z. Ibrahim. PSD-Based Features
Extraction for EEG Signal During Typing Task. [07/2022]
[13] German A. Pressel Coretto, Iván E. Gareis, and H. Leonardo Rufiner. Open Access
database of EEG signals recorded during imagined speech. [07/2022]
Résumé
The present project, carried out at Digital Innovation Partner, is part of a final-year
project for obtaining the national engineering diploma.
It consists of a complete study of time series analysis, the state of the art of
machine learning approaches for temporal data, a presentation of some of the most
common biomedical time series use cases, and finally the application of modelling to
the inner speech recognition use case.
Abstract
Keywords: time series, EEG, brain-computer interfaces, imagined speech, neural
decoding, machine learning, deep learning, Welch, CNN, ConvNet, and EEGNet.