
Knowledge-Based Systems 221 (2021) 106982


Supporting unknown number of users in keystroke dynamics models



Itay Hazan a,b,∗, Oded Margalit c,1, Lior Rokach b

a IBM Cybersecurity Center of Excellence, Beer-Sheva, Israel
b Department of Software & Information Systems Eng., Ben-Gurion University of the Negev, Beer-Sheva, Israel
c Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel

ARTICLE INFO

Article history:
Received 15 July 2020
Received in revised form 25 December 2020
Accepted 18 March 2021
Available online 20 March 2021

Keywords:
Behavioral biometrics
Keystroke dynamics
Multi-user model
X-means

ABSTRACT

In recent years, keystroke dynamics has gained popularity as a reliable means of verifying user identity in remote systems. Due to its high performance in verification and the fact that it does not require additional effort from the user, keystroke dynamics has become one of the most preferred second factors of authentication. Despite its prominence, it has one major limitation: keystroke dynamics algorithms are good at fitting a model to one user and one user only. When such algorithms try to fit a model to more than one user, the verification accuracy decreases dramatically. However, in real-world applications it is common practice for two or more users to use the same credentials, such as in shared bank accounts, shared social media profiles, and shared streaming licenses which allow multiple users in one account. In these cases, keystroke dynamics solutions become unreliable. To address this limitation, we propose a method that can leverage existing keystroke dynamics algorithms to automatically determine the number of users sharing the account and accurately support accounts that are shared by multiple users. We evaluate our method using eight state-of-the-art keystroke dynamics algorithms and three public datasets, with up to five different users in one model, achieving an average improvement in verification of 9.2% for the AUC and 8.6% for the EER in the multi-user cases, with just a negligible reduction of 0.2% for the AUC and 0.3% for the EER in the one-user cases.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

User authentication is one of the most important security aspects of online systems. Identity theft and data leakage are major concerns of organizations and companies, who consistently seek relevant security solutions to address these threats. In addition, the recent worldwide pandemic has driven a shift to working from home and heightened the importance of authenticating and verifying users who access organizational systems and information remotely. Beyond working from home, performing daily tasks remotely, such as signing insurance documents and renewing prescriptions, has become more relevant than ever. This is exacerbated by the fact that cyber threats have been on the rise since the pandemic began [1]. Therefore, employers and service providers need a trusted authentication mechanism to identify their users, and may even need to employ several parallel authentication schemes to ensure their safety.

In recent years, behavioral biometrics has received attention as a means of second factor authentication for verifying users' identity in online systems. It is often considered seamless and non-invasive [2,3], especially when compared to physical biometrics, such as fingerprints or iris scans. Therefore, many service providers seek to develop such solutions in-house or use the services of a third party to verify their users' identities [4].

Keystroke dynamics is a behavioral biometric modality based on a user's keyboard typing pattern. The main advantages of keystroke dynamics are that it is relatively easy to implement [5] and very effective in detecting impostors. There are two main types of keystroke dynamics: free text and fixed text. Free text keystroke dynamics focuses on verifying user identity based on unexpected and spontaneous text, which usually requires long text samples and an extended training period. Fixed text keystroke dynamics, on the other hand, focuses on verifying user identity given a defined repeatable text, which is usually brief and requires a much shorter training period.

Because of their practical advantages, fixed text keystroke dynamics solutions are popular among service providers for verifying a user's identity as he/she types his/her username and password. These solutions, in turn, help in cases of credential leakage, theft, and brute-force attacks, and provide a second layer of protection without the need for any additional effort on the part of the user, who is typing their username and password anyway.

∗ Corresponding author at: IBM Cybersecurity Center of Excellence, Beer-Sheva, Israel.
E-mail addresses: itayha@il.ibm.com (I. Hazan), odedm@post.bgu.ac.il (O. Margalit), liorrk@bgu.ac.il (L. Rokach).
1 Most of the work was done while the author was working at IBM.

https://doi.org/10.1016/j.knosys.2021.106982
Keystroke dynamics solutions are usually composed of the following steps: collect samples of keystroke events (i.e., presses and releases), which include the exact keycode and timestamp; extract features from the keystroke samples, which transforms the samples into a uniform set of feature vectors; and build a model for each user using the set of feature vectors and one of the various existing machine learning algorithms [2]. Although there are many different algorithms, the commonly used descriptive features for keystroke dynamics are di-graph features, which are based on the time elapsed between every two ordered key events (explained further in Section 3.2.1). Once a model has been established for a user, new samples can be tested against the model, which in turn produces an anomaly score. Using this score, the online system can either approve or decline the authentication request. Altogether, keystroke dynamics solutions can produce highly accurate models for user verification.

Although accurate, keystroke dynamics algorithms rely on one important assumption: a model is built for one user only. This assumption, to the best of our knowledge, is a foundation of all keystroke dynamics methods. Nevertheless, in real-world applications, multi-user accounts are common. In many domains, such as finance, social media, government, and streaming, one account can frequently serve more than one user (even if doing so is against the official terms of use). Married couples, business partners, roommates, managers and assistants, etc. often share or use the same account. Despite this reality, in our search of the academic literature, we were unable to locate any research on keystroke dynamics algorithms that generate multi-user models, and when we tried to apply known keystroke dynamics algorithms to multi-user training data, we saw a dramatic reduction in verification performance; moreover, as the number of users increased, the performance deteriorated further.

Therefore, in this study, our research question is how to develop keystroke dynamics models that support the verification of both multi-user and single-user accounts. We aim to generate models that are capable of verifying any user the model was trained on and rejecting other users that it was not trained on, without the need for each specific user to identify him/herself or for the number of users to be supplied in advance, thereby securing these accounts from impostors.

However, developing a new keystroke dynamics algorithm that generates models supporting multiple users is not sufficient. There are many keystroke dynamics algorithms; Banerjee et al. [6] surveyed more than 100 such algorithms, and not one of them always outperformed all of the others, a phenomenon known as the no free lunch theorem. Some algorithms perform better than others in given circumstances. Thus, the goal of our study is not to create a new algorithm for multi-user keystroke dynamics but rather to develop a method that will leverage existing keystroke dynamics algorithms, allowing them to support multi-user accounts without explicit knowledge of the number of users or the sample-to-user association.

In this paper, we propose a method that can wrap any keystroke dynamics algorithm and enable it to generate multi-user models in addition to one-user models, according to the detected number of users. The method has four steps: feature extraction, quantile transformation, data clustering, and sub-model training. We evaluate our method using eight state-of-the-art algorithms that have been proven to work well with fixed text keystroke dynamics. We divide these algorithms into two main types: distance-based and model-based. The distance-based algorithms are the M2005 [7], Outlier Count z-score [8], Manhattan scaled [9], and Statistical model [10] algorithms; the model-based algorithms are the Isolation Forest [11], One Class SVM [12], HBOS [13], and Autoencoder [14] algorithms.

In our experiments, we assess both one-user models and multi-user models with up to five different users, with data obtained from three widely used public datasets: CMU 2009 [10], Greyc 2009 [15], and Greyc-Web 2012 [16]. The proposed method yielded a significant improvement: in terms of the AUC (Area Under the ROC Curve) metric, in multi-user cases there was an average increase of 9.2%, with a maximal increase of 21.7% in the best case; using the proposed method in one-user cases resulted in an insignificant average decrease of 0.2% in the AUC, with a maximal 1.0% reduction in the worst case. In terms of the EER (Equal Error Rate) metric, in multi-user cases there was an average decrease of 8.6%, with a maximal decrease of 18.2%; using the same method in one-user cases resulted in an insignificant average increase of just 0.3%, with a maximal increase of 1.2% in the worst case.

The remainder of this paper is organized as follows: related work is presented in Section 2. A thorough explanation of the proposed method is provided in Section 3. Section 4 contains a description of the entire experimentation process, including the evaluated algorithms, the datasets, the results, and the limitations of the proposed method. Finally, in Section 5, we discuss our conclusions.

2. Related work

2.1. Initial research

Keystroke dynamics has been studied for several decades. In the 1980s, first Gaines et al. [17] and then Leggett et al. [18] began investigating the ability to verify user identity based on keyboard typing patterns, by profiling the latency between key events. Both studies presented only limited results on very small datasets, leaving room for future work.

2.2. Different research questions

Since then, keystroke dynamics research has developed, and many different studies have been performed addressing various research questions in the field. Montalvão et al. [19] focused on two questions: (1) How does a subject develop a stable rhythmic keystroke dynamics behavioral profile? The authors showed that there are different stabilization processes within sessions and between sessions (each session contains several samples in a row). The samples in the first sessions were noisy and unstable, whereas the samples in the latter sessions were only noisy at the beginning but quickly managed to stabilize. (2) How does the length of the text affect the verification performance? The authors showed that the length has a dramatic effect on the performance, with a greater effect on the letters typed first and less effect on the letters subsequently typed. Pisani et al. [20] investigated adaptive approaches for keystroke dynamics models, so they can better handle user behavior changes over time. Several types of adaptation mechanisms were proposed and evaluated, showing that outdated keystroke dynamics models can significantly improve scoring if they use efficient adaptation mechanisms. Serwadda et al. [21] examined the robustness of keystroke dynamics models against statistical attacks. The authors collected keystroke events from more than 3000 users and developed a bot that mimicked typical user behavior; the bot was used against password-based keystroke dynamics models, demonstrating a reduction in the performance of keystroke dynamics models. Bours et al. [22] studied how keystroke dynamics models perform when they are trained using one keyboard and tested using another keyboard. The authors showed that doing so can cause performance deterioration in terms of the FPR (False Positive Rate) and FNR (False Negative Rate) of up to 8.1% compared to
a scenario in which the models are trained and tested using the same keyboard. Ho et al. [23] evaluated the use of preprocessing techniques to improve keystroke dynamics models. The authors showed that a bagging-like technique, called MINIBAG, that splits the data into mini-batches and creates an ensemble of classifiers can improve the produced score. When comparing different distance-based techniques, the authors were able to show an improvement of 2.4% in terms of the EER when using MINIBAG. Raul et al. [24] tackled the problem of keystroke dynamics with small training sets (i.e., 10 samples). The authors suggested eliminating outliers in the collection phase, using the nearest neighbors algorithm; then the collected data was aligned and normalized according to the start and end time, using the ratio of each delta time to the total time instead of the raw delta time, as usually done in keystroke dynamics. Migdal et al. [25] tried to overcome problems associated with collecting realistic keystroke dynamics datasets, such as data privacy and time constraints, by synthetically generating realistic keystroke event samples.

Other works focused on improving keystroke dynamics models using additional input sensors. Mondal et al. [26] combined keystroke dynamics with mouse movement in order to improve keystroke verification models. Giuffrida et al. [27] and Antal et al. [28] used smartphone sensors to improve keystroke dynamics models through touch-based methods performed on PIN codes or passwords.

2.3. Verification algorithms

Nevertheless, the question examined most extensively in keystroke dynamics research is which algorithm is better for user verification. This question remains open, as the results vary. Banerjee et al. [6] surveyed years of keystroke dynamics research and categorized methods by algorithm, features, text type (fixed/free), environment (controlled/uncontrolled), number of subjects, number of samples, and the results, if applicable. The authors divided the methods into four main categories: statistical, neural network, pattern recognition (including learning-based algorithms), and heuristic search (including combinations of algorithms), and described each category's advantages and disadvantages. The authors showed that different algorithms performed differently in different circumstances. Teh et al. [2] documented the increased interest in keystroke dynamics methods over the years based on the number of publications and divided them into several types, showing that distance-based techniques were the most popular. Pisani et al. [29] performed a systematic review of keystroke dynamics methods, focusing on 16 leading publications; their review included both one-class classification algorithms, such as Manhattan scaled and Outlier Count z-score, and binary classification algorithms, such as SVM and Random Forest. Monaco et al. [30] developed and compared several algorithms for fixed text keystroke dynamics, such as One Class SVM, and several types of Autoencoders and ensembles. The authors showed that the simple method of Manhattan distance with score normalization outperformed the other, more complicated, methods. We used several of the best performing algorithms noted in these surveys in our evaluations, as discussed in Section 4.3. Deng et al. [31] developed and compared several keystroke dynamics methods and found that the best performing method is the DBN (Deep Belief Network), which is a probabilistic multi-layer network with two hidden layers. Bhatia et al. [32] also performed a comparison of keystroke dynamics methods and observed that the GFM (Generalized Fuzzy Model) performed the best. However, both the DBN and the GFM methods require much larger training sets (200 samples and 300 samples respectively) than the training sets used in our evaluations, and therefore, we did not use them in our evaluation.

2.4. Detecting account sharing

Hwang et al. [33] examined the use of keystroke dynamics to detect account sharing in order to prevent account abuse. The authors suggested using a VB-GMM (Variational Based Gaussian Mixture Model) on top of keystroke dynamics' raw features. This algorithm tries to fit one or more Gaussians to the data according to its distribution, and if the number of fitted Gaussians is greater than one, the account is classified as shared. Their basic assumption is that patterns of users can be found in the Euclidean space using Gaussians. The authors tested their method on a self-collected, non-public dataset twice, with two different goals. When the goal was to detect whether the account is shared or not, regardless of the number of users, the authors were able to obtain an average error rate of 2%. However, when the goal changed to the detection of the exact number of users in shared accounts of 2-4 users, the performance dropped dramatically to an average error rate of 34% (22%, 39%, and 41% for two, three, and four users respectively). When we tried to evaluate the same algorithm on the three public datasets used in our study (presented in Section 4.2) to leverage keystroke dynamics algorithms to support multi-user models, the results were not applicable, as the number of Gaussians detected and their correctness were not valid for multi-user models, even after an exhaustive search over the VB-GMM hyperparameters.

In summary, the studies performed thus far have examined different aspects of keystroke dynamics, such as how subjects develop their behavior pattern or how keystroke dynamics can be combined with other sensors in order to improve user verification. However, most of the studies in the field have focused on developing keystroke dynamics algorithms and evaluating their accuracy in different circumstances, such as with a different number of training samples, different text inputs, different keyboards, etc. Just one study tried to address the issue of shared accounts, but it only obtained valid results for the question of whether the account is shared or not. The authors were unable to correctly detect the number of users in the shared account and did not discuss how to correctly divide the data into user clusters. To the best of our knowledge, we are the first to try to leverage keystroke dynamics algorithms to generate both one-user and multi-user models that enable the verification of users' identity even when the number of users is unknown.

3. Proposed method

3.1. Overview

Our proposed method is aimed at wrapping any existing keystroke dynamics algorithm selected. Since keystroke dynamics algorithms have both training and testing phases, so does our method, which wraps them both. In the training phase, our method's task is to use clustering to divide the training instances into subsets, hopefully according to the users that produced the instances. We do this without knowing the number of users or the instance-to-user association. After doing so, a sub-model is built for each subset using the keystroke dynamics algorithm selected. In the testing phase, our method's task is to associate the test instance with the relevant training subset, again, hopefully to the user that produced the instance, without knowing to which user or subset it belongs. Finally, the associated sub-model is applied, and its test score is used as our proposed method's score.
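This overview can be sketched end-to-end in a few lines. The following is a minimal illustration, not the authors' implementation: scikit-learn's QuantileTransformer stands in for the transformation step, KMeans with a fixed k stands in for X-means (which scikit-learn does not provide), a toy mean-distance model stands in for the wrapped keystroke dynamics algorithm, and instances are assumed to already be feature vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import QuantileTransformer

class MeanDistanceModel:
    """Toy stand-in for a keystroke dynamics algorithm: the anomaly score
    is the Euclidean distance from the mean training vector."""
    def __init__(self, vectors):
        self.mean = vectors.mean(axis=0)

    def score(self, vector):
        return float(np.linalg.norm(vector - self.mean))

class MultiUserWrapper:
    """Sketch of the wrapper: transform the training features, cluster them
    (ideally one cluster per user), and train one sub-model per cluster.
    At test time, assign the instance to a cluster and score it against
    that cluster's sub-model only."""
    def __init__(self, n_clusters):
        self.transformer = QuantileTransformer(n_quantiles=10,
                                               output_distribution="uniform")
        self.clusterer = KMeans(n_clusters=n_clusters, n_init=10,
                                random_state=0)
        self.sub_models = {}

    def fit(self, vectors):
        vectors_q = self.transformer.fit_transform(vectors)
        labels = self.clusterer.fit_predict(vectors_q)
        for label in set(labels):
            self.sub_models[label] = MeanDistanceModel(vectors[labels == label])

    def score(self, vector):
        vector_q = self.transformer.transform([vector])
        label = int(self.clusterer.predict(vector_q)[0])
        return self.sub_models[label].score(vector)
```

With this structure, a fast typist and a slow typist sharing one account are each scored against their own sub-model rather than against a single blended profile.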
3.2. Training phase

Before we provide a detailed description of the training phase, we define the task for this phase more formally. In the training phase we receive n raw training instances T^r = [t^r_1, . . . , t^r_n] that are produced by a set of users U = [u_1, . . . , u_k] of an unknown size k; the superscript r is used to indicate that they are raw instances that did not undergo any processing. The association between U and T^r is denoted by C, the vector C = [C_1, . . . , C_k], which contains k lists such that each C_j, 1 ≤ j ≤ k, contains the indices of T^r belonging to user u_j. However, C is unknown to us, which creates two challenges. First, the size |C| = k is unknown, and it can be any number between 1 (all training samples produced by one user) and n (i.e., each sample performed by a different user). Second, the association between the instances and the users is unknown to us as well. Therefore, our task is to produce an approximation of C, which we will refer to as Ĉ, that must be as accurate as possible, addressing the two challenges.

Ĉ is a vector of m lists Ĉ = [Ĉ_1, . . . , Ĉ_m], such that each Ĉ_i, 1 ≤ i ≤ m, is a list containing indices of T^r. Our goal is for Ĉ to optimize the two following conditions: (1) m = k, and (2) for every i with Ĉ_i ∈ Ĉ there exists j with C_j ∈ C such that Ĉ_i = C_j. If we manage to generate such a Ĉ, we can use the relevant instances from T^r according to the indices in Ĉ_i, or in short T^r[Ĉ_i], and build a sub-model using the keystroke dynamics algorithm selected to verify instances in the testing phase.

Our method's training phase has four steps: (1) feature extraction, including the replacement of missing values; (2) quantile transformation of each feature to a uniform distribution; (3) data clustering, which will uncover the users in the shared account; and (4) training sub-models, as usually done, using any keystroke dynamics algorithm.

3.2.1. Step 1: feature extraction

In the first step, we extract features for every instance in the training set T^r = [t^r_1, . . . , t^r_n]. As mentioned earlier, the most commonly used features in the field of keystroke dynamics are di-graphs, as defined by Mhenni et al. [34]. Since each di-graph is assembled from two keyboard events of either press or release, there are four types of di-graphs: press-to-release (a.k.a. dwell/hold time), release-to-press (a.k.a. flight time), press-to-press, and release-to-release. Note that in some cases, press-to-release on the same key is referred to as a monograph [35]; however, we include these in our press-to-release definition. Although we could use all four types, we found that clustering (i.e., step 3) improved slightly when just the subset F = {press-to-release, release-to-release} was used. The output of the feature extraction step is a list of n feature vector instances, which we denote T^F = [t^F_1, . . . , t^F_n]. The superscript F marks that these instances are now feature vectors.

3.2.2. Step 2: quantile transformation

In the second step, we preprocess T^F = [t^F_1, . . . , t^F_n] to ensure that the third step (clustering) is as accurate as possible. After evaluating several preprocessing techniques, we found that the best one in our case is quantile transformation.

Quantile transformation means taking a continuous random variable of any existing distribution and transforming it into a uniformly distributed variable. The values in the transformed variable are the points in the distribution that relate to their ranking in the original variable. We perform quantile transformation on each feature separately. Quantile transformation is an efficient technique for dispersing close values and, more importantly, for reducing the impact of extreme values. As we will show throughout this paper, quantile transformation has a crucial influence on the clustering performed, as clustering methods are prone to being misled by outliers (i.e., data points with extreme values), which can be extremely problematic when there are multiple users in one model. Another advantage of quantile transformation is that it obviates the need to remove or ignore outliers. In the world of behavioral biometrics, where the training set is usually limited, removing even a handful of instances can mean removing a large portion of the training set. Therefore, no instances are removed from the training set at all when using the proposed method.

To demonstrate the importance of quantile transformation, in Fig. 1a and Fig. 1b we present three examples, one from each of the datasets (described in Section 4.2) used to evaluate our method. Each example presents the data of three users from one of the datasets. For each user we projected the first three sessions (each session contains several samples entered in a row), taking 10 instances from each session. For each instance, we extracted the di-graph features and plotted them using the PCA algorithm [36] with two components. The same data is shown in two forms: the first, in Fig. 1a, is without any preprocessing, and the second, in Fig. 1b, is with the use of quantile transformation. In Fig. 1a and Fig. 1b each user's instances are presented in a different color. In Fig. 1b we can see that the use of quantile transformation serves to diffuse dense areas and bring the outliers of users closer to the non-outliers, therefore making it much easier for the clustering to detect the users. We can see that even when the instances of the three users are mixed together, once quantile transformation has been applied, the correct clusters can easily be detected. To the best of our knowledge, we are the first to use quantile transformation for keystroke dynamics data preprocessing.

Fig. 1a. Projections of three users' training instances for each dataset, without any preprocessing; each color represents a different user.

Fig. 1b. Projections of three users' training instances for each dataset, using quantile transformation; each color represents a different user.

We use T^FQ = [t^FQ_1, . . . , t^FQ_n] to mark the output of the quantile transformation, so as to indicate that these instances have undergone both feature extraction and quantile transformation. In addition to T^FQ, we generate a transformer Q; later, in the test set, transformer Q can take an instance t^F and transform it into t^FQ according to the distribution in the training set. We return transformer Q, and it is saved for later use in the testing phase.

3.2.3. Step 3: X-means clustering

In the third step, we use T^FQ = [t^FQ_1, . . . , t^FQ_n] to perform the clustering, which will hopefully divide the data into clusters such that each cluster contains the data of exactly one user.
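Before turning to the clustering itself, steps 1 and 2 can be sketched in a few lines. This is a hypothetical sketch: the event format and timings are invented, and scikit-learn's QuantileTransformer plays the role of transformer Q (fitted on training data, reused at test time).

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

def digraph_features(events):
    """Step 1: extract the two di-graph types used for clustering,
    press-to-release and release-to-release, from an ordered list of
    (key, press_time, release_time) tuples (times in seconds)."""
    press_to_release = [release - press for _, press, release in events]
    release_to_release = [events[i + 1][2] - events[i][2]
                          for i in range(len(events) - 1)]
    return np.array(press_to_release + release_to_release)

# A hypothetical typing sample of the fixed text "abc".
sample = [("a", 0.00, 0.08), ("b", 0.15, 0.24), ("c", 0.31, 0.38)]
vec = digraph_features(sample)   # 3 press-to-release + 2 release-to-release

# Step 2: fit a per-feature quantile transformation on a (toy) training
# matrix; the fitted transformer Q is kept for the testing phase.
train = vec + np.random.default_rng(1).normal(0.0, 0.01, size=(12, 5))
Q = QuantileTransformer(n_quantiles=12, output_distribution="uniform").fit(train)
train_q = Q.transform(train)     # each feature now spread over [0, 1]
```

Because each feature is mapped to its rank in the training distribution, an extreme timing value ends up near 0 or 1 rather than dominating the Euclidean distances used by the clustering step.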
I. Hazan, O. Margalit and L. Rokach Knowledge-Based Systems 221 (2021) 106982

user. Perhaps the most well-known method for clustering is K-


means [37]. But as described by Pelleg et al. [38], K-means has
two main drawbacks: first, it is prone to local minima because of
the random initialization; and second, and more importantly in
our case, it requires knowing the number of clusters in advance;
this would be a major obstacle for us, since we do not know how
many users make up the training set. Therefore, we sought a non-
parametric clustering method that does not require the number
of clusters in advance and is also less prone to local minima. We Fig. 2a. Projection of three users’ instances for each dataset, without any
tried various non-parametric clustering algorithms, such as GMM, preprocessing; each color represents a cluster detected using X-means. Score
Affinity Propagation, DB-SCAN, HDB-SCAN, and we found that the above represents ARI.
best performing clustering algorithm was X-means [38].
X-means was developed by Pelleg et al. [38] to overcome the
inherent problems in the well-known K-means algorithm. The
algorithm starts with one cluster that contains the entire set of
instances and iteratively splits each cluster into two clusters using
an optimization criterion (discussed below) until no improving
split can be done or a predefined maximal number of clusters
has been reached. X-means’ advantages are that it is deterministic
and that it does not require to be given the number of clusters.
X-means’ disadvantage, similar to K-means, is its sensitivity to Fig. 2b. Projection of three users’ instances for each dataset, using quantile
outliers. However, when combined with quantile transformation, transformation; each color represents a cluster detected using X-means. Score
the effect of the outliers is dramatically reduced, which allows above represents ARI.
X-means to easily divide the data into the correct clusters, which
is in our case according to the users producing them.
The optimization criterion used in X-means is BIC (Bayesian the use of quantile transformation in Fig. 2b; each cluster de-
Information Criterion) which was presented by Kass et al. [39]. tected appears in a different color. In the figures it can be seen
In each iteration of X-means, the BIC is used to assess all the that without any preprocessing, X-means struggles to correctly
possible strategies in this stage. The BIC scores each strategy, identify the three clusters and mistakenly divides the data into no
which include the current state (i.e. halt the clustering) and the less than seven clusters although there are only three users. How-
various splitting options to K clusters, and chooses the best one. ever, once quantile transformation is applied, the data is divided
Consider a set D and a strategy Sj. We want to assess the goodness of strategy Sj so that we will be able to compare it to any other strategy. The authors suggested scoring strategy Sj according to formula (1):

BIC(Sj) = l̂j(D) − (pj / 2) · log R    (1)

where l̂j(D) is the log likelihood of D according to Sj, R is the number of instances that are currently under consideration in D, and pj is the number of free parameters in strategy Sj. If this is a strategy of halting, then pj is the number of features +1, but if this is a strategy of splitting into K clusters, then pj is K times the number of features +1.

X-means receives two parameters: τ, which is the maximal number of iterations; and γ, which is the maximal number of clusters. The τ parameter is a time optimization parameter that is responsible for stopping the algorithm in case it does not converge. Therefore, τ should be set to the maximal number of iterations possible under the run-time constraints. The γ parameter is more important, as setting it incorrectly can harm the results. Setting it too low will result in under-clustering and a situation in which several users are combined in one sub-model, whereas setting it too high can cause over-clustering, which may lead to a situation in which a user's data is divided among several sub-models. Over-clustering is considered better than under-clustering, but we would prefer to avoid both. Therefore, γ should be set as close as possible to the maximal possible number of users in one model. If there are several estimates for this number, the highest number should be used. These parameters become part of the input to our method; we explain their settings in Section 4.6.

Following the examples presented in Fig. 1a and Fig. 1b, Fig. 2a and Fig. 2b present the clusters detected by the X-means algorithm for the same data. Each figure shows the clusters detected using X-means — without any preprocessing in Fig. 2a and with preprocessing in Fig. 2b, where the clusters are detected correctly, with one cluster per user. The score that appears above each scatter plot is the ARI (Adjusted Rand Index) [40], which is a metric used for comparing clustering divisions (explained further in Section 4.4). When preprocessing is applied, we can see that the ARI improves from 0.3–0.5 to 1.0, which represents perfect detection of the correct clusters. In Section 4, we provide a more in-depth evaluation using additional user combinations.

At the end of step three, we obtain Ĉ = [Ĉ1, ..., Ĉm], which hopefully satisfies the two conditions mentioned earlier, and the clustering model, which we label X. Just like transformer Q, the clustering model X also needs to be used in the testing phase, so that we can associate a test instance tFQ with a cluster Ĉi ∈ Ĉ. Thus, we return and save the trained clustering model X as well.

3.2.4. Step 4: Building sub-models

In the fourth step, we move on to building the sub-models using the keystroke dynamics algorithm selected earlier. For each Tr[Ĉi], where 1 ≤ i ≤ m, we run the keystroke dynamics algorithm and produce a model. Note that in the algorithm we use the raw instances Tr, so the algorithm can apply any feature extraction and preprocessing. Eventually, we obtain a list of keystroke dynamics models KDM = [kdm1, ..., kdmm] that we refer to as sub-models, where each kdmi for 1 ≤ i ≤ m corresponds to a cluster Ĉi. The list of sub-models KDM is also returned and saved, in addition to transformer Q and clustering model X, so they can be used in the testing phase.

3.3. Testing phase

After describing the training phase, the testing phase is far easier to explain, as it follows almost the same four steps. In the testing phase, the formal problem is as follows: we receive a test instance tr and want to associate it with one of the existing training clusters Ĉi ∈ Ĉ. Once we do that, the relevant sub-model
kdmi is used to provide the anomaly score p. This p score is finally used by the entire multi-user model to reflect the probability that the test instance tr does not belong to one of the training users U = [u1, ..., uk]; or in other words, p is the probability of tr being an anomaly.

The first step of the testing phase involves taking the test instance tr and applying the same feature extraction process using F, which transforms the test instance into tF. Once the features are extracted, in the second step, we perform the same quantile transformation, using transformer Q, which produces tFQ. In the third step, we associate the instance tFQ with a cluster Ĉi using the clustering model X. Finally, we apply the cluster's associated sub-model kdmi on tr, which produces the anomaly score p.

3.4. Formal algorithms

The training and testing phases are summarized in the pseudocode presented in Algorithms 1 and 2, respectively.

Algorithm 1 has three constants: F, τ, and γ. As discussed earlier, F is used for the feature extraction, and τ and γ are used for training the clustering model. The input is Tr = [tr1, ..., trn], which is the list of raw instances used for training. The output is made up of three parts: KDM = [kdm1, ..., kdmm], which is the list of keystroke dynamics sub-models; X, which is the trained X-means model; and Q, which is the quantile transformer that is fit to the training set. These three parts are needed in the testing phase.

The training of our method, presented in Algorithm 1, goes as follows: in line 1, it takes Tr and the requested features F and runs a function called extractDiGraphs, which extracts the feature vectors for Tr and sets them into TF. Next, in line 2, it takes the feature vectors in TF, transforms them into quantiles according to each feature, and saves the result in TFQ. We do this using a function that we refer to as fitQuantile, which also provides a fitted transformer Q, so that we can transform additional instances in the testing phase. In line 3, we cluster the instances using a function called trainXMeans, which receives TFQ, τ, and γ, and returns two outcomes: Ĉ, a vector of m lists containing the indices of Tr, and X, which is the trained clustering model. For example, if we have 10 instances and X-means found three clusters, then Ĉ might be Ĉ = [[1,2,5], [3,6,7,10], [4,8,9]], where the first cluster contains instances Ĉ1 = [1,2,5], the second cluster contains instances Ĉ2 = [3,6,7,10], and the third cluster contains instances Ĉ3 = [4,8,9]. In line 4 we initiate an empty list of sub-models called KDM, and in line 5 (including lines 5.1 and 5.2) we build the sub-models for each cluster by iterating on the clusters vector Ĉ = [Ĉ1, ..., Ĉm]. For each cluster Ĉi, where 1 ≤ i ≤ m, we take the corresponding instances using Tr[Ĉi] and use them as input to the function trainKDModel, which outputs sub-model kdmi. trainKDModel can use any keystroke dynamics algorithm, feature extraction, or preprocessing. In our evaluations we tested eight leading keystroke dynamics algorithms (presented in Section 4.3); however, any one of the many existing algorithms can also be used. Each kdmi is added to the KDM list we initiated in line 4, which will eventually contain all of the sub-models. Finally, in line 6 we return KDM, X, and Q, which are needed for the testing phase.

Algorithm 2 has different constants, inputs, and outputs from Algorithm 1. In the constants it has only F, which is used for feature extraction. It does not require training the X-means model and therefore does not need τ and γ. Its input is of course the raw instance tr, as well as KDM, X, and Q, which were generated in the training phase and are now needed to transform, cluster, and score tr. The output of the algorithm is p, the anomaly score.

The process of the algorithm is as follows: in line 1, we take the raw test instance tr and F and run the function extractDiGraphs, which sends back feature vector tF. Then, in line 2, we run a function transformQuantile that receives the fitted transformer Q and the feature vector tF, and transforms it accordingly into tFQ. In line 3, we take tFQ and the trained model X and apply testXMeans, which returns the cluster number j that identifies the relevant kdmj in the KDM list. In line 4, we take the relevant model kdmj, and in line 5, we apply testKDModel on the raw test instance tr using kdmj, in order to produce the anomaly score p, which is returned in line 6.

3.5. Support of binary classification

The scheme, as described above, is more oriented towards one-class classification algorithms; however, it can easily be adapted to support binary classification algorithms as well. Binary classification algorithms, such as SVM, Naïve Bayes, Random Forest, and Gradient Boosting, require the use of both positive and negative tagged examples. In our case, this would mean using additional examples produced by a group of impostors that does not intersect with the trained benign users. These impostors will be required to type the same fixed text as the benign users, so that we can extract the same features used in the training phase.

Once we have this group of impostors, the method can be adapted as follows: In the training phase, the first three steps
of feature extraction, quantile transformation, and clustering are performed exactly as before. The main adaptation occurs in the fourth step, in which the sub-models are built. In this case, the sub-models are built using the binary classification algorithms after adding a copy of the entire set of the impostors' instances to the benign instances in each cluster.

In the pseudocode, the significant change takes place in Algorithm 1. The function trainKDModel, line 5.1, will also receive T̄r, the set of negative examples, in addition to Tr, the positive examples. Therefore, each sub-model is created using a combination of one cluster's instances and a copy of the impostors' instances, allowing us to use binary classification algorithms as well. In Algorithm 2, there are no changes. However, having several impostors typing the same fixed text for each user is less practical in real-world applications, and therefore we did not evaluate this scenario in this study.

4. Experiments and results

4.1. Experimental process

We evaluated the performance of our method on both multi-user cases and one-user cases. For each dataset (described in Section 4.2) and each algorithm (described in Section 4.3), we trained and tested models consisting of 1–5 users. The purpose of using both one-user cases and multi-user cases is to ensure that our method can be used both on models that contain a single user, which remain the majority, and on those that contain multiple users, which also need to be supported, given how common shared accounts are today.

Our experimentation proceeded such that in each dataset we iterated on the number of users 1 ≤ k ≤ 5, and for each number of users k, we generated 100 sub-experiments in which we randomly sampled k users and applied our method's four steps. Each of the k users contributed three sessions of 10 instances each to the benign training set; thus, the size of the training set is |Tr| = 3 × 10 × k. We used three training sessions following Montalvão et al. [19], who examined the stabilization process of keystroke dynamics users and showed that the user profile usually stabilizes in the third training session.

The rest of the instances of the k users became part of the benign test set (we did not use instances from the same session in both the training and test sets). The other users (those that were not part of the k randomly sampled users) in each sub-experiment served as impostors, and they were used to evaluate the multi-user model in the testing phase. As the number of impostors is much larger than the k users, we included all of the impostors in the testing, to increase the variability of the evaluations, but only used one instance (the first) from each of them. We evaluate the performance of the first three steps in Section 4.4 and the entire four-step process in Section 4.5.

4.2. Datasets used

Several keystroke dynamics datasets have been collected and are available for use. To choose the best datasets for our evaluation, we relied on a recent survey published by Giot et al. [41] that compared 16 different datasets using many criteria, such as the number of participants, text complexity, duration period, collection environment, etc. According to Giot et al., the three best datasets are the CMU 2009 [18] (referred to as DSN 2009 in their paper), Greyc 2009 [15], and Greyc-Web 2012 passphrase [16] datasets. Fortunately, these datasets meet the requirements of our study well, for the following reasons. First, because we needed to simulate one model for multiple users, we needed all of the users in the dataset to type uniform text. Second, since we wanted to evaluate and select the best subset of di-graph features for our method, we could not consider datasets that did not contain all of the possible features or from which they could not be deduced. Third, these three datasets are widely used for benchmarking different algorithms and methods. And fourth, these were the only datasets that had at least four sessions, which, as discussed earlier, is important according to Montalvão et al. [19], who examined the stabilization process of keystroke dynamics users. Therefore, we can have three sessions for training and at least one for testing.

CMU 2009 [10] is by far the most used and cited dataset for keystroke dynamics among those surveyed by Giot et al. [41]. It was collected in 2009 at Carnegie Mellon University and contains data provided by 51 participants. Each participant was requested to attend eight sessions, and each session took place on a different day in a controlled environment in which the participants were asked to type the passphrase ".tie5Roan" 50 times. The passphrase ".tie5Roan" was chosen by the researchers to simulate a typical eight character password that is considered strong, since it contains a capital letter, a special character, and a digit. This dataset was collected so that each of the users participated in the same number of eight sessions and provided the same number of samples in each session. In our evaluations we used data from all 51 participants; three sessions from each user were assigned to training, and the remaining five sessions were assigned to testing.

Greyc 2009 [5,15,42,43] is a very popular dataset for keystroke dynamics as well. It was also collected in 2009, over a period of two months at the Greyc Laboratory at Université de Caen. The data was collected from 133 participants, which is the largest among the datasets surveyed by Giot et al. [41]. Each of the participants typed the passphrase "greyc laboratory" between five and 107 times. The dataset was collected in several sessions; in each session the users were requested to type the text six times using two different keyboards: laptop and USB plugged (for a total of 12 times). The average gap between sessions was a week; however, the collection was less strict than the CMU dataset in terms of the number of participants attending and the number of samples collected in each session. We used the data of the 77 participants who attended at least four sessions so that we could use three sessions for training and at least one for testing.

The Greyc-Web 2012 dataset [16] was collected over a period of 17 months during 2010–2012, a collection period which is by far the longest among the surveyed datasets. The collection was done in a Web-based uncontrolled environment, in which each participant could access the collection system from his/her own device. This type of collection using different devices introduced valuable variability into the dataset. In addition, Web-based systems are reflective of today's internet use, in which bank accounts, email, and social media are accessed online. The Greyc-Web 2012 dataset actually contains two datasets: (1) a dataset in which the users type a self-chosen username and password, and (2) a dataset in which all of the users typed the same username and password of "laboratoire greyc" and "sesame". Given the nature of our research, we only used the second dataset with the uniform text, which allowed us to simulate multi-user models. The dataset consists of the data of 118 participants, of which we used the 61 participants that had at least four sessions of collected data, in line with the other two datasets.

4.3. Algorithms evaluated

To evaluate our method's ability to leverage existing keystroke dynamics algorithms to support multi-user models, we looked for several state-of-the-art fixed text keystroke dynamics algorithms that perform well on one-user models. Many algorithms for keystroke dynamics have been developed and published, and
it is not possible to identify one algorithm that is unanimously known as the best, as the algorithms' performance varies depending on the circumstances and the settings of the experiment. One way to categorize keystroke dynamics algorithms is into one-class classification algorithms and binary classification algorithms. One-class classification algorithms are trained only on the benign user's examples. In contrast, binary classification algorithms are trained both on benign user samples and on impostors' samples, usually simulated by pseudo-impostors, who are essentially other users typing the same fixed text. However, the need for impostors' or pseudo-impostors' examples for each user in order to train a binary classification algorithm is impractical in real-world applications. Therefore, many studies (including ours) focus solely on one-class classification algorithms.

But even after narrowing the search down to one-class classification algorithms, there were many choices. Traditionally in the field of keystroke dynamics, it is common to use distance-based algorithms that usually do not build a sophisticated model but rather use simpler approaches that rely on distance calculations and aggregation of scores; therefore, we wanted to use several algorithms from this category. In addition, recently there has been increased use of more complicated model-based algorithms in the keystroke dynamics domain, so we wanted to use several of these algorithms as well. Therefore, we selected eight high-performing algorithms, four of each type (distance-based and model-based), that cover the entire spectrum of algorithms in the keystroke dynamics field.

Among the four distance-based algorithms used, we used two that showed high accuracy in a study performed by Killourhy et al. [10], which compared 14 one-class classification algorithms, mostly distance-based. We reproduced their experiment, applying almost the entire list of algorithms on a smaller training set (i.e., 30 samples instead of 200), and the two best performing algorithms were Outlier Count z-score and Manhattan scaled; we used both algorithms in our evaluations. Two additional distance-based algorithms that performed well in other surveys were used: the Statistical model and M2005.

Outlier Count z-score was originally described by Haider et al. [8]. In this method the mean µ and standard deviation σ of each feature i are calculated according to the training set, and when a test sample x is presented, its features are compared with the mean and standard deviation of the matching features in the training set to see if the distance exceeds a specific threshold δ. The anomaly score of x is the ratio of features whose distance exceeds the threshold, as can be seen in formula (2):

(1/n) · Σi=1..n [ 0 if |xi − µi| / σi ≤ δ; 1 if |xi − µi| / σi > δ ]    (2)

Manhattan scaled was originally described by Araujo et al. [9]. In this method, the mean µ and absolute error α of each feature i are calculated according to the training set. When a test sample x is presented, the anomaly score is calculated using the Manhattan distance between the test sample and the training vector of means, which is then scaled according to the absolute error, as can be seen in formula (3):

Σi=1..n |xi − µi| / αi    (3)

The Statistical model was first used for keystroke dynamics by Hocquet et al. [12] and was later adopted by Giot et al. [16] for the evaluation of the Greyc-Web 2012 dataset. Using the Statistical model, Giot et al. were able to show high performance with as few as 20 training samples. The algorithm works such that it fits a normal distribution µ and σ to each feature i in the training set. When a test sample x is presented, the average exponent of the distance between the test features and the training features is calculated, as shown in formula (4). Eventually the anomaly score of x is produced after subtracting the result from one, so that we receive an anomaly score instead of a similarity score:

1 − (1/n) · Σi=1..n e^(−|xi − µi| / σi)    (4)

M2005 was presented originally by Magalhães et al. [7] and later used in the evaluations of Pisani et al. [20]. This algorithm works such that for each feature i the mean µ, median η, and standard deviation σ are calculated according to the training set. When a test sample x is presented, the algorithm iterates over the test features and calculates a similarity score based on the features that are within certain boundaries defined by the training set, as can be seen in formula (5). We denote formula (5) as a function φ that returns T (i.e., True) or F (i.e., False) according to whether xi satisfies it or not. After φ(xi) is calculated for each feature i, the algorithm aggregates the similarity score as can be seen in formula (6). Here again the anomaly score is returned after subtracting the result from one, so that we receive an anomaly score instead of a similarity score:

min(µi; ηi) · (0.95 − σi/µi) ≤ xi ≤ max(µi; ηi) · (1.05 − σi/µi)    (5)

1 − (1/n) · Σi=1..n [ 1 if φ(xi) = T and i = 1; 1 if φ(xi) = T and φ(xi−1) = F; 1.5 if φ(xi) = T and φ(xi−1) = T; 0 otherwise ]    (6)

The other four, model-based, algorithms were taken from various papers, some of them very recently published. These algorithms are more complicated and cannot be represented using a specific formula like the distance-based algorithms above. The first one is the Isolation Forest, which was originally introduced in a paper by Liu et al. [44]. The algorithm is based on an ensemble of trees that are built using random splits on the training set feature range at each of the tree nodes, until the leaves are pure or a certain purity criterion has been reached. We also used One Class SVM, which was used in the evaluation of Killourhy et al. [10] as well as by many others [2]. This relatively well-known algorithm is based on finding the smallest hypersphere that encompasses the margin of the training samples.

Another two model-based algorithms we used, both of which were recently implemented successfully in the field of keystroke dynamics, are the HBOS (Histogram-Based Outlier Score) and Autoencoder algorithms. HBOS was originally presented by Goldstein et al. [45] and later used in the evaluation of Darabseh et al. [13] for keystroke dynamics, where it performed best among several one-class classification algorithms tested (e.g., CBLOF, ABOD, LOF, KNN, etc.). The algorithm constructs a univariate histogram for each feature separately and analyzes the density of the test sample according to the associated bins of each of its features. The Autoencoder is a known neural network-based technique for dimensionality reduction that has been adapted to solve one-class classification problems, as explained in [46]. The Autoencoder's one-class classification variant was recently shown to be successful in the field of keystroke dynamics by Patel et al. [47], after obtaining better results than other tested methods (i.e., GMM and KDE). The idea of this method is to use an Autoencoder neural network to learn the distribution of the training data by training it to reconstruct the training samples. When a test sample is presented, the trained Autoencoder tries to reconstruct it. If the Autoencoder does not manage to reconstruct the test sample well, it means that its distribution is different from the one in the training set, which increases the probability of the sample being generated by an impostor. Note that the experiments using
Table 1
Average number of clusters detected with quantile transformation and without it.

# of users | With quantile transformation              | Without quantile transformation
           | CMU 2009 | Greyc 2009 | Greyc-Web 2012   | CMU 2009 | Greyc 2009 | Greyc-Web 2012
1          | 1.16     | 1.04       | 1.23             | 6.84     | 5.92       | 7.70
2          | 2.40     | 2.03       | 2.22             | 7.55     | 7.60       | 7.82
3          | 3.65     | 3.16       | 3.20             | 7.57     | 7.97       | 7.80
4          | 5.13     | 4.26       | 4.30             | 7.59     | 7.66       | 7.68
5          | 6.54     | 5.35       | 5.38             | 7.55     | 7.69       | 7.66

Table 2
Average ARI of the clusters detected with quantile transformation and without it.

# of users | With quantile transformation              | Without quantile transformation
           | CMU 2009 | Greyc 2009 | Greyc-Web 2012   | CMU 2009 | Greyc 2009 | Greyc-Web 2012
1          | 0.88     | 0.96       | 0.87             | 0        | 0.04       | 0
2          | 0.83     | 0.90       | 0.94             | 0.34     | 0.47       | 0.26
3          | 0.82     | 0.91       | 0.93             | 0.33     | 0.61       | 0.21
4          | 0.77     | 0.88       | 0.92             | 0.29     | 0.57       | 0.19
5          | 0.76     | 0.87       | 0.91             | 0.25     | 0.5        | 0.15
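As a rough illustration of the per-feature quantile transformation whose impact Tables 1 and 2 measure, the following simplified sketch maps a feature value to its empirical quantile in the training data. This is not the authors' exact transformer; the class name and the sample values are hypothetical.

```python
from bisect import bisect_left

class QuantileTransformer1D:
    """Simplified per-feature quantile map to [0, 1] (illustrative only)."""

    def fit(self, values):
        # Store the sorted training values of one feature (e.g., one di-graph latency).
        self.sorted_ = sorted(values)
        return self

    def transform(self, x):
        # Empirical quantile: fraction of training values strictly below x.
        return bisect_left(self.sorted_, x) / len(self.sorted_)

# Hypothetical di-graph latencies in milliseconds:
q = QuantileTransformer1D().fit([110.0, 95.0, 140.0, 120.0])
```

A transformer of this kind is fitted once at training time (the paper's Q) and reused verbatim at testing time, so training and test instances land on the same scale before clustering.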
the Autoencoder took significantly longer, and thus we had to reduce the number of sub-experiments to 30 in this case.

4.4. Clustering correctness

Before evaluating the performance of the leveraged keystroke dynamics models, we wanted to evaluate our method's ability to separate the training set Tr into clusters Ĉ = [Ĉ1, ..., Ĉm] that are as similar as possible to the correct clusters of users C = [C1, ..., Ck]. This means that they should address the two conditions presented in Section 3.2: (1) m = k, and (2) for every i with Ĉi ∈ Ĉ there exists j with Cj ∈ C such that Ĉi = Cj.

To evaluate condition (1), for each dataset and number of users k we performed 100 sub-experiments and compared the number of clusters detected, m, to the real number k. To assess the impact of quantile transformation on the clustering, we performed the evaluations both with quantile transformation and without it. Table 1 presents the average number of clusters detected m in each dataset according to the number of users k in the training set, both with and without quantile transformation.

In the columns on the right side of Table 1, without the quantile transformation, we can see that the number of clusters detected does not correlate with the number of users at all. Furthermore, the MAE (Mean Absolute Error) from the correct number is 4.5. This means that the clustering algorithm incorrectly deviates from the correct number of users in the training set by 4.5, on average. Therefore, we can conclude that directly inputting the di-graph features into X-means without performing quantile transformation is of very little value. On the other hand, in the columns on the left side of Table 1, with quantile transformation, we observe that the number of clusters detected is much more closely tied to the number of users in the training set, with an MAE of just 0.4. This means that on average X-means deviates from the correct number of users in the training set by only 0.4. We can also see that there is a minor overestimation in the number of clusters. The overestimation is relatively low with the Greyc 2009 and Greyc-Web 2012 datasets (respectively, 5% and 11% on average) and is a bit larger with the CMU 2009 dataset (23% on average). We believe that the reason for the better performance with the Greyc 2009 and Greyc-Web 2012 datasets is related to the fact that these datasets were collected in less controlled environments and over a longer period of time, allowing the participants to develop a more distinctive behavioral pattern.

Although the number of clusters detected somewhat reflects the correctness of the division, it does not provide an in-depth inspection of the instances in the clusters. For this reason, we performed a second evaluation that targets condition (2), to see how well the instances are divided according to the users. Many metrics can be used to evaluate the correctness of the clustering algorithm in dividing the instances, and we chose to use the ARI (Adjusted Rand Index) [40].

The ARI is a method used to score the similarity between two groups of clusters. In our case, we would like to compare C = [C1, ..., Ck] against Ĉ = [Ĉ1, ..., Ĉm], which in turn will reflect the correctness of our method. The ARI has two steps: first, it calculates the Rand Index (RI), and second, it adjusts the score to lie on a certain range. The idea behind the RI is to look at the relations between the instances within the groups. It does so by observing all pairs of instances in one group and assessing whether their relation is maintained in the second group. In other words, if a certain pair of instances were in the same cluster in the first group, they should be in the same cluster in the second group; if they were not in the same cluster in the first group, then they should not be in the same cluster in the second group. More formally, RI sums the number of pairs x, y according to formula (7). Then, the ARI performs a scaling adjustment, in which the score is placed on a scale ranging from expected by chance to a perfect clustering. It does so by calculating two additional values: E(RI), which is the number of correct pairs expected by chance, according to the number of groups m and k; and Max(RI), which is the maximal number of pairs, without significance to their order. Finally, the ARI is calculated according to formula (8):

RI = Σ over pairs x, y ∈ ∪C of [ 1 if x, y ∈ Ci and x, y ∈ Ĉj; 1 if x ∈ Ci, y ∉ Ci and x ∈ Ĉj, y ∉ Ĉj; 0 otherwise ]    (7)

ARI = (RI − E(RI)) / (Max(RI) − E(RI))    (8)

This structure of the ARI has three advantages: (1) it eliminates the effect of randomly assigning an element to the correct cluster, by assigning a score of zero to the expected-by-chance correct clustering; (2) it does not tend towards over-clustering or under-clustering, as both will be equally useless when considering relation pairs; (3) it does not assume that the number of clusters in the first group and the number of clusters in the second group are the same, which allows us to compare the two groups of clusters even though the number of clusters may differ. For these reasons, we used the ARI to evaluate the second condition we tried to optimize in our clustering. Table 2 presents the ARI scores for each dataset, according to the number of users in the training set, again with and without using quantile transformation.

In Table 2 we can see that when quantile transformation is not performed, the average ARI for all of the datasets and numbers of users is only 0.28, meaning the number of correctly divided pairs is only 28% better than expected by chance. However, when applying quantile transformation, the average ARI jumps to 0.88, which is much closer to a perfect clustering. It can also be observed that the Greyc 2009 and Greyc-Web 2012 datasets obtained higher ARIs of 0.90 and 0.91 on average, respectively, whereas with the CMU 2009 dataset the ARI was a bit lower (0.81 on average).
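As an illustration of formulas (7) and (8), the following sketch computes the ARI through the standard contingency-table form, which is mathematically equivalent to the pair-counting definition above. The function and variable names are ours, not from the paper.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """ARI via contingency counts; equivalent to the pair-counting RI adjustment."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # cluster sizes in the first grouping
    b = Counter(labels_pred)   # cluster sizes in the second grouping
    # Pairs placed together in both groupings (the "agreeing" pairs of RI).
    index = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)   # E(RI): agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # Max(RI): best achievable agreement
    return (index - expected) / (max_index - expected)
```

Note that the label values themselves do not matter, only the grouping: a perfect clustering with permuted cluster IDs still scores 1.0, which is exactly the property needed when comparing X-means output against the true user assignment.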
Fig. 3a. Average AUC of the M2005 algorithm with the three datasets as a function of the number of users.
Fig. 3b. Average AUC of the Outlier Count z-score algorithm with the three datasets as a function of the number of users.
Fig. 3c. Average AUC of the Manhattan scaled algorithm with the three datasets as a function of the number of users.
Fig. 3d. Average AUC of the Statistical algorithm with the three datasets as a function of the number of users.
Fig. 3e. Average AUC of the One Class SVM algorithm with the three datasets as a function of the number of users.
Fig. 3f. Average AUC of the Isolation Forest algorithm with the three datasets as a function of the number of users.
Fig. 3g. Average AUC of the HBOS algorithm with the three datasets as a function of the number of users.
Fig. 3h. Average AUC of the Autoencoder algorithm with the three datasets as a function of the number of users.
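The AUC reported in Figs. 3a–3h measures how well an anomaly score ranks impostor instances above benign ones. A minimal rank-based sketch on hypothetical scores follows; this is our own illustration, not the authors' evaluation code.

```python
def auc_score(benign_scores, impostor_scores):
    """AUC = P(random impostor's anomaly score > random benign's), ties count 0.5."""
    wins = 0.0
    for b in benign_scores:
        for i in impostor_scores:
            if i > b:
                wins += 1.0
            elif i == b:
                wins += 0.5
    return wins / (len(benign_scores) * len(impostor_scores))
```

An AUC of 1.0 means every impostor received a higher anomaly score than every benign instance; 0.5 is chance-level separation.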
4.5. Sub-model accuracy

Having established that our method is capable of correctly dividing the data into clusters, we moved on to integrating our method with the existing keystroke dynamics algorithms and training a sub-model on each cluster's instances. With the trained sub-models, we can evaluate the effectiveness of our method in its entirety using the eight state-of-the-art keystroke dynamics algorithms presented in Section 4.3.

We compared the performance of the keystroke dynamics algorithms when using our four-step method (referred to as our method) to the performance of the same keystroke dynamics algorithms as is, without being wrapped by our method (referred to as the current method). This was done in order to obtain a baseline that can be used to assess the effectiveness of our method. To ensure a fair comparison between our method and the current method, we ran the same keystroke dynamics algorithms with the same configurations, and whenever random functionality was used, the same seed was reused.

We analyzed both the AUC and the EER obtained by each keystroke dynamics algorithm for each dataset according to the number of users in the training set. Figs. 3a–3h present the average AUC, and Figs. 4a–4h present the average EER. Each of the figures presents the results of one of the keystroke dynamics algorithms examined. Each figure contains three plots, one for each of the datasets evaluated: CMU 2009, Greyc 2009, and Greyc-Web 2012. Each plot contains two curves showing the average AUC or EER (y-axis) as a function of the number of users (x-axis) in the training set. The blue curve indicates the performance using the current method, and the orange curve shows the performance using our method. The results were rounded to two digits after the decimal point.

In Figs. 3a–3h and Figs. 4a–4h, we can clearly see the advantage of our method. In terms of both the AUC and the EER, our method was shown to be better than the current method with statistical significance (p-value < 0.0001; df = 119). Performing a more fine-grained analysis, in the multi-user cases (i.e., k > 1), our method is always better than the current method, with an average AUC increase of 9.2% and an average EER decrease of 8.6%. We can even see that as the number of users grows, the level of improvement increases; this can be seen in the gap between the curves on the graphs, which gets larger as the number of users increases. In the one-user cases (i.e., k = 1) the results were almost the same, but very slightly worse when using our method, with a decrease in the average AUC of 0.2% and an increase in the average EER of 0.3%.
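The wrapping scheme evaluated in this section can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: K-means stands in for the clustering step, One Class SVM (one of the eight evaluated algorithms) is the wrapped verifier, and scoring a login attempt by the maximum over the per-cluster sub-models is our simplifying choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

def train_sub_models(X_train, n_clusters, seed=0):
    """Cluster the training instances, then fit one one-class verifier per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_train)
    return [OneClassSVM(gamma="scale").fit(X_train[km.labels_ == c])
            for c in range(n_clusters)]

def verification_score(sub_models, x):
    """Score a login attempt: accept if ANY account holder's sub-model matches,
    i.e. take the maximum decision value over the sub-models (our assumption)."""
    x = np.asarray(x).reshape(1, -1)
    return max(float(m.decision_function(x)[0]) for m in sub_models)

# Toy account shared by two "users" with distinct typing-time profiles.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.10, 0.01, (30, 4)),   # user A latencies (s)
               rng.normal(0.40, 0.01, (30, 4))])  # user B latencies (s)
models = train_sub_models(X, n_clusters=2)
genuine = verification_score(models, [0.10, 0.11, 0.10, 0.09])
impostor = verification_score(models, [0.90, 0.85, 0.95, 0.90])
assert genuine > impostor
```

A genuine instance close to either user's cluster receives a higher score than an instance far from both, which is what allows a shared account to be verified without mixing the users' profiles in one model.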

Fig. 4a. Average EER of the M2005 algorithm with the three datasets as a function of the number of users.
Fig. 4b. Average EER of the Outlier Count z-score algorithm with the three datasets as a function of the number of users.
Fig. 4c. Average EER of the Manhattan scaled algorithm with the three datasets as a function of the number of users.
Fig. 4d. Average EER of the Statistical algorithm with the three datasets as a function of the number of users.
Fig. 4e. Average EER of the One Class SVM algorithm with the three datasets as a function of the number of users.
Fig. 4f. Average EER of the Isolation Forest algorithm with the three datasets as a function of the number of users.
Fig. 4g. Average EER of the HBOS algorithm with the three datasets as a function of the number of users.
Fig. 4h. Average EER of the Autoencoder algorithm with the three datasets as a function of the number of users.
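For reference, the two metrics reported in these figures can be computed from a set of genuine/impostor verification scores as follows. This is a minimal sketch using sklearn's ROC utilities (not the authors' evaluation code), with the EER approximated at the point where the false-positive and false-negative rates cross:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_eer(y_true, scores):
    """AUC from sklearn; EER taken where FPR and FNR (= 1 - TPR) intersect."""
    auc = roc_auc_score(y_true, scores)
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - fnr)))
    eer = (fpr[i] + fnr[i]) / 2.0
    return round(auc, 2), round(eer, 2)  # rounded to two digits, as in the text

# synthetic genuine (label 1) and impostor (label 0) scores
rng = np.random.default_rng(0)
y = np.array([1] * 100 + [0] * 100)
scores = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100)])
auc, eer = auc_and_eer(y, scores)
assert auc > 0.5 and 0.0 <= eer < 0.5
```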

Table 3 summarizes the average AUC and EER results for each of the three datasets for the one-user cases and the multi-user cases.

Table 3
Average AUC and EER of the current method vs. our method in one-user and multi-user cases.

         Dataset           One-user              Multi-user
                           Current   Our         Current   Our
                           method    method      method    method
AUC      CMU 2009          0.948     0.942       0.831     0.885
         Greyc 2009        0.933     0.934       0.766     0.892
         Greyc-Web 2012    0.941     0.940       0.793     0.889
EER      CMU 2009          0.105     0.112       0.239     0.187
         Greyc 2009        0.120     0.118       0.291     0.175
         Greyc-Web 2012    0.108     0.110       0.268     0.177

In Table 3 we can see that in the multi-user cases the improvement of using our method is apparent in all three datasets, with a greater effect on the Greyc 2009 and Greyc-Web 2012 datasets (an average AUC increase of 12.6% and 9.6%, respectively, and an average EER decrease of 11.7% and 9.1%, respectively), but also an influence on the CMU 2009 dataset (an average AUC increase of 5.4% and an average EER decrease of 5.3%). In the one-user cases we observe around the same AUC (with slight differences of −0.62%, +0.06%, and −0.12% in the CMU 2009, Greyc 2009, and Greyc-Web 2012 datasets, respectively), and the same trend is seen in the EER (with slight differences of +0.73%, −0.17%, and +0.21%).

4.6. Technical implementation

The implementation of the first three steps of our method, in both the training and testing phases, is described below. The first step, feature extraction, was done directly on the datasets using our implementation of di-graph extraction. Then, to perform quantile transformation, we used the implementation from the sklearn package [48], a widely used Python package for machine learning. Since sklearn does not allow missing values, we replaced missing values in each feature with the median, after trying several alternatives (e.g., average, max, min). Lastly, in order to apply X-means clustering, we used the implementation from [49], available on GitHub. In contrast to many other clustering algorithms, X-means is not implemented in the sklearn package; we therefore sought the implementation that seemed most suitable and comprehensive.

As discussed earlier, the X-means algorithm has two parameters. We set τ, the maximal number of iterations, to 100, and γ, the maximal number of clusters, to 10.
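The preprocessing just described can be sketched with sklearn as below. This is an illustrative reconstruction, not the authors' code; the imputation step must run first because sklearn's QuantileTransformer rejects missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Median imputation first (sklearn transformers reject NaN), then the
# quantile transformation used in step two of the method.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("quantile", QuantileTransformer(n_quantiles=6, output_distribution="uniform")),
])

# di-graph timing features with a few missing values
X = np.array([[0.10, 0.21], [0.12, np.nan], [0.11, 0.19],
              [0.30, 0.25], [np.nan, 0.22], [0.13, 0.20]])
X_t = preprocess.fit_transform(X)

assert X_t.shape == X.shape
assert not np.isnan(X_t).any()                 # imputation removed all NaNs
assert X_t.min() >= 0.0 and X_t.max() <= 1.0   # uniform output lies in [0, 1]
```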


Fig. 5. Average number of clusters detected as a function of the number of instances in each training session, for 1–5 users in the training set.
Fig. 6. Average ARI as a function of the number of users in each training session, with 5–50 users in the training set.
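X-means [38] chooses the number of clusters by repeatedly splitting clusters and keeping a split only when it improves the Bayesian Information Criterion. The sketch below is a simplified stand-in for illustration only: a global BIC sweep over K-means fits rather than the local split-and-test of true X-means, with a spherical-Gaussian BIC of our own:

```python
import numpy as np
from sklearn.cluster import KMeans

def bic_score(X, km):
    """Simplified spherical-Gaussian BIC for a fitted KMeans model (our approximation)."""
    n, d = X.shape
    k = km.n_clusters
    sigma2 = max(km.inertia_ / (n * d), 1e-12)        # pooled per-dimension variance
    sizes = np.bincount(km.labels_, minlength=k).astype(float)
    nz = sizes[sizes > 0]
    log_lik = (np.sum(nz * np.log(nz / n))            # cluster-membership term
               - 0.5 * n * d * np.log(2 * np.pi * sigma2)
               - 0.5 * km.inertia_ / sigma2)          # residual sum-of-squares term
    n_params = k * d + 1                              # k centers plus shared variance
    return log_lik - 0.5 * n_params * np.log(n)

def select_k(X, k_max=10, seed=0):
    """Sweep k = 1..k_max and keep the fit with the highest BIC."""
    fits = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
            for k in range(1, k_max + 1)]
    return max(fits, key=lambda f: bic_score(X, f)).n_clusters

# three tight, well-separated synthetic "users"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.05, (40, 3)) for m in (0.0, 1.0, 2.0)])
assert select_k(X) == 3
```

The k_max argument plays the role of γ above: it caps the search so the selection cannot over-fragment the data.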

Setting the number of iterations to 100 should be large enough to allow the algorithm to converge before it is forcibly stopped, and setting the maximal number of clusters to 10 should help us both eliminate under-clustering and avoid overfitting to the actual maximum we used (five), as it is twice that number.

The implementation of the fourth step depended on the keystroke dynamics algorithm selected. The distance-based algorithms were relatively straightforward, so we implemented them ourselves according to the processes described in the papers. The model-based methods were somewhat more complicated but had implementations in well-known Python packages that we could use: the HBOS and Autoencoder algorithms are available in the PyOD package [50], and Isolation Forest and One Class SVM are available in the sklearn package [48].

4.7. Number of instances in a session

An interesting question we investigated is how the number of training instances affects our method's performance. Our concern here is that the number of clusters will be affected by the number of instances, meaning that few samples will result in under-clustering and many samples will result in over-clustering. Examining this issue helps us understand our method's limitations and determine both the minimal and the optimal number of instances needed.

In order to address this, we ran the 100 sub-experiments on the three datasets again, with a focus on the first three steps of our method. In each sub-experiment, we measured the number of clusters detected m to see if it fits the number of users k, while repeating the experiments with different sizes of each user's training set. To preserve the distribution of sessions in the training set versus those in the test set, we fixed the number of training sessions at three and only changed the number of instances in each session, with one to 20 instances in each. In Fig. 5 we show the average number of clusters detected as a function of the number of training instances in each of the three training sessions for 1 ≤ k ≤ 5 users in the training set, in five different colors.

As seen in Fig. 5, in all three datasets the number of clusters detected is the most accurate when there are seven instances in each of the three training sessions (i.e., 3 × 7 = 21 instances in total), with a minor deviation of only 0.5% from the actual number of users in the training set. We can also see that with the Greyc 2009 and Greyc-Web 2012 datasets the number of clusters detected converges after 12 and 10 instances, respectively, to the correct number of clusters, with a minor overestimation of 11% and 8% (on average), respectively. On the other hand, with the CMU 2009 dataset, the number of clusters seems to keep growing and does not converge well, with an average overestimation of 44% when there are 10 or more instances in a training session. The results with the CMU 2009 data are once again, to our understanding, an outcome of the very controlled environment in which the dataset was collected, which made the users somewhat more difficult to separate compared to the Greyc 2009 and Greyc-Web 2012 datasets.

In these sub-experiments, we could also see that our method underestimates the number of users when there are six or fewer instances in each of the three training sessions. This was seen consistently across all of the datasets: the minimum number of instances needed for correct classification is seven in each of the three training sessions.

4.8. Number of users in the model

Another interesting question we addressed, to examine the limitations of our method, is the maximal number of users our method can support. This is not a trivial question to answer, as Figs. 3a–3h and Figs. 4a–4h showed that as the number of users grows, so does the gap in favor of our method, meaning that our method's improvement over the current method is greater when there is a larger number of users. However, there is a concern that too many users can result in many errors and the production of meaningless clusters. In order to evaluate this scenario, we focused our evaluation on the ARI score of the clustering. As mentioned earlier, the ARI is a metric used to assess the division of the data into clusters based on pairs of instances; an ARI score of zero represents a random division into clusters, and a score of one means that the data was clustered perfectly.

The 100 sub-experiments on the three datasets were repeated here as well. In each sub-experiment, we measured the ARI, repeating the experiments while changing the number of users k from five to 50 (in intervals of five), to see how our method copes with many users. The number 50 was chosen as it is the maximum we can use in all three datasets. We set γ, the maximal number of clusters, to 200, so that it could support the number of required clusters while still not overfitting, and τ remained at 100. In Fig. 6 we show the ARI score as a function of the number of users in the training set for 5 ≤ k ≤ 50; the red points indicate the average score, and the gray vertical lines indicate the standard deviation.

As shown in Fig. 6, the average ARI with five users is 0.89, 0.92, and 0.76 with the Greyc 2009, Greyc-Web 2012, and CMU 2009 datasets, respectively. As we expected, the average ARI decreases as the number of users in the training set increases; however, the rate of decrease is nonlinear and diminishing: with more users, the decrease over the same delta of five users becomes smaller. In the first delta, of 5–10 users, there is a decrease of 6.9%, 4.6%, and 9.7% in the Greyc 2009, Greyc-Web 2012, and CMU 2009 datasets, respectively. In the last delta, of 45–50 users, we see a much smaller decrease of 1.1%, 2.5%, and 2.3%, respectively. We can also see that with up to 10 users in the Greyc 2009 dataset and up to 15 users in the Greyc-Web 2012 dataset, the average ARI is very high (above 0.8), while with the CMU 2009 dataset the average ARI is above 0.8 only for up to three users (see Table 3).

Overall, we could see that the results are well above zero, meaning that the algorithm is able to cluster the data meaningfully even when there are many users.
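As a concrete illustration of how the ARI behaves at these two extremes, assuming sklearn's adjusted_rand_score:

```python
from sklearn.metrics import adjusted_rand_score

# 21 instances: three users, seven instances each (the minimum found above)
true_users = [0] * 7 + [1] * 7 + [2] * 7

perfect = [2] * 7 + [0] * 7 + [1] * 7   # same partition, different label names
mixed   = [0, 1, 2] * 7                 # each cluster mixes all three users evenly

assert adjusted_rand_score(true_users, perfect) == 1.0   # ARI ignores label names
assert abs(adjusted_rand_score(true_users, mixed)) < 0.2 # near zero: no better than chance
```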

In particular, the average ARI is quite high with the Greyc 2009 and Greyc-Web 2012 datasets (0.67 and 0.62, respectively) for 50 users, indicating that the algorithm does a much better job than a random guess when there are many users. With the CMU 2009 dataset the average ARI is a bit lower (0.37) but is still far better than a random division into clusters.

Table 4
Average AUC and EER of K-means vs. X-means in one-user and multi-user cases.

         Dataset           One-user              Multi-user
                           K-means   X-means     K-means   X-means
AUC      CMU 2009          0.948     0.942       0.897     0.885
         Greyc 2009        0.933     0.934       0.893     0.892
         Greyc-Web 2012    0.941     0.940       0.892     0.889
EER      CMU 2009          0.105     0.112       0.172     0.187
         Greyc 2009        0.120     0.118       0.173     0.175
         Greyc-Web 2012    0.109     0.110       0.174     0.177

4.9. Accounts with a known number of users

The last open question we had, which is a slight deviation from the main topic of this paper, is how to handle multi-user cases in which we do know the number of users k beforehand. Should X-means, which is a nonparametric clustering algorithm, still be used, or is it better to use a parametric clustering algorithm? To address this question, we evaluated several clustering algorithms that receive the number of clusters as input. We ran the 100 sub-experiments on the three datasets and the eight different keystroke dynamics algorithms again with 1 ≤ k ≤ 5 users, always inputting the correct number of clusters. Table 4 presents the AUC and EER results of X-means and K-means for each dataset, in both the one-user and multi-user cases.

In Table 4 we can see that when the number of users k is known beforehand, it is slightly better to replace X-means with K-means and to set the K parameter to the number of users. In the one-user cases, there are minor differences in the AUC of +0.6%, −0.1%, and +0.1%, and in the EER of −0.7%, +0.2%, and −0.1%, respectively, for the CMU 2009, Greyc 2009, and Greyc-Web 2012 datasets when changing the clustering algorithm to K-means. In the multi-user cases, K-means is consistently shown to perform better, although the differences are still minor, with AUC differences of +1.2%, +0.1%, and +0.3% and EER differences of −1.5%, −0.2%, and −0.3%, respectively, for the CMU 2009, Greyc 2009, and Greyc-Web 2012 datasets.
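When k is known, the swap amounts to a one-line change in the clustering step. A minimal sketch of this case, using sklearn's KMeans as our illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_known_k(X, n_users, seed=0):
    """With the number of account users known, K-means with K = n_users
    replaces the nonparametric X-means step; the other steps are unchanged."""
    km = KMeans(n_clusters=n_users, n_init=10, random_state=seed).fit(X)
    return km.labels_

# toy account with two well-separated typing profiles
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (20, 4)), rng.normal(1.0, 0.05, (20, 4))])
labels = cluster_known_k(X, n_users=2)

assert set(labels) == {0, 1}
assert len(set(labels[:20])) == 1 and len(set(labels[20:])) == 1  # clean split
```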
5. Conclusions

In this paper, we defined the problem faced by keystroke dynamics algorithms when there are multiple users sharing an account and provided the motivation for its study. To address this problem, we proposed a four-step method that wraps existing keystroke dynamics algorithms and leverages them to support multiple-user accounts for which the number of users is unknown and the association between the instances and the users is unknown. We described our method's four steps: feature extraction (press-to-release and release-to-release), quantile transformation, X-means clustering, and training sub-models, and discussed the importance of each of the steps.

We evaluated our method using eight state-of-the-art keystroke dynamics algorithms, both distance-based and model-based, on three publicly available, commonly used datasets. Based on our evaluations, we conclude the following: (a) our method can divide the instances into clusters according to the original users with high accuracy, and (b) our method dramatically improves keystroke dynamics models, enabling them to verify a user's identity in both one-user and multi-user cases. We supported claim (a) by experimenting with a varied number of users and several different datasets, aimed at demonstrating the method's ability to divide the data into the correct clusters according to the users; in these experiments our method performed well, with accurate detection of the number of clusters and a correct inner division into clusters. We supported claim (b) by experimenting with eight state-of-the-art keystroke dynamics algorithms on three different datasets while changing the number of users; these experiments demonstrated our method's ability to improve the AUC and EER (compared to the current method) with statistical significance. In the multi-user cases our method always improved the results, though in the one-user cases we observed a slight reduction. These experiments also clearly showed that the value of our method increases with each additional user in the model.

We assessed the limitations of the method in two complementary evaluations. First, we examined how the number of instances in the training set affects our method's performance, and second, we looked at our method's performance when there are many users in one model. We found that too few instances causes a reduction in the model's performance, but when there is a sufficient number of training instances, the algorithm performs well, with high accuracy. We also found that the method continues to perform better than the current method when the number of users in the training set is very high; however, the ability to correctly divide the data into clusters decreases with the addition of more users.

Lastly, we were curious about the best way to handle multi-user keystroke dynamics models when the number of users is known in advance. In this case, we observed that using the K-means clustering algorithm with K set to the number of users is better than using the nonparametric X-means clustering algorithm, although the improvement is minor.

CRediT authorship contribution statement

Itay Hazan: Conceptualization, Methodology, Software, Investigation, Validation, Visualization, Writing - original draft. Oded Margalit: Conceptualization, Writing - review & editing, Supervision. Lior Rokach: Conceptualization, Writing - review & editing, Supervision.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: the first author (Itay Hazan) is an IBM employee, and the second author (Oded Margalit) was an IBM employee while most of the work was done.

References

[1] Interpol, https://www.interpol.int/en/News-and-Events/News/2020/Preventing-crime-and-protecting-police-INTERPOL-s-COVID-19-global-threat-assessment.
[2] Pin Shen Teh, Andrew Beng Jin Teoh, Shigang Yue, A survey of keystroke dynamics biometrics, Sci. World J. 2013 (2013).
[3] Eesa Al Solami, et al., Continuous biometric authentication: Can it be more practical?, in: High Performance Computing and Communications (HPCC), 2010 12th IEEE International Conference on, IEEE, 2010.
[4] Robert Moskovitch, et al., Identity theft, computers and behavioral biometrics, in: Intelligence and Security Informatics, 2009. ISI'09. IEEE International Conference on, IEEE, 2009.
[5] Romain Giot, Baptiste Hemery, Christophe Rosenberger, Low cost and usable multimodal biometric system based on keystroke dynamics and 2D face recognition, in: Pattern Recognition (ICPR), 2010 20th International Conference on, IEEE, 2010.


[6] Salil P. Banerjee, Damon L. Woodard, Biometric authentication and identification using keystroke dynamics: A survey, J. Pattern Recognit. Res. 7 (1) (2012) 116–139.
[7] Paulo Sérgio Tenreiro Magalhães, Henrique Dinis dos Santos, An improved statistical keystroke dynamics algorithm, in: Proceedings of the IADIS MCCSIS, 2005.
[8] S. Haider, A. Abbas, A.K. Zaidi, A multi-technique approach for user identification through keystroke dynamics, in: IEEE International Conference on Systems, Man, and Cybernetics (ICSMC), vol. 2, 2000, pp. 1336–1341.
[9] Lívia C.F. Araújo, et al., User authentication through typing biometrics features, IEEE Trans. Signal Process. 53 (2) (2005) 851–855.
[10] Kevin S. Killourhy, Roy A. Maxion, Comparing anomaly-detection algorithms for keystroke dynamics, in: Dependable Systems & Networks, 2009. DSN'09. IEEE/IFIP International Conference on, IEEE, 2009.
[11] Kai Song, et al., Isolated forest in keystroke dynamics-based authentication: Only normal instances available for training, in: 2017 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA), IEEE, 2017.
[12] Sylvain Hocquet, J.-Y. Ramel, Hubert Carbot, Estimation of user specific parameters in one-class problems, in: 18th International Conference on Pattern Recognition (ICPR'06), vol. 4, IEEE, 2006.
[13] Alaa Darabseh, Doyel Pal, Performance analysis of keystroke dynamics using classification algorithms, in: 2020 3rd International Conference on Information and Computer Technologies (ICICT), IEEE, 2020.
[14] Yogesh Patel, et al., Keystroke dynamics using auto encoders, in: 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), IEEE, 2019.
[15] Romain Giot, Mohamad El-Abed, Christophe Rosenberger, Greyc keystroke: a benchmark for keystroke dynamics biometric systems, in: 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems, IEEE, 2009.
[16] Romain Giot, Mohamad El-Abed, Christophe Rosenberger, Web-based benchmark for keystroke dynamics biometric systems: A statistical analysis, in: 2012 Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IEEE, 2012.
[17] R. Stockton Gaines, et al., Authentication By Keystroke Timing: Some Preliminary Results, No. RAND-R-2526-NSF, Rand Corp, Santa Monica, CA, 1980.
[18] John Leggett, Glen Williams, Verifying identity via keystroke characteristics, Int. J. Man-Mach. Stud. 28 (1) (1988) 67–76.
[19] Jugurta Montalvão, et al., Contributions to empirical analysis of keystroke dynamics in passwords, Pattern Recognit. Lett. 52 (2015) 80–86.
[20] Paulo Henrique Pisani, Ana Carolina Lorena, André C.P.L.F. de Carvalho, Adaptive approaches for keystroke dynamics, in: Neural Networks (IJCNN), 2015 International Joint Conference on, IEEE, 2015.
[21] Abdul Serwadda, Vir V. Phoha, Examining a large keystroke biometrics dataset for statistical-attack openings, ACM Trans. Inf. Syst. Secur. 16 (2) (2013) 8.
[22] Patrick Bours, Jørgen Ellingsen, Cross keyboard keystroke dynamics, in: 2018 1st International Conference on Computer Applications & Information Security (ICCAIS), IEEE, 2018.
[23] Jiacang Ho, Dae-Ki Kang, Mini-batch bagging and attribute ranking for accurate user authentication in keystroke dynamics, Pattern Recognit. 70 (2017) 139–151.
[24] Nataasha Raul, Royston D'mello, Mandar Bhalerao, Keystroke dynamics authentication using small datasets, in: International Conference on Security & Privacy, Springer, Singapore, 2019.
[25] Denis Migdal, Christophe Rosenberger, Statistical modeling of keystroke dynamics samples for the generation of synthetic datasets, Future Gener. Comput. Syst. (2019).
[26] Soumik Mondal, Patrick Bours, A study on continuous authentication using a combination of keystroke and mouse biometrics, Neurocomputing 230 (2017) 1–22.
[27] Cristiano Giuffrida, et al., I sensed it was you: authenticating mobile users with sensor-enhanced keystroke dynamics, in: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, Cham, 2014.
[28] Margit Antal, László Zsolt Szabó, An evaluation of one-class and two-class classification algorithms for keystroke dynamics authentication on mobile devices, in: 2015 20th International Conference on Control Systems and Computer Science, IEEE, 2015.
[29] Paulo Henrique Pisani, Ana Carolina Lorena, A systematic review on keystroke dynamics, J. Braz. Comput. Soc. 19 (4) (2013) 573.
[30] John V. Monaco, Robust keystroke biometric anomaly detection, 2016, arXiv preprint arXiv:1606.09075.
[31] Yunbin Deng, Yu Zhong, Keystroke dynamics user authentication based on gaussian mixture model and deep belief nets, in: ISRN Signal Processing 2013, 2013.
[32] Aparna Bhatia, et al., Keystroke dynamics based authentication using GFM, in: 2018 IEEE International Symposium on Technologies for Homeland Security (HST), IEEE, 2018.
[33] Seong-seob Hwang, Hyoung-joo Lee, Sungzoon Cho, Account-sharing detection through keystroke dynamics analysis, Int. J. Electron. Commer. 14 (2) (2009) 109–126.
[34] Abir Mhenni, et al., Double serial adaptation mechanism for keystroke dynamics authentication based on a single password, Comput. Secur. 83 (2019) 151–166.
[35] Blaine Ayotte, et al., Fast free-text authentication via instance-based keystroke dynamics, IEEE Trans. Biometr. Behav. Identity Sci. 2 (4) (2020) 377–387.
[36] Ian Jolliffe, Principal Component Analysis, Springer, Berlin Heidelberg, 2011.
[37] James MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, 1967.
[38] Dan Pelleg, Andrew W. Moore, X-means: extending k-means with efficient estimation of the number of clusters, in: ICML, vol. 1, 2000.
[39] Robert E. Kass, Larry Wasserman, A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion, J. Amer. Statist. Assoc. 90 (431) (1995) 928–934.
[40] Lawrence Hubert, Phipps Arabie, Comparing partitions, J. Classification 2 (1) (1985) 193–218.
[41] Romain Giot, Bernadette Dorizzi, Christophe Rosenberger, A review on the public benchmark databases for static keystroke dynamics, Comput. Secur. 55 (2015) 46–61.
[42] Romain Giot, Mohamad El-Abed, Christophe Rosenberger, Keystroke dynamics with low constraints svm based passphrase enrollment, in: 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems, IEEE, 2009.
[43] Romain Giot, Mohamad El-Abed, Christophe Rosenberger, Keystroke dynamics authentication for collaborative systems, in: 2009 International Symposium on Collaborative Technologies and Systems, IEEE, 2009.
[44] Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008.
[45] Markus Goldstein, Andreas Dengel, Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm, in: KI-2012: Poster and Demo Track, 2012, pp. 59–63.
[46] Charu C. Aggarwal, Outlier Analysis, Data Mining, Springer, Cham, 2015.
[47] Yogesh Patel, et al., Keystroke dynamics using auto encoders, in: 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), IEEE, 2019.
[48] Fabian Pedregosa, et al., Scikit-learn: Machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[49] Alexander Kim, XMeans, https://github.com/alexkimxyz/XMeans/blob/master/xmeans.py.
[50] Yue Zhao, Zain Nasrullah, Zheng Li, PyOD: A Python toolbox for scalable outlier detection, 2019, arXiv preprint arXiv:1901.01588.
