
Improving the Security of the Android Pattern Lock using Biometrics and Machine Learning

Jacob Nilsson

Civilingenjör, Teknisk fysik och elektroteknik
2017

Luleå tekniska universitet
Institutionen för system- och rymdteknik

LULEÅ TEKNISKA UNIVERSITET
MASTER'S THESIS

Improving the Security of the Android Pattern Lock using Biometrics and Machine Learning

Jacob Nilsson

supervised by
Philip Lindblad

August 30, 2017


ABSTRACT

With the increased use of Android smartphones, the Android Pattern Lock graphical password
has become commonplace. The Android Pattern Lock is advantageous in that it is easier to
remember and is more complex than a five digit numeric code. However, it is susceptible to a
number of attacks, both direct and indirect. This fact shows that the Android Pattern Lock by
itself is not enough to protect personal devices. Other means of protection are needed as well.
In this thesis I have investigated five methods for the analysis of biometric data as an unnoticeable second verification step of the Android Pattern Lock. The methods investigated are
the euclidean barycentric anomaly detector, the dynamic time warping barycentric anomaly
detector, a one-class support vector machine, the local outlier factor anomaly detector and a
normal distribution based anomaly detector. The models were trained using an online training
strategy to enable adaptation to changes in the user input behaviour. The model hyperparam-
eters were fitted using a data set with 85 users. The models were then tested with other data sets to illustrate how different phone models and patterns affect the results.
The euclidean barycentric anomaly detector and dynamic time warping (DTW) barycentric
anomaly detector have a sub 10 % equal error rate in both mean and median, while the other
three methods have an equal error rate between 15 % and 20 % in mean and median. The
higher performance of the euclidean and DTW barycentric anomaly detectors is likely because they account for the time series nature of the data, while the other methods do not. Each user in the data set has provided each pattern at most 50 times, meaning that the long-term effects of user adaptation could not be studied.

PREFACE

The work in this thesis has been carried out at BehavioSec. I would like to thank them
for giving me the opportunity to do this. It has given me a lot of insight into areas that I never thought I would go near. The work has been interesting and I have had a blast
spending this short time with them. A special thanks goes to my supervisor Philip
Lindblad for presenting interesting ideas and guiding me in the right direction. Next
I would like to thank my examiner at LTU, Fredrik Sandin, for helping me when I needed it.
Lastly I would like to acknowledge the support that my parents have given me
over the last five years, and also the friends I’ve made here. Without you, I don’t think
I would have coped with studying at this level.

Jacob Nilsson
Luleå, 7 June, 2017

LIST OF FIGURES

1.1 Example of Android lock pattern
2.1 Visualization of an alignment of time series
2.2 Dynamic time warping algorithm visually explained
2.3 Sakoe-Chiba band visualization
2.4 Dynamic time warping averaging problems
2.5 Visualization of dynamic time warping barycentric averaging
2.6 Example of a one-class support vector machine
2.7 Illustration of barycentric anomaly detection
2.8 Illustration of local outlier factor anomaly detection
2.9 Receiver operating characteristic curve example
3.1 The different patterns used in this thesis
3.2 Demonstration of the unevenness of the sampling
4.1 Training regime illustrated
5.1 Time-variability of user input behaviour
5.2 Receiver operating characteristic curves for dynamic time warping barycentric anomaly detection
5.3 Magnified receiver operating characteristic curves for dynamic time warping barycentric anomaly detection
5.4 Multiple phone results
5.5 Multiple phone results with changed hyperparameters
5.6 Multiple pattern results
5.7 Mean results for all five patterns, Huawei P9
5.8 Mean results for all five patterns, S7 only
5.9 Mean results for all five patterns, S7 Edge only

LIST OF TABLES

4.1 Methods considered for testing with the corresponding hyperparameters
5.1 Training equal error rate
5.2 Analysis of sensors

SYMBOLS AND NOTATION

Notation        Meaning
s_{t_i}, s_i    Element i of time series s
D_{p,q}         Distance matrix between time series p and q
E_{p,q}         Accumulated distance matrix between time series p and q
α               Threshold parameter for distance-based anomaly detectors
‖·‖_p, ‖·‖      L_p-norm
O               Big-O notation of time complexity
min f(·)        The minimum value of function f
argmin_a f(a)   The argument a that minimizes function f(·)
X               Training set
x*              Test sample
N(x)            The neighborhood of x
L(f)            Log-likelihood of f

APL Android Pattern Lock


DTW Dynamic Time Warping
MD-DTW Multi-Dimensional Dynamic Time Warping
OCSVM One-Class Support Vector Machine
BA Barycentric Anomaly detection
EBA Euclidean barycentric Anomaly detection
DBA DTW-barycentric Anomaly detection
LOF Local Outlier Factor
ND Normal Distribution based anomaly detector
ROC Receiver Operating Characteristic
TPR True Positive Rate
FPR False Positive Rate
JSON JavaScript Object Notation
PCA Principal Component Analysis

CONTENTS

CHAPTER 1 – Introduction
1.1 Background
1.1.1 Android pattern lock
1.1.2 Behavioural biometrics
1.2 Problem description and limitations
1.2.1 Limitations
1.3 Related Work
1.3.1 Research on touch screen usage
1.3.2 Biometrics on android pattern lock
1.3.3 Side-channel attacks using other sensors
1.4 Outline

CHAPTER 2 – Theory
2.1 Elastic and non-elastic measures of time series
2.1.1 Dynamic time warping (DTW)
2.1.2 DTW barycentric averaging
2.2 One-class support vector machines
2.3 Distance-based anomaly detection
2.3.1 Barycentric anomaly detector
2.3.2 Local outlier factor
2.3.3 Normal density estimation
2.4 Receiver operating characteristic

CHAPTER 3 – Data
3.1 Data Collection
3.1.1 Collection application
3.1.2 Outsourced data set
3.1.3 Inhouse data set #1
3.1.4 Inhouse data set #2
3.2 Pre-processing of data
3.2.1 Uneven sampling of raw data
3.2.2 Resampling of raw data
3.2.3 Feature extraction
3.3 Removal of bad users
3.3.1 Phone on table
3.3.2 Sensor malfunction

CHAPTER 4 – Methods
4.1 Convergence of input behaviour
4.1.1 Input time
4.1.2 DTW-similarity
4.2 Machine learning methods
4.3 Learning algorithm
4.4 Division of data into training and test sets
4.4.1 Outsourced data set used as the training set
4.4.2 Inhouse data sets used as test sets

CHAPTER 5 – Results and analysis
5.1 Convergence of user inputs
5.2 Outsourced data set
5.3 Tests with multiple phones
5.4 Multiple patterns

CHAPTER 6 – Discussion and conclusions
6.1 Discussion
6.1.1 User input behaviour
6.1.2 Different phones and patterns
6.1.3 Method comparison
6.1.4 Identification power with different sensors
6.2 Future work
6.2.1 Improving the models
6.2.2 Better studies
6.3 Conclusions

References

CHAPTER 1

INTRODUCTION

1.1 Background
The mobile phone has become a tool which most of us cannot do without. It is used for communication, entertainment, data storage and more. As the usage of mobile phones has increased, we have grown more comfortable with storing sensitive data on them.
With increasing amounts of data, the need to keep out malicious users increases as
well. Traditionally, a PIN code has been used to log into a mobile phone. The PIN code
works well for phones with a physical keyboard, but with the advent of smartphones,
we now have the possibility of using graphical passwords.
A graphical password relies on visual cues for security. Graphical passwords can be as simple as identifying pictures, but there are also more complicated schemes, such as the Draw-A-Secret scheme by Jermyn et al. [1]. The Draw-A-Secret scheme has the user draw a pattern on a screen with an N × N grid. When the pattern is drawn, the order in which the
pattern goes into the cells of the grid is stored as the password. Dunphy et al. [2]
improved the Draw-A-Secret scheme by introducing background images. The back-
ground images made users choose more complex passwords. Another similar idea
was introduced by Tao [3]. Instead of drawing free hand on a screen, in Tao’s Pass-
Go scheme, the user connects intersections of a grid on the screen. Google's Android Pattern Lock is based on Tao's idea, but uses a smaller 3 × 3 grid.
The reason graphical passwords are preferable to PIN codes is the pictorial superiority effect: we learn to remember graphical patterns more
easily than strings of numbers and letters [4, 5].
Like all other passwords, there are ways for malicious users to get a hold of graphi-
cal passwords and access a phone without permission. Graphical passwords are more
susceptible to shoulder-surf attacks (where an attacker looks at your device while you
input your password) than PIN codes due to the pictorial superiority effect. Ye et al.
[6] used recordings of password inputs to determine which pattern was entered. From the
footage, they could recover the basic motion of the fingers on the screen. Then, they


could find some candidates for that pattern. Using those candidates, they could guess
the password with 95 % accuracy on the first five attempts. Also, their method of attack
was better at finding large patterns, as a more complex motion reduced the number of
candidate patterns. The footage they used was captured with all kinds of cameras in-
cluding smartphone cameras. The angles also varied and could be so extreme that the
screen was not visible. Another vector of attack is the smudge attack. When using
a touch-screen, your fingers leave oily residues called smudges. If an attacker gets
hold of your phone, they can use these smudges to deduce the pattern. Aviv et al. [7]
tested how strong these kinds of attacks can be. They showed that a picture of the
smudges can be enough to figure out the password. They also tested the resilience of the smudges and found that they can persist even after the screen has been used for a while. The
only way to make sure that a smudge attack could not be used was to wipe the screen
clean.
Due to these security issues, extra verification is necessary. One way to impose extra security is to use behavioural biometrics, the discipline of measuring individuals' behaviour. One way this is used is to identify
users with keyboard dynamics [8]. If such biometric information can be used to verify
the correct user during log in, a second verification step can be implemented invisibly.
As users spend over a minute a day logging in to their devices [9], there is a need for
this verification to be fast and non-intrusive.

1.1.1 Android pattern lock


One of the most commonly used graphical passwords is the Android Pattern Lock
(APL). It is one of the standard unlock methods on Android devices. The APL consists
of a 3 × 3 dot matrix where the task is to draw lines between the dots. The password
is then stored as the order in which the dots were connected. The rules for deciding whether a pattern is valid are:

• The lines must go through at least 4 dots (ensuring at least one change of direction).
• A dot can only be used once.
• If a line passes through a previously unused dot, that dot will be used.

There is a total of 389 112 possible patterns, which is more than the number of PIN codes of length 5 [7]. Figure 1.1 presents an example of a valid APL password.
Despite the fact that there are more possible APL patterns than there are PIN codes of length 5, the APL is not necessarily more secure. In a large study, Uellenbeck et al. [10] tested how people use the APL. First, participants answered where their personal APL pattern started, and then they got to input a pattern that they thought was secure.

Figure 1.1: Android Lock Pattern example. The sequence of this password is 1-5-7-4-2-8-9-6-3.

The study revealed that 38 % of users started their pattern in the top left corner, and that the corners started 75 % of all patterns. Using the inputted data, they estimated that real patterns have an entropy slightly lower than that of a 3 digit PIN code. They conclude that even though the APL is in theory safer than a 5 digit PIN, in practice it is not.

1.1.2 Behavioural biometrics


Behavioural biometrics are measurements of how we do things. This ranges from very
simple things like the fact that I use my phone with my right hand to very complicated
things like keystroke dynamics (the way a person types on a keyboard). Biometrics can
include features that change over time. Some reasons for these changes in behaviour are injuries, the development of muscle memory or external factors. The development of muscle memory implies that a behaviour might change a lot when starting on a new task [11], like inputting an APL pattern. These rapid changes imply that a method for recognizing user input behaviour needs to adapt to such changes.

1.2 Problem description and limitations


The goal of this thesis is to develop an algorithm for two-stage verification with APL-
like graphical passwords. The first verification stage should be the APL-password and
the second stage should use the sensor data from commonly available smartphones to
verify the identity of the user. The project can be divided into three parts focusing on
the following questions:

• To what extent can password patterns be verified with machine learning methods?

• Which of the typically available sensors are useful for that purpose?

• How many times must a typical user input a pattern so that the pattern data is
consistent enough to be used with machine learning methods?

1.2.1 Limitations

The work in this thesis focuses on finding methods that can recognize the unique way
a user uses their device and detect when malicious users try to enter the device. An
Android phone contains a lot of different sensors. Here I focus on the XY-position of
fingers on the screen, the accelerometer and the gyroscope data. Other sensors such as
area touched and finger pressure do not work the same on all phones and therefore I
do not consider them. The goal of this thesis is to develop the methods for two-step
verification. However, creating an application using this method is beyond the scope
of this thesis.

1.3 Related Work

1.3.1 Research on touch screen usage

Due to the ubiquity of smart phones there is a lot of research on how we use touch
screens. Weir et al. [12] used gaussian process regression to improve the accuracy of
button presses on a Nokia N9. The regression model was trained on each test user to
provide individual improvements. Buschek et al. [13] go further with this idea and use screen data with features akin to keystroke dynamics [8] to provide an extra verification step for normal passwords on smartphones. They also provided a framework for inputting the password with different hand postures. The same group has also researched how the target shape and size affect touch screen accuracy [14]. An interesting contribution was the construction of an index of individuality: they used a gaussian regression model to classify users and calculated the decrease in entropy, which they used as an index of individuality.
just a kind of graphical password. Li [15] created a free-form gesture recognizer called
Protractor, which uses screen XY-positions and downsamples this information to only
16 samples. Despite the heavy downsampling, Li gets results comparable to the more
computationally expensive $1 free-form gesture recognizer [16].

1.3.2 Biometrics on android pattern lock


Work focusing exclusively on the APL include the works by Angulo and Wästlund [17]
and de Luca et al. [18]. They both explore ways to create a second layer of verification
and use two different approaches. Angulo and Wästlund used the time spent on dots and the time between dots as features, an approach similar to keystroke dynamics. They used these features in a random forest algorithm and achieved an equal error rate of around
a dynamic time warping based algorithm to classify the users. They achieved an equal
error rate of 8 %.
I have not found published work trying to combine the screen XY-data with the
accelerometer and gyroscope data.

1.3.3 Side-channel attacks using other sensors


Keyloggers using the screen data are not possible on Android devices. This is because
the OS does not allow applications that are not visible on screen to use the screen data.
That means that an attacker would need to use the other sensors of the phone to find the keys touched. This is a so-called side-channel attack. Aviv et al. [19] used the accelerometer to find which key was pressed when inputting a PIN code. Cai et al. [20]
used the gyroscope sensor for the same purpose. These articles show that there is
definitely some information about keystrokes in both the accelerometer and gyroscope
data.

1.4 Outline
This thesis is outlined as follows: Chapter 2 contains the theory for the methods used in
this thesis. Chapter 3 describes the data and the pre-processing steps taken. Chapter 4
describes the methods used for analysis and testing. Chapter 5 contains the results of
the analysis and testing. Lastly, Chapter 6 contains the discussion, my thoughts on
future work and my conclusions.
CHAPTER 2

THEORY

In this chapter I introduce the theory required to understand the methods described in
chapter 4. Section 2.1 describes what time series are, the difference between non-elastic
and elastic measures and the Dynamic Time Warping algorithm in detail. Section 2.2
describes the one-class support vector machine. Section 2.3 describes three different
distance-based anomaly detectors, the barycentric anomaly detector, the local outlier
factor and the normal density estimator. Lastly, Section 2.4 covers the Receiver Operating Characteristic, a standard anomaly detection performance measure.

2.1 Elastic and non-elastic measures of time series


A time series is a set of elements s = [st1 , st2 , . . . , stn ] where ti denotes the time at which
element sti was taken. There are different ways to measure the difference between
two time series and most of them can be put into two groups: non-elastic (lock-step)
measures and elastic measures [21]. Further on, si will be used as a shorthand for sti .
Non-elastic measures compare two time series q and s by measuring the distance
at each time instance ti . One way to do this is to choose a norm k · k and add together
the norm of the differences at each time step ti

n
D (q, s) = ∑ k qi − si k . (2.1)
i =1

The norm in this case can be the euclidian (p = 2), manhattan (p = 1) or any k ·
k p norm. In practical tests for time series classification using a 1-nearest neighbour
algorithm, the euclidean norm performed better than all other non-elastic measures
[21]. Non-elastic measures work well when every time series is sampled in the exact
same way, but if the sample timings are inconsistent the time series compared are no
longer the same. This could decrease the classification performance. Also, non-elastic
measures require that each signal has the same number of samples, which need not
be the case.


To compare time series with timing inconsistencies or different lengths, elastic mea-
sures have been developed. Elastic measures try to find the samples in a time series
that best fit each other and then measures the distances between them. Examples
of such algorithms are Dynamic Time Warping (DTW) [22], Edit Distance with Real Penalty [23] and more [21]. Dynamic time warping was one of the earliest conceived elastic measures, and despite many attempts to create better algorithms, Wang et al. have shown that dynamic time warping is in most cases as good as or better than more recent algorithms [21].

2.1.1 Dynamic Time Warping


Dynamic time warping is an elastic measurement of similarity for time series intro-
duced by Sakoe and Chiba in 1978 [22]. The algorithm tries to find the optimal pairing
of two series s and q. A pairing can be denoted by two vectors (π1 , π2 ), where π1 con-
tains indices for s and π2 contains indices for q. Each πi must contain all indices for its
corresponding time series. The indices have to be ordered sequentially and duplicates
are allowed:

1 = \pi_1 \le \pi_i \le \pi_{i+1} \le \pi_N = N.   (2.2)

If s is a series of length 4 and q is a series of length 5 one such alignment is π1 =


(1, 1, 2, 3, 3, 4) and π2 = (1, 2, 2, 3, 4, 5). This alignment can be interpreted visually in a
n × m matrix where element (i, j) represents the alignment of si and q j , see Figure 2.1.
The optimal alignment is defined to be the alignment (\pi_1^*, \pi_2^*) that minimizes the alignment distance. The alignment distance between s_i and q_j is the euclidean distance between the points: D_{i,j} = |s_i - q_j|. The matrix D is called the distance matrix.

Figure 2.1: Visual representation of the alignment of two time series. The aligned elements are (1,1), (1,2), (2,2), (3,3), (3,4) and (4,5). (A) The matrix representation of the alignment. (B) The alignment between the series, shown by thick black lines.

Algorithm 1 Dynamic Time Warping
Require: Cost matrix D_{i,j}
1: Accumulated cost matrix E_{1,1} = D_{1,1}
2: Fill the remaining elements according to

E_{1,j+1} = D_{1,j+1} + E_{1,j}
E_{i+1,1} = D_{i+1,1} + E_{i,1}
E_{i+1,j+1} = D_{i+1,j+1} + \min(E_{i,j}, E_{i+1,j}, E_{i,j+1})   (2.3)

3: return E_{n,m}
The DTW alignment problem can now be defined as

(\pi_1^*, \pi_2^*) = \operatorname*{argmin}_{\pi_1, \pi_2} \sum D_{s,q}(\pi_1, \pi_2),   (2.4)

and the DTW distance is defined [24] as

\mathrm{DTW}(s, q) = \sum D_{s,q}(\pi_1^*, \pi_2^*).   (2.5)

Even though I use the term "DTW-distance", it is not a distance metric in the normal sense, as it does not satisfy the triangle inequality [25, Section 4.1]. The algorithm used to calculate the DTW distance is outlined as follows: first calculate the distance matrix D(i, j), then create an accumulated distance matrix E(i, j) and set its elements according to Algorithm 1. A visual interpretation of this process is shown in Figure 2.2.

Figure 2.2: Visual explanation of how the DTW algorithm accumulates distance (Equation (2.3)). For each pairing (i + 1, j + 1) it adds the distance from the distance matrix and the smallest accumulated distance from the left, lower left or lower pairing. In this case, the lower left element has the smallest accumulated value and is added.
Since the DTW compares each element of q with each element of s, it has a computational complexity of O(mn). This is worse than the euclidean distance, whose complexity increases linearly with the size of the time series. To decrease the computational complexity, the Sakoe-Chiba band was introduced. The Sakoe-Chiba band restricts which elements are considered during the calculation: only elements at most T steps away from the diagonal are used (Figure 2.3). Since each element of the time series needs to have an alignment, the minimum width of the Sakoe-Chiba band is |m − n|. If this criterion is not met, some elements of the longer time series will not be aligned. The Sakoe-Chiba band decreases the complexity to linear time, O((T + 1) · min(n, m)), and can also help with classifier performance [26]. If T is zero, only the diagonal is allowed and the DTW distance reduces to the euclidean distance.
In the case of multi-dimensional time series, the basic DTW algorithm described above will not work, since it only takes one-dimensional time series into account. The multi-dimensional DTW (MD-DTW) [27, Chapter 4] algorithm works in the same way in every regard apart from how the distance matrix D is calculated. Each element of D is the sum of differences over the dimensions of the time series:

D_{i,j}(s, q) = \sum_{k=1}^{K} |s_i^k - q_j^k|,   (2.6)

where s^k is the k-th dimension of time series s.
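To make the procedure concrete, the following sketch implements the accumulation rule of Algorithm 1 together with the MD-DTW distance matrix of Equation (2.6) and an optional Sakoe-Chiba band. It is my illustration in Python, not code from the thesis; the function name and the band handling are my own choices.

    import numpy as np

    def md_dtw(s, q, T=None):
        """MD-DTW distance between series s (n x K) and q (m x K).
        T is the Sakoe-Chiba band width; T=None disables the band."""
        s, q = np.asarray(s, float), np.asarray(q, float)
        if s.ndim == 1:
            s = s[:, None]
        if q.ndim == 1:
            q = q[:, None]
        n, m = len(s), len(q)
        # Equation (2.6): sum of absolute differences over the K dimensions.
        D = np.abs(s[:, None, :] - q[None, :, :]).sum(axis=2)
        E = np.full((n, m), np.inf)
        E[0, 0] = D[0, 0]
        for i in range(n):
            for j in range(m):
                if (i, j) == (0, 0) or (T is not None and abs(i - j) > T):
                    continue  # starting cell, or outside the Sakoe-Chiba band
                prev = min(E[i - 1, j] if i > 0 else np.inf,
                           E[i, j - 1] if j > 0 else np.inf,
                           E[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
                E[i, j] = D[i, j] + prev  # accumulation rule of Algorithm 1
        return E[n - 1, m - 1]

A band width T smaller than |m − n| leaves the end cell unreachable (the returned value is then infinite), mirroring the minimum-width requirement discussed above.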

2.1.2 DTW barycentric averaging


Figure 2.3: Sakoe-Chiba band visualization for T = 1. The gray elements will not be considered when performing DTW calculations.

Averaging a set of time series using DTW is different from averaging them using euclidean distance. When taking the euclidean average of s and q, both time series must

be of equal length. Also, euclidean averaging only has one possible alignment, (s_i, q_i) for all i. DTW, on the other hand, lets you measure the distance between time series of different lengths. This implies that an average sequence c can have a length different from any time series in the set it is averaging. Also, the optimal alignment between time series in DTW is not known before calculating the distance, see Figure 2.4. One approach for averaging time series using DTW is DTW barycentric averaging, introduced

Figure 2.4: The problems with averaging time series with DTW. (A) An average sequence with respect to DTW-distance (c1 and c2 for example) can have arbitrary length, as the DTW-distance is an elastic measure. (B) All possible alignments between element ci and the elements of s and q. It is not known beforehand which alignments are part of the optimal alignment.

Figure 2.5: Dynamic time warping barycentric averaging algorithm. (A) Initiate c. (B) Find the optimal alignments of c with all time series. (C) Set c′_i as the average of all elements c_i aligns to. (D) Replace c with c′ and repeat steps 2 and 3 until the end conditions are met.

by Petitjean et al. [28].
The DTW barycentric averaging method is an iterative method, outlined as follows: First an average time series c is chosen. It can be initialized to any length and each element can be chosen arbitrarily, see Figure 2.5 A. Next, calculate the optimal alignment between the average sequence c and the set of time series (s and q in Figure 2.5). Then, create a new time series c′ of the same length as c. For each element i in c and c′, set c′_i to the average of all elements that c_i aligns to (s_1 and q_2 for element c_2 in Figure 2.5 C). Time series c′ is now the DTW barycentric average of s and q. Set c′ as c and repeat the process for either a set number of iterations or until the difference between c and c′ is sufficiently small.
In this work I have expanded this algorithm to multi-dimensional time series. The algorithm is the same as regular DTW barycentric averaging, but it uses the MD-DTW algorithm instead of the DTW algorithm to calculate the optimal alignment.
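As an illustration of the iteration above, here is a minimal one-dimensional sketch of DTW barycentric averaging in Python. It is my own rendering of the procedure of Petitjean et al. [28], not the thesis implementation; the thesis version uses MD-DTW alignments, and the function names are mine.

    import numpy as np

    def dtw_path(s, q):
        """Optimal DTW alignment (Equation (2.4)) between 1-D arrays s and q."""
        n, m = len(s), len(q)
        D = np.abs(s[:, None] - q[None, :])      # distance matrix
        E = np.full((n + 1, m + 1), np.inf)      # padded accumulated matrix
        E[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                E[i, j] = D[i - 1, j - 1] + min(E[i - 1, j], E[i, j - 1], E[i - 1, j - 1])
        path, i, j = [], n, m                    # backtrack to recover the pairing
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([E[i - 1, j - 1], E[i - 1, j], E[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    def dba_average(series, n_iter=4):
        """DTW barycentric average of a list of 1-D series."""
        c = np.array(series[0], dtype=float)     # initialize c (here: the first series)
        for _ in range(n_iter):
            buckets = [[] for _ in c]            # elements each c_i aligns to
            for s in series:
                for i, j in dtw_path(c, np.asarray(s, float)):
                    buckets[i].append(s[j])
            c = np.array([np.mean(b) for b in buckets])  # c'_i = average of bucket i
        return c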

2.2 One-Class support vector machines


The one-class support vector machine (OCSVM) can be used to address anomaly detection problems. It finds the sphere with center a and radius R that encloses as much of a data set as possible. A test sample is said to be anomalous if it falls outside of this sphere. The corresponding optimization problem is

\min \; R^2 + C \sum_t \xi^t   (2.7)

\text{s.t.} \; \|x^t - a\|^2 \le R^2 + \xi^t \;\text{ and }\; \xi^t \ge 0, \; \forall t,   (2.8)

Figure 2.6: Example of a one-class SVM with a gaussian kernel. The red region predicts anomalies.

where ξ^t is called the slack variable for the t-th training sample and C is a regularization factor. The slack variables implement the "enclose as much as possible" part of the definition. This problem can be cast as a convex optimization problem by maximizing the Lagrangian L_d:

\max \; L_d = \sum_t \alpha^t (x^t)^T x^t - \sum_t \sum_s \alpha^t \alpha^s (x^t)^T x^s   (2.9)

\text{s.t.} \; 0 \le \alpha^t \le C \;\text{ and }\; \sum_t \alpha^t = 1.   (2.10)

Here, α^t is a Lagrange multiplier for the t-th training sample [29, Section 13.12].
In Equation (2.9), we can see that the optimization problem does not rely on individual training points, but rather on the inner products between training samples. The inner product is part of a class of functions called positive-definite kernels, often just called kernels. A kernel is a function K : R^n × R^n → R that is symmetric, K(x, y) = K(y, x), and positive semi-definite. For one-class support vector machines, the kernel can be seen as a coordinate transform. It is used by simply exchanging the inner product in Equation (2.9) for the kernel:

\max \; L_d = \sum_t \alpha^t K(x^t, x^t) - \sum_t \sum_s \alpha^t \alpha^s K(x^t, x^s).   (2.11)

This allows the OCSVM to use arbitrary smooth shapes (depending on which kernel is used), instead of only spheres.
A commonly used kernel is the gaussian kernel

K(x, y) = \exp\left( -\frac{\|x - y\|^2}{\gamma^2} \right),   (2.12)

where γ is the length-scale of the kernel. This kernel will give higher values for points
that are closer together, meaning it emphasizes the smoothness of the data [30, Section
4.2]. An example of the decision regions of an OCSVM using gaussian kernels can be
found in Figure 2.6.
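For reference, here is a minimal example of a gaussian-kernel one-class SVM using scikit-learn's OneClassSVM on toy data. Note that scikit-learn implements the ν-parameterized OCSVM rather than the C-regularized sphere formulation above, and its gamma parameter corresponds to 1/γ² in Equation (2.12); the values ν = 0.1, γ = 0.25 echo the hyperparameters later chosen in training (Table 5.1), but the data here is synthetic.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 2))                     # "genuine user" samples
    X_test = np.vstack([rng.normal(size=(5, 2)),            # genuine-looking samples
                        rng.normal(loc=5.0, size=(5, 2))])  # obvious anomalies

    # nu bounds the fraction of training samples treated as outliers;
    # gamma is the inverse squared length-scale of the RBF kernel.
    ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.25)
    ocsvm.fit(X_train)
    print(ocsvm.predict(X_test))  # +1 = accepted as the user, -1 = anomaly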

2.3 Distance-based anomaly detection


Distance-based anomaly detectors work on the following principle: Given a test sample x* and a training set X,

f(x^*, \mathcal{X}) > \alpha \; \Rightarrow \; x^* \text{ is an anomaly.}   (2.13)

Here f is any function of the test sample and the training set with codomain [0, ∞), and α is the discriminating distance, or threshold.

2.3.1 Barycentric Anomaly Detector

Let m_d(X) be the mean of X with respect to some distance d(x, y). The Barycentric Anomaly Detector is then defined as

\mathrm{BA}(x^*, \mathcal{X}) = |x^* - m_d(\mathcal{X})|.   (2.14)

Examples of distances that can be used are the euclidean distance and the (MD-)DTW distance [28]. When used with those distances, the corresponding anomaly detectors are called euclidean BA and (MD-)DBA.
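A sketch of the euclidean variant in Python, to show how little machinery Equation (2.14) needs; the function name and the use of the arithmetic mean as m_d are my own choices. For DBA, the mean would be replaced by the DTW barycentric average of Section 2.1.2 and the norm by the MD-DTW distance.

    import numpy as np

    def euclidean_ba(x_star, X, alpha):
        """Flag x_star as an anomaly when its distance to the barycenter
        (here the arithmetic mean) of training set X exceeds alpha (Eq. 2.14)."""
        barycenter = np.asarray(X).mean(axis=0)   # m_d(X) for the euclidean distance
        return np.linalg.norm(np.asarray(x_star) - barycenter) > alpha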

2.3.2 Local outlier factor

The Local Outlier Factor (LOF) compares the density around a test sample to the density around its k nearest neighbors. It is formally defined as

\mathrm{LOF}(x^*, \mathcal{X}) = \frac{d_k(x^*)}{\sum_{s \in N(x^*)} d_k(s) / |N(x^*)|},   (2.15)

where d_k(y) is the euclidean distance from y to its k-th nearest neighbour and N(x*) is the set of the test sample's k nearest neighbors. Intuitively, the LOF tests whether the distances from the test sample to its neighbors are similar to the distances among the neighbors themselves. If the LOF is close to 1, the test sample is probably in the same cluster as its neighbors, but if it is larger than 1 it probably is not [29, Section 8.7].
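The following Python sketch computes the score of Equation (2.15) directly (this simplified LOF differs slightly from the original formulation by Breunig et al.); the naming and the brute-force neighbor search are mine.

    import numpy as np

    def lof_score(x_star, X, k):
        """LOF of Equation (2.15): the k-NN distance of the test sample divided
        by the mean k-NN distance of its k nearest training samples."""
        X = np.asarray(X, float)
        dists = np.linalg.norm(X - np.asarray(x_star, float), axis=1)
        neighbors = np.argsort(dists)[:k]    # N(x*), taken from the training set
        d_k_star = dists[neighbors[-1]]      # distance to the k-th neighbour

        def d_k(idx):
            d = np.linalg.norm(X - X[idx], axis=1)
            d[idx] = np.inf                  # a point is not its own neighbour
            return np.sort(d)[k - 1]

        return d_k_star / np.mean([d_k(i) for i in neighbors])

    # The input is flagged as an anomaly when lof_score(...) > alpha.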

Figure 2.7: Illustration of barycentric anomaly detection. The black dots are the training sam-
ples and the cross is their average. The green disc and red triangle are test samples. All test
samples outside the threshold (gray dashed circle) are labeled anomalies.

Figure 2.8: Illustration of local outlier factor anomaly detection. The neighborhood of the square, the green circle, is approximately the same size as the neighborhoods of the training set, the black circles. The neighborhood of the triangle, the red dashed circle, is much larger than the neighborhoods of the training set. Thus the square is not an anomaly while the triangle is.

2.3.3 Normal density estimation

With normal density estimation, each feature of the data is modeled as a normally distributed and independent stochastic variable ξ. The density function is

f_\xi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),   (2.16)

where µ is the mean of the training data and σ is the standard deviation of the training data. For one-dimensional data, a test sample is said to be an anomaly when f_ξ(x*) < β. To save the step of calculating the exponential function, the log-likelihood can be used instead. The log-likelihood is the natural logarithm of the density function,

L(x, \mathcal{X}) = \log(f(x, \mathcal{X})) = -\frac{(x - \mu)^2}{2\sigma^2} - \frac{1}{2} \log 2\pi\sigma^2.   (2.17)
The last term of the sum is constant with respect to x. This means that it can be discarded when comparing different test values, and negating what remains gives the normal distribution detector

\mathrm{ND}(x^*, \mathcal{X}) = \frac{(x^* - \mu)^2}{2\sigma^2},   (2.18)

where x* is said to be anomalous if ND(x*) > α. This method is distance based, as it measures the squared distance from the mean scaled by the variance.
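A per-feature sketch of Equation (2.18) in Python, combined with the fraction-of-features voting rule that Chapter 4 uses for this detector; the 1/4 default below is the value quoted there, everything else (names, vectorization) is my own.

    import numpy as np

    def nd_scores(x_star, X):
        """Per-feature scores of Equation (2.18); larger means less likely."""
        X = np.asarray(X, float)
        mu, sigma2 = X.mean(axis=0), X.var(axis=0)
        return (np.asarray(x_star) - mu) ** 2 / (2.0 * sigma2)

    def nd_predict(x_star, X, alpha, frac=0.25):
        """Flag the input as anomalous when more than `frac` of the features
        exceed the threshold alpha (the voting rule of Section 4.2)."""
        return np.mean(nd_scores(x_star, X) > alpha) > frac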

2.4 Receiver operating characteristic


The receiver operating characteristic (ROC) is a standard method used for determining the performance of anomaly detectors. An ROC-point characterizes a detector by the pair

\mathrm{ROC} = (\mathrm{FPR}, \mathrm{TPR}),   (2.19)

where TPR is the true positive rate and FPR is the false positive rate. The strength of the ROC for measuring anomaly detector performance shows when changing the hyperparameters of the model.
If the ROC-point is saved each time a model is trained with different hyperparameters, these values define an ROC-plot. Figure 2.9 shows an example of an ROC-plot. Each black dot represents the performance of a classifier with different hyperparameter values. The dotted line from (0, 0) to (1, 1) represents the randomness line. If a classifier lies close to this line, the true positive rate is equal to the false positive rate and the anomaly detector is no better than random. If it lies above the randomness line, the performance is better than random guessing. If it lies below the randomness line, the performance is worse than random guessing, but the method can be inverted to move it above the line.
Figure 2.9: Example of a ROC-curve, with true positive rate on the vertical axis and false positive rate on the horizontal axis. The upper green region represents better than random performance. The red region means worse than random performance. The dashed gray diagonal line is the EER line.

The dashed line from (1, 0) to (0, 1) is the line of equal error rate (EER), where the risk of misclassifying genuine and anomalous data is equal [31].
In training, the objective has been to find the hyperparameters that produce an ROC-point as close as possible to the EER-line. In testing, these hyperparameters are used.
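As a small utility sketch, this is one way to estimate the EER from a set of ROC-points in Python (nearest point to the EER line rather than a true interpolation; the function name is mine):

    import numpy as np

    def equal_error_rate(fpr, tpr):
        """Approximate EER: the point where FPR equals the false negative
        rate 1 - TPR, taken as the nearest available ROC-point."""
        fpr, fnr = np.asarray(fpr, float), 1.0 - np.asarray(tpr, float)
        idx = int(np.argmin(np.abs(fnr - fpr)))   # closest to the EER line
        return (fpr[idx] + fnr[idx]) / 2.0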
CHAPTER 3

DATA

This chapter describes the data used in this thesis and the steps taken to make it useful.
Section 3.1 describes the application used to collect the data and the different data sets
used. It also describes the patterns used. Section 3.2 describes the pre-processing steps
applied to make the data suitable for analysis with the machine learning methods used
in this thesis. Lastly, Section 3.3 describes the issues with some of the data and why it
was removed.

3.1 Data Collection

3.1.1 Collection application


The data used in this thesis was collected using an Android application created by
BehavioSec. When started, it presents the user with different APL patterns. The user must then input the pattern correctly. If the user inputs the displayed pattern incorrectly, the pattern will not change. When it is input correctly, the application randomly chooses a new pattern, until all patterns and codes have been presented a certain number of times.
The screen (XY-position, pressure and size), accelerometer and gyroscope data for
each correct input pattern was stored as a JSON-object. Each screen, accelerometer and
gyroscope sample also stored the UNIX-time at which they were measured. Due to
the irregularity in sample-timing, this information is crucial for measuring the speed
of finger movement. The user ID, device ID, dot positions and dot radii are also saved.
Figure 3.1 shows the different patterns and their corresponding names used in the
following parts of the thesis.

3.1.2 Outsourced data set


The first data set was collected in a study by BehavioSec where the app was distributed
to people across the globe. Only users with certain phones could participate and in


Figure 3.1: The different patterns used in this thesis: Z, Snake, Nunchuck, L and Spiral.

total 14 different phones were used. The users were asked to perform the patterns snake and spiral 50 times each, see Figure 3.1. In total, 101 users participated in this study. Compared to other studies, this is a high number of users [17, 18]. Also, a larger variety of phones is used. Overall, the advantage of this data set is that it includes numerous users. The large number of phones and patterns means it is best suited for training, because the variability of the data is high. The same variability makes it hard to test the effects of overfitting with respect to phone model and pattern.

3.1.3 Inhouse data set #1

A second data set was collected at BehavioSec. This data set is based on seven participants who input a single pattern 30 times each. This was done with seven different phones. This data set was collected to test the effects of the many different phones in the outsourced data set.

3.1.4 Inhouse data set #2

Lastly, a third data set was collected at BehavioSec. This data set is based on 27 participants who input all the patterns in Figure 3.1. Three different phones were used. Not all users input all patterns on all phones. This data set was collected to test how the choice of pattern affects the results.

3.2 Pre-processing of data


For each user, a profile was created that includes the user ID and dot positions. Three lists containing the screen, gyroscope and accelerometer data are also stored, together with the corresponding sampling times.

3.2.1 Uneven sampling of raw data


The collected data is not sampled evenly, as demonstrated in Figure 3.2. First, as is most clear from the XY-data, the sensors do not have a constant sample rate. This uneven sampling rate arises because the Android OS does not let third-party applications control how sensors are used. Instead, the applications have to ask the OS for sensor data and the OS will return the data within a few milliseconds. Also, the motion sensors (accelerometer and gyroscope) sample on average only a third as fast as the XY-sensors.
Out of these problems, the fact that the different sensors do not sample at the same rate is the most challenging. All methods that I have used require that the data from all sensors have the same time alignment. The following two subsections present ways to achieve that.

3.2.2 Resampling of raw data


The data was regularized with a resampling process. First, the XY-speed is calculated from the XY-position. Then all time series are normalized in time, i.e. the input time for all patterns is mapped to [0, 1]. By normalizing the time, only the shape of the inputs is taken into account. However, this leads to a loss of temporal information that might be interesting; by storing the XY-speed, I retain some of it. After normalizing the time, the amplitude of each time series is normalized by subtracting the mean and dividing by the standard deviation. What is left is 10 time series with zero mean and a standard deviation of one.

Figure 3.2: Demonstration of the unevenness of the sampling. Each dot corresponds to a sample from one of the three sensors. The X-axis is the time of sampling.


Furthermore, the time series are resampled to 25, 50, 75 and 100 evenly spaced samples using linear interpolation. The resampling ensures that the data in all dimensions have the same time alignment. The resulting data set is used for the analysis in the rest of this thesis. In most cases the resampling is a downsampling, meaning that this process also acts as a low-pass filter. Due to the unevenness in sample time, the process might downsample or upsample depending on the density of samples in that region of time. This also means that some information might be lost at this stage. Also, the data is still in time series form, so methods that cannot take temporal data into account might perform worse than those that can. The inspiration for this resampling scheme comes from the Protractor login method [15].
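A sketch of this pre-processing in Python for a single channel; the function names are mine, and np.interp provides the linear interpolation:

    import numpy as np

    def xy_speed(t, x, y):
        """Finger speed from the XY-positions, using the raw timestamps
        because the sampling is uneven."""
        return np.hypot(np.diff(x), np.diff(y)) / np.diff(t)

    def preprocess_channel(t, x, n_samples=50):
        """Normalize one sensor channel in time and amplitude, then resample
        to n_samples evenly spaced points by linear interpolation."""
        t, x = np.asarray(t, float), np.asarray(x, float)
        t = (t - t[0]) / (t[-1] - t[0])     # map the input duration onto [0, 1]
        x = (x - x.mean()) / x.std()        # zero mean, unit standard deviation
        return np.interp(np.linspace(0.0, 1.0, n_samples), t, x)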

3.2.3 Feature extraction


The second way I regularized the data was to extract features. First the XY-speed was calculated and the time series were normalized in time, as was done for the resampling. Then, for each input dimension, a total of 16 features were selected. These features were statistical measures (mean, variance, root mean square, maximum value, minimum value, skewness and kurtosis), third-degree polynomial approximation coefficients and the first five Fourier series coefficients. All in all, 160 features were selected. These features were normalized within each data set by subtracting the mean and dividing by the standard deviation. The inspiration for the selected features comes from [19].
The number of features is reduced using the PCA transform. When used for learning, the principal components explaining 95 % of the variance are used.
These features abstract away the fact that the data is originally in time-series form.
Thus, these features may not be good for discriminating between users and they may
be susceptible to noise in the data.
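A sketch of the per-channel feature extraction in Python; whether the thesis used Fourier magnitudes or raw complex coefficients is not stated, so magnitudes are assumed here, and the helper names are mine:

    import numpy as np
    from scipy.stats import kurtosis, skew
    from sklearn.decomposition import PCA

    def channel_features(x):
        """16 features for one normalized channel: 7 statistical measures,
        4 cubic-polynomial coefficients and the first 5 Fourier coefficients."""
        x = np.asarray(x, float)
        t = np.linspace(0.0, 1.0, len(x))
        stats = [x.mean(), x.var(), np.sqrt(np.mean(x ** 2)),
                 x.max(), x.min(), skew(x), kurtosis(x)]
        poly = np.polyfit(t, x, deg=3)             # 4 coefficients
        fourier = np.abs(np.fft.rfft(x))[:5]       # first 5 coefficients (magnitudes)
        return np.concatenate([stats, poly, fourier])

    # Stacking the 16 features of all 10 channels gives 160 features per input;
    # PCA then keeps the components explaining 95 % of the variance.
    pca = PCA(n_components=0.95)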

3.3 Removal of bad users


As the outsourced data set was not collected under controlled conditions, the quality of the measurements varies. This gives rise to two reasons to disqualify a user: their phone was on a table, or their sensors malfunctioned. After these users were removed, 85 users remain in the outsourced data set.

3.3.1 Phone on table


The first kind of user to be removed were those who had their phone on a table during inputs. As one of the purposes of this thesis is to use motion sensors for identification, the phone should not be placed on a table. The data collection application itself did not take this into account, so these users had to be sorted out after the data collection. First, I wrote a script identifying all users with a mean-square acceleration and rotation of less than a threshold value. The acceleration and rotation of these users were then plotted, and those I considered too flat were discarded.
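A sketch of the screening step in Python; the threshold values are assumptions, since the thesis does not state which values were used before the visual inspection:

    import numpy as np

    def likely_on_table(acc, gyro, acc_thresh=0.05, gyro_thresh=0.05):
        """Flag a user whose mean-square acceleration and rotation both fall
        below threshold values, suggesting the phone lay on a table.
        The thresholds here are placeholders, not values from the thesis."""
        return (np.mean(np.square(acc)) < acc_thresh and
                np.mean(np.square(gyro)) < gyro_thresh)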

3.3.2 Sensor malfunction


For some users, the accelerometer and gyroscope sensors malfunctioned during input
and gave no or very few samples. All users with fewer than five samples for any
sensor were also removed. This was done to make sure that the selected features could
be determined.
CHAPTER 4

METHODS

This chapter describes the methods used in this thesis, based on the theory presented
earlier. Section 4.1 presents the methods used to measure how the users' input behaviour converges over time. Section 4.2 presents the machine learning methods used
for identification. Section 4.3 presents the online learning strategy used to train the
methods and Section 4.4 presents how the different data sets were used to train the
methods.

4.1 Convergence of input behaviour


To test how fast the input behaviour of the users converges, two tests were performed on the outsourced data set. The first test addresses how the time needed to input a pattern decreases, and the second addresses the similarity of consecutive patterns using MD-DTW.

4.1.1 Input time


The input time for the 50 inputs of each pattern was calculated and normalized by dividing the input times by the duration of the first input. This ensures that the first input time is one for all users. Then, for all users and both patterns, these values are averaged for each input index. I then use these averages to determine whether the average input time decreases or increases. The idea is to investigate if the input time stabilizes to some value. If it does, it shows that the users' input behaviour stabilizes on average.

4.1.2 DTW-similarity
Using the DTW-distance it is possible to calculate the similarity of inputs. For each pattern, the pairwise DTW-distances between inputs n − 1, n, and n + 1 were calculated. Doing this for all n ∈ [2, 49] produces a vector of 48 values that shows the local similarity of inputs. As for the input times, these DTW similarities can be normalized with respect to the first similarity, and an average over all users and patterns can then be calculated. If the users' input behaviour converges on average, the similarity vector will consist of decreasing values.
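A sketch of this test in Python, assuming the md_dtw function sketched in Chapter 2 and a list of 50 pre-processed inputs per pattern:

    import numpy as np

    def local_dtw_similarity(inputs):
        """Mean pairwise DTW-distance among inputs n-1, n and n+1 for each n,
        normalized by the first value (48 values for 50 inputs)."""
        sims = []
        for n in range(1, len(inputs) - 1):
            triple = inputs[n - 1:n + 2]
            pair_dists = [md_dtw(a, b)
                          for i, a in enumerate(triple)
                          for b in triple[i + 1:]]
            sims.append(np.mean(pair_dists))
        sims = np.asarray(sims)
        return sims / sims[0]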

4.2 Machine learning methods

The machine learning methods chosen for this investigation can be found in Table 4.1. The methods can roughly be divided into two categories: those regarding the data as a time series (DBA and euclidean BA) and those that do not. The BA methods and the OCSVM are used with the resampled time series data. For the BA methods, each input time series is considered as a single input. For the OCSVM, each sample in time is considered as an input vector. This means that the OCSVM tests all time samples against each other as if they were taken at the same time. When testing, the OCSVM produces a prediction for each time sample in the time series; if at least half of them are predicted as anomalous, the input as a whole is said to be anomalous. The LOF and density methods are applied to the extracted features. The LOF produces a single prediction and the density method produces one prediction for each feature. The density method will predict an input as anomalous if more than 1/4 of the features are anomalous. This value was chosen through testing.

4.3 Learning algorithm

Typically in machine learning, an assumption that is made is that the samples are independently and identically distributed (iid). In my data this assumption can be made on the user level: users make their inputs independently of other users, and they aim to produce the same pattern. Each pattern input a user makes, however, depends on how many times the user has produced the pattern before, due to the development of muscle memory [11]. This means that the model should change over time, in order to adapt to the developing muscle memory. To do this, an online training protocol was used for each user, where the n latest patterns classified as correct were used to train the model. The first training set for each user was set as patterns m − n to m. A flowchart of this training protocol is found in Figure 4.1. The scores saved for each user were the number of true positives, false positives, false negatives and true negatives, from which the true positive rate, false negative rate and ROC-curve were determined. From the average ROC-curve, the equal error rate (EER) was calculated. The EER is the point where the false positive rate and false negative rate are equal.
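A sketch of this protocol in Python, for one user. The model interface (fit and is_anomaly) and the window handling are my own abstractions of Figure 4.1, which uses a window of n = 10 inputs:

    def online_evaluate(model, user_inputs, attacker_inputs, n=10):
        """Sliding-window online training: refit on the n latest inputs accepted
        as genuine, then score the next user input and the attacker inputs."""
        window = list(user_inputs[:n])       # initial training window
        tp = fp = tn = fn = 0                # positive = flagged as anomaly
        for x in user_inputs[n:]:
            model.fit(window)
            for a in attacker_inputs:
                if model.is_anomaly(a):
                    tp += 1                  # attacker correctly rejected
                else:
                    fn += 1                  # attacker wrongly accepted
            if model.is_anomaly(x):
                fp += 1                      # genuine input wrongly rejected
            else:
                tn += 1
                window = window[1:] + [x]    # accept: slide the window forward
        return tp, fp, tn, fn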

Method Hyperparameters
DBA τ, α, iter
Euclidean BA α
Gaussian OCSVM ν, γ
LOF k, α
Density α

Table 4.1: Methods considered for testing with the corresponding hyperparameters.

4.4 Division of data into training and test sets

4.4.1 Outsourced data set used as the training set


Due to the large number of users, the outsourced data set was chosen for training
the hyperparameters of the methods, see Table 4.1. To reduce training time, training
was only performed on the snake pattern. The hyperparameters were set using grid
search. For each tested combination of hyperparameters, the true positive rates and
false positive rates were calculated. From these rates the EER was estimated and the
hyperparameters that yielded ROC-points closest to the EER-line were selected.
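A sketch of the selection loop in Python. The evaluate callback, which should run the online protocol above over all users and return the mean (FPR, TPR), stands in for the training runs; the grid values in the usage example are placeholders, not the ones searched in the thesis:

    import itertools

    def grid_search_eer(evaluate, grid):
        """Pick the hyperparameter combination whose mean ROC-point lies
        closest to the EER line FPR = 1 - TPR."""
        best, best_gap = None, float("inf")
        for values in itertools.product(*grid.values()):
            params = dict(zip(grid.keys(), values))
            fpr, tpr = evaluate(params)
            gap = abs(fpr - (1.0 - tpr))     # distance from the EER line
            if gap < best_gap:
                best, best_gap = params, gap
        return best

    # Example: grid_search_eer(evaluate, {"alpha": [1.4, 1.6, 1.8], "k": [2, 4, 8]})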

4.4.2 Inhouse data sets used as test sets


As noted earlier in the thesis, the outsourced data set has only two patterns (of which I use only one) and a large variety of phones, meaning that the risk of overfitting the hyperparameters with respect to the pattern and phone is large. The inhouse data sets were collected to overcome this limitation. By testing the methods with the
Figure 4.1: Training regime illustrated (flowchart). First the training set is initialized to patterns {2,3,4}. The model is then tested on consecutive patterns until it finds a positive pattern, j. Now the training set is updated to patterns {3,4,j} and the model is retrained. This process is repeated until the user has no more patterns to train on.

hyperparameters chosen in the training phase, the generalization capacity of the models can be evaluated.
CHAPTER 5

RESULTS AND ANALYSIS

This chapter presents the analysis and results of the methods presented in Chapter
4. Section 5.1 presents the analysis of user input behaviour over time. Section 5.2
presents the results from training and the hyperparameters chosen in that stage. It also
presents analysis of the BA methods used on different sensors. Section 5.3 presents
tests of overfitting with respect to phones and Section 5.4 shows the results for different
patterns.

5.1 Convergence of user inputs


The results of the convergence tests are presented in Figure 5.1. The general behaviour is that both the local DTW-distance and the input times decrease. The input time converges faster than the local DTW-distance; however, the decreasing behaviour seems to continue beyond the 50 inputs in the data set. In the case of the local DTW-distance, the XY input seems to converge faster than the motion sensor inputs.

Figure 5.1: Time-variability of user input behaviour. (A) Normalized input time per input number. (B) Normalized DTW distance per input number, for the XY, accelerometer, gyroscope and full inputs.


5.2 Outsourced data set


The training results and the hyperparameters chosen are presented in Table 5.1. Figure
5.2 shows the ROC-curves for the mean results of the DBA method for different num-
bers of samples. The ROC-curve is shown with and without the standard deviation to
demonstrate the uncertainty of the results. Figure 5.3 shows a zoomed-in version of the
ROC-curve together with the median values. Lastly, Table 5.2 presents the performance
of the BA methods for the individual sensors. Two further results are not shown in these
tables or figures: the EBA method, like DBA, performs best with 25 samples, while
the OCSVM method performs best with 100 samples.
These graphs and tables show that the BA methods perform better than the other
methods on the training data. They also show that, when all dimensions are taken into
account, the simpler EBA performs almost as well as the more computationally complex DBA.

Table 5.1: Equal error rate and chosen hyperparameters for the methods used. The EER for the
mean values of EBA and DBA (marked with asterisks) were found at different α's.

TRAINING EER
Algorithm   Hyperparameters            Mean     Median   Std
EBA         α = 5.6                    *0.088   0.072    0.091
DBA         α = 5.6, τ = 1, iter = 4   *0.096   0.057    0.087
OCSVM       ν = 0.1, γ = 0.25          0.19     0.18     0.17
LOF         α = 1.6, k = 4             0.19     0.16     0.14
ND          α = 4.3                    *0.18    0.15     0.14


[Panels: TPR vs. FPR for the DBA mean results, one curve per number of samples; (A) without and (B) with standard deviation.]

Figure 5.2: ROC-curves for DBA with different numbers of samples, traced out by varying α.
(A) Without standard deviation. (B) With standard deviation.

Table 5.2: Equal error rate for the BA methods using single sensors.

SENSOR EER, MEAN (MEDIAN)
Algorithm   XY            vx vy         ACC and GYRO   ACC           GYRO
EBA         0.42 (0.34)   0.37 (0.32)   0.17 (0.10)    0.28 (0.26)   0.20 (0.15)
DBA         0.22 (0.20)   0.28 (0.27)   0.11 (0.078)   0.17 (0.16)   0.13 (0.10)


[Panels: magnified TPR vs. FPR for DBA, one curve per number of samples; (A) mean, (B) median.]

Figure 5.3: Magnified ROC-curves for DBA with different numbers of samples, traced out by
varying α. (A) Mean. (B) Median.

However, that changes when the methods are trained on individual sensors.
The relative difference in mean between EBA and DBA for all sensors together is 9.1 %,
while the single-sensor differences vary between 35 % and 48 %. The picture changes
somewhat when looking at the median, where the difference is 21 % for all sensors
together and ranges between 15 % (vx vy) and 42 % (XY).
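For concreteness, both overall figures follow directly from Table 5.1, taking the EBA values as the reference:

(0.096 − 0.088) / 0.088 ≈ 9.1 % (mean),    (0.072 − 0.057) / 0.072 ≈ 21 % (median).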

5.3 Tests with multiple phones


The average test results from the second inhouse data set are presented in Figure 5.4.
The figure on the left hand side presents the true positive rate and the figure on the
right hand side presents the false positive rate. The effect of using different phones
can be seen: the false positive rate is higher when the attack data comes from the same
phone model as the user data, and the true positive rate is likewise higher when the
genuine user's data comes from the same phone model.
These rates are slightly higher than the EER presented in Table 5.1, which suggests
that the hyperparameters are set differently from what they should be.
[Bar charts of TPR and FPR per algorithm (EBA, DBA, OCSVM, LOF, ND), comparing same-phone and different-phone data.]

Figure 5.4: Results for the different phones. (A) True positive rate. (B) False positive rate.

Figure 5.5: Multiple phone results with hyperparameters changed by 10 %. (A) True positive
rate. (B) False positive rate.

In Figure 5.5 the mean results are presented with the hyperparameters set 10 % smaller
than during training. The false positive rates decrease for all models but the OCSVM,
and the same is true for the true positive rates. This confirms the hypothesis that the
hyperparameters are not correctly set for these phones and this pattern.

5.4 Multiple patterns


The results for the different patterns are presented in Figure 5.6. For most methods,
the choice of pattern does not matter much. The OCSVM performs worse with the
nunchuck and the L, while the spiral produces higher rates than the other patterns.
Also, BA methods perform better than the other methods no matter which pattern is
used. Figures 5.7, 5.8 and 5.9 show mean results for the different phones. The results
vary with respect to phone.

[Bar charts of TPR and FPR per algorithm (EBA, DBA, OCSVM, LOF, ND) and pattern (Z, snake, nunchuck, L, spiral).]

Figure 5.6: The mean results for all five patterns, averaged from data of all phone models. (A)
True positive rate. (B) False positive rate. Black lines denote the standard deviation.


Figure 5.7: Mean results for all five patterns, using data from the Huawei. (A) True positive
rate. (B) False positive rate. Black lines denote the standard deviation.


Figure 5.8: Mean results for all five patterns, using data from the Samsung S7. (A) True
positive rate. (B) False positive rate. Black lines denote the standard deviation.


Figure 5.9: Mean results for all five patterns, using data from the Samsung S7 Edge. (A) True
positive rate. (B) False positive rate. Black lines denote the standard deviation.
CHAPTER 6
DISCUSSION AND CONCLUSIONS

This chapter concludes the thesis with a discussion of the methods and results (Section
6.1), suggestions for future work (Section 6.2) and my concluding remarks (Section 6.3).

6.1 Discussion

6.1.1 User input behaviour


As can be seen in Figure 5.1, user input behaviour tends to stabilize over time, with
respect to both local DTW similarity and input time. The improvement in input time
suggests that muscle memory impacts input time more than it impacts the similarity of
motion. However, due to the properties of the DTW-distance (it does not fulfil the
triangle inequality, for example), these seemingly small changes can still be significant.
Note that the XY-similarity increases faster than the motion sensor similarity. This
suggests that people learn how to input the pattern on the screen faster than they learn
a motion behaviour. Intuitively, this makes sense, as the screen gives feedback while
the pattern is being input. Also, some of the motions on the screen are essentially forced,
as you have to pass through certain regions to input the password correctly. This is
in contrast with the motion that a hand can make while inputting a pattern: the wrist
can move freely, and there is no feedback from the phone showing how the wrist
moved. I propose that the combination of more freedom of movement and the lack of a
feedback mechanism is the reason for the slower convergence of the motion input behaviour.
Another observation is that these users had to input two patterns during the study,
as that is how the study and application were designed. As suggested in [11], trying
to learn multiple behaviours at once decreases the rate of muscle memory development,
which might be another contributing factor to the seemingly slow convergence of input
behaviour.
Lastly, the continuing downward trend suggests that the users on average had not yet
learned the pattern as well as they could. While the results are useful to show that
progress happens over the first 50 inputs, they also show that more extended tests are
needed to allow users to fully learn the input patterns.

6.1.2 Different phones and patterns


By inspecting the results, it appears that they vary with respect to method, phone
and pattern. The rates presented in Sections 5.3 and 5.4 differ from the EER presented
in Section 5.2. This is because the EER depends on both the false positive rate and the
true positive rate: when the hyperparameters are changed such that the false positive
rate increases, the true positive rate should also increase. This tendency is most clear
in Figure 5.5, where I changed the hyperparameters and saw drastically different
results. This shows that the hyperparameters need to be set with respect to both the
phone model and the pattern.

6.1.3 Method comparison


The results suggest that the BA methods perform best out of the methods that I have
tested. The LOF method and the ND method, which use the extracted features, seem to
perform differently from the OCSVM, but as discussed earlier, this is most likely due
to the hyperparameters not being set correctly with respect to the phone model and
pattern.
I find it interesting that the BA methods behaved differently from my expectations
with respect to the resampling. I expected the case with 100 samples to perform better
than the other cases. For the OCSVM this is true, but not for the BA methods. This
suggests that resampling does not only remove information but also removes part of
the noise. The fact that it works well should not come as a surprise, as the approach
that inspired this work [15] downsampled the data to 16 samples and got good results.
The observation that the OCSVM needs more data likely has to do with the model:
using gaussian kernels gives the model more variance, which makes it more
susceptible to noise.
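As an aside, the hyperparameters in Table 5.1 map directly onto, for example, scikit-learn's one-class SVM. The thesis does not name an implementation, so this is only an illustrative sketch with random placeholder features:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.standard_normal((10, 32))    # 10 genuine inputs, 32 features

# nu and gamma as in Table 5.1; a larger gamma narrows the gaussian
# kernels, increasing variance and thus sensitivity to noise.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.25).fit(X_train)

scores = ocsvm.decision_function(rng.standard_normal((3, 32)))
print(scores > 0)                          # positive score means "inlier"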
Overall, the results can probably be further improved by optimizing the parame-
ters during training. This is something I experimented with for the BA methods,
but I eventually decided to devote my time to implementing more methods rather than
fine-tuning results. With the results in hand, it would have been interesting to further
improve on the BA methods, but I have to leave that for future work.

6.1.4 Identification power with different sensors


As suggested by Table 5.2, the data from the motion sensors yields better results with
the BA methods than the screen data does. This may appear counterintuitive considering
that the user input behaviour for those sensors converges more slowly than for the screen data.
However, as I discussed earlier, there is no feedback mechanism for the motion input.

The hand can move more freely than the finger on the screen, which gives room for
more individual behaviour in the hand motion. Perhaps that is what increases the
discrimination performance of the motion sensor data.

6.2 Future work

6.2.1 Improving the models


Overall, the models work well compared to other studies, such as [17] and [18], which
reported EERs of 8 % and 10 %. As said in the discussion, the first thing I would improve is to
find a good way to adapt the hyperparameters during training. Another way would
be to identify hyperparameters that work well for all possible patterns and phones,
but that would require studies with all phones and patterns, which would be very
challenging. One approach that I would like to test with the BA methods is to calculate
the standard deviation over the training set and update the threshold by some small
multiple of the standard deviation, essentially forming a gradient descent method.
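A minimal sketch of that idea, assuming the detector rejects an input when its distance to the barycenter exceeds the threshold; the step size eta and the exact update rule are my assumptions, not something specified above:

import numpy as np

def update_threshold(threshold, window_distances, rejected_genuine, eta=0.1):
    """Nudge the decision threshold by a small multiple of the training
    window's standard deviation: loosen it after a false rejection of a
    genuine input, tighten it slowly otherwise."""
    sigma = np.std(window_distances)       # spread of the genuine distances
    if rejected_genuine:
        return threshold + eta * sigma     # threshold was too strict
    return threshold - eta * sigma         # gradually tighten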
The methods based on extracted features did not perform well compared to the BA
methods. The fault may lie in the methods selected, but I think that the features chosen
also play a role here. Using other features, such as wavelet transforms with wavelets more
suited to this kind of data, might work better. Using some kind of dictionary learning to
identify the features might also be useful, for example as in [32].
Another interesting thing to test would be to combine learners by boosting [33], for
example to make the methods work together and increase their discriminatory power.

6.2.2 Better studies


The analysis of input behaviour convergence suggests that users have not yet learned
the pattern after 50 inputs, and inputting several patterns simultaneously further reduces
the learning rate. I would therefore suggest performing studies where the participants
have only one pattern to input over several weeks or months, and can only input the
pattern a few times per day. This would test the long-term effects of muscle memory for-
mation, and the higher number of samples means that more complex methods that require
large amounts of data (like neural networks [29, Chapter 11]) could become useful.

6.3 Conclusions
To conclude this thesis, I will answer the questions stated in the introduction:

• To what extent can password patterns be verified with machine learning meth-
ods?: The short answer is that the time series based methods that I have tested
have an equal error rate of around 10 %, while the methods not using time series
have an equal error rate of around 20 %. These results seem to hold across
different phone models and patterns. However, all models are based on global
hyperparameters for all users. The hyperparameters called α could be fitted dur-
ing training, making them regular parameters; if they are fitted during training,
the equal error rate can potentially be decreased further.

• Which of the typically available sensors are useful for that purpose?: In my
investigation I have found that using the screen position and velocity, ac-
celerometer and gyroscope together produces the best results. But this is not to
say that all sensors are of equal value. The screen data has the least discrimi-
natory power when used with time-series based methods, followed by the ac-
celerometer, with the gyroscope being the best. I think that this reflects the fact
that when inputting a pattern, the finger has to follow a certain trajectory. This is
contrasted by the unconstrained way you can move your arms and wrists during
input. The unconstrained movement means there are more ways for individual
behaviour to be learned.

• How many times must a typical user input a pattern so that the pattern data is
consistent enough to be used with machine learning methods?: My tests show
that using the first ten inputs is enough for the models to perform well. How-
ever, the data shown in Figure 5.1 suggests that users are still learning the pattern
after 50 inputs. Long studies, of at least 500 inputs over many weeks, are needed to
investigate how long it takes for a user to stabilize their input behaviour.
REFERENCES

[1] I. Jermyn, A. J. Mayer, F. Monrose, M. K. Reiter, A. D. Rubin, et al., "The design and analysis of graphical passwords," in Usenix Security, pp. 1–14, 1999.

[2] P. Dunphy and J. Yan, "Do background images improve draw a secret graphical passwords?," in Proceedings of the 14th ACM conference on Computer and communications security, pp. 36–47, ACM, 2007.

[3] H. Tao and C. Adams, "Pass-go: A proposal to improve the usability of graphical passwords," IJ Network Security, vol. 7, no. 2, pp. 273–292, 2008.

[4] L. Standing, "Learning 10000 pictures," The Quarterly Journal of Experimental Psychology, vol. 25, no. 2, pp. 207–222, 1973.

[5] D. L. Nelson, V. S. Reed, and J. R. Walling, "Pictorial superiority effect," Journal of Experimental Psychology: Human Learning and Memory, vol. 2, no. 5, p. 523, 1976.

[6] G. Ye, Z. Tang, D. Fang, X. Chen, K. I. Kim, B. Taylor, and Z. Wang, "Cracking android pattern lock in five attempts," in The Network and Distributed System Security Symposium, 2017.

[7] A. J. Aviv, K. L. Gibson, E. Mossop, M. Blaze, and J. M. Smith, "Smudge attacks on smartphone touch screens," Woot, vol. 10, pp. 1–7, 2010.

[8] A. A. E. Ahmed and I. Traore, "Anomaly intrusion detection based on biometrics," in Information Assurance Workshop, 2005. IAW'05. Proceedings from the Sixth Annual IEEE SMC, pp. 452–453, IEEE, 2005.

[9] M. Harbach, A. De Luca, and S. Egelman, "The anatomy of smartphone unlocking: A field study of android lock screens," in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 4806–4817, ACM, 2016.

[10] S. Uellenbeck, M. Dürmuth, C. Wolf, and T. Holz, "Quantifying the security of graphical passwords: the case of android unlock patterns," in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pp. 161–172, ACM, 2013.

[11] R. Shadmehr and T. Brashers-Krug, "Functional stages in the formation of human long-term motor memory," Journal of Neuroscience, vol. 17, no. 1, pp. 409–419, 1997.

[12] D. Weir, S. Rogers, R. Murray-Smith, and M. Löchtefeld, "A user-specific machine learning approach for improving touch accuracy on mobile devices," in Proceedings of the 25th annual ACM symposium on User interface software and technology, pp. 465–476, ACM, 2012.

[13] D. Buschek, A. De Luca, and F. Alt, "Improving accuracy, applicability and usability of keystroke biometrics on mobile touchscreen devices," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1393–1402, ACM, 2015.

[14] D. Buschek, A. De Luca, and F. Alt, "Evaluating the influence of targets and hand postures on touch-based behavioural biometrics," in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 1349–1361, ACM, 2016.

[15] Y. Li, "Protractor: a fast and accurate gesture recognizer," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2169–2172, ACM, 2010.

[16] J. O. Wobbrock, A. D. Wilson, and Y. Li, "Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes," in Proceedings of the 20th annual ACM symposium on User interface software and technology, pp. 159–168, ACM, 2007.

[17] J. Angulo and E. Wästlund, "Exploring touch-screen biometrics for user identification on smart phones," in IFIP PrimeLife International Summer School on Privacy and Identity Management for Life, pp. 130–143, Springer, 2011.

[18] A. De Luca, A. Hang, F. Brudy, C. Lindner, and H. Hussmann, "Touch me once and i know it's you!: implicit authentication based on touch screen patterns," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 987–996, ACM, 2012.

[19] A. J. Aviv, B. Sapp, M. Blaze, and J. M. Smith, "Practicality of accelerometer side channels on smartphones," in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 41–50, ACM, 2012.

[20] L. Cai and H. Chen, "Touchlogger: Inferring keystrokes on touch screen from smartphone motion," HotSec, vol. 11, pp. 9–9, 2011.

[21] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E. Keogh, "Experimental comparison of representation methods and distance measures for time series data," Data Mining and Knowledge Discovery, pp. 1–35, 2013.

[22] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.

[23] L. Chen and R. Ng, "On the marriage of lp-norms and edit distance," in Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, pp. 792–803, VLDB Endowment, 2004.

[24] M. Cuturi, "Fast global alignment kernels," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 929–936, 2011.

[25] M. Müller, Information Retrieval for Music and Motion, vol. 2. Springer, 2007.

[26] E. Keogh, "Exact indexing of dynamic time warping," in Proceedings of the 28th international conference on Very Large Data Bases, pp. 406–417, VLDB Endowment, 2002.

[27] G. A. ten Holt, M. J. Reinders, and E. Hendriks, "Multi-dimensional dynamic time warping for gesture recognition," in Thirteenth annual conference of the Advanced School for Computing and Imaging, vol. 300, 2007.

[28] F. Petitjean, A. Ketterlin, and P. Gançarski, "A global averaging method for dynamic time warping, with applications to clustering," Pattern Recognition, vol. 44, no. 3, pp. 678–693, 2011.

[29] E. Alpaydin, Introduction to Machine Learning. The MIT Press, 3rd ed., 2014.

[30] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.

[31] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.

[32] J. J. Thiagarajan, K. N. Ramamurthy, and A. Spanias, "Multilevel dictionary learning for sparse representation of images," in Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop (DSP/SPE), 2011 IEEE, pp. 271–276, IEEE, 2011.

[33] R. E. Schapire, "The boosting approach to machine learning: An overview," in Nonlinear Estimation and Classification, pp. 149–171, Springer, 2003.
