MASTER'S THESIS

Jacob Nilsson

supervised by
Philip Lindblad
ABSTRACT

With the increased use of Android smartphones, the Android Pattern Lock graphical password
has become commonplace. The Android Pattern Lock is advantageous in that it is easier to
remember and is more complex than a five digit numeric code. However, it is susceptible to a
number of attacks, both direct and indirect. This fact shows that the Android Pattern Lock by
itself is not enough to protect personal devices. Other means of protection are needed as well.
In this thesis I have investigated five methods for the analysis of biometric data as an unnoticeable second verification step of the Android Pattern Lock. The methods investigated are the euclidean barycentric anomaly detector, the dynamic time warping barycentric anomaly detector, a one-class support vector machine, the local outlier factor anomaly detector and a normal distribution based anomaly detector. The models were trained using an online training strategy to enable adaptation to changes in the user's input behaviour. The model hyperparameters were fitted using a data set with 85 users. The models were then tested with other data sets to illustrate how different phone models and patterns affect the results.
The euclidean barycentric anomaly detector and dynamic time warping (DTW) barycentric anomaly detector have a sub-10 % equal error rate in both mean and median, while the other three methods have equal error rates between 15 % and 20 % in mean and median. The higher performance of the euclidean and DTW barycentric anomaly detectors is likely because they account for the time series nature of the data, while the other methods do not. Each user in the data set has provided each pattern at most 50 times, meaning that the long-term effects of user adaptation could not be studied.
PREFACE
The work in this thesis has been carried out at BehavioSec. I would like to thank them for giving me the opportunity to do this. It has given me a lot of insight into areas that I never thought I would go near. The work has been interesting and I have had a blast spending this short time with them. A special thanks goes to my supervisor Philip Lindblad for presenting interesting ideas and guiding me in the right direction. Next I would like to thank my examiner at LTU, Fredrik Sandin, for helping me when I needed it.

Lastly I would like to acknowledge the support that my parents have given me over the last five years, and also the friends I've made here. Without you, I don't think I would have coped with studying at this level.
Jacob Nilsson
Luleå, 7 June, 2017
SYMBOLS AND NOTATION
Notation        Meaning
s_i^t, s_i      Element i of time series s
D_{p,q}         Distance matrix between time series p and q
E_{p,q}         Accumulated distance matrix between time series p and q
α               Threshold parameter for distance-based anomaly detectors
‖·‖_p, ‖·‖      L_p-norm
O               Big O notation of time complexity
min f(·)        The minimum value of function f
argmin_a f(a)   The argument a that minimizes function f(·)
X               Training set
x*              Test sample
N(x)            The neighborhood of x
L(f)            Log-likelihood of f
CONTENTS
CHAPTER 1 – Introduction
  1.1 Background
    1.1.1 Android pattern lock
    1.1.2 Behavioural biometrics
  1.2 Problem description and limitations
    1.2.1 Limitations
  1.3 Related Work
    1.3.1 Research on touch screen usage
    1.3.2 Biometrics on the Android pattern lock
    1.3.3 Side-channel attacks using other sensors
  1.4 Outline
CHAPTER 2 – Theory
  2.1 Elastic and non-elastic measures of time series
    2.1.1 Dynamic time warping (DTW)
    2.1.2 DTW barycentric averaging
  2.2 One-class support vector machines
  2.3 Distance-based anomaly detection
    2.3.1 Barycentric anomaly detector
    2.3.2 Local outlier factor
    2.3.3 Normal density estimation
  2.4 Receiver-operating characteristic
CHAPTER 3 – Data
  3.1 Data collection
    3.1.1 Collection application
    3.1.2 Outsourced data set
    3.1.3 Inhouse data set #1
    3.1.4 Inhouse data set #2
  3.2 Pre-processing of data
    3.2.1 Uneven sampling of raw data
    3.2.2 Resampling of raw data
    3.2.3 Feature extraction
  3.3 Removal of bad users
    3.3.1 Phone on table
    3.3.2 Sensor malfunction
CHAPTER 4 – Methods
  4.1 Convergence of input behaviour
    4.1.1 Input time
    4.1.2 DTW-similarity
  4.2 Machine learning methods
  4.3 Learning algorithm
  4.4 Division of data into training and test sets
    4.4.1 Outsourced data set used as the training set
    4.4.2 Inhouse data sets used as test sets
References
CHAPTER 1

INTRODUCTION
1.1 Background
The mobile phone has become a tool which most of us cannot do without. It is used for communication, entertainment, data storage and more. As the usage of mobile phones has increased, we have grown more comfortable with saving sensitive data on them. With increasing amounts of data, the need to keep out malicious users increases as well. Traditionally, a PIN code has been used to log into a mobile phone. The PIN code works well for phones with a physical keyboard, but with the advent of smartphones, we now have the possibility of using graphical passwords.
A graphical password relies on visual cues for security. Graphical passwords can be as simple as identifying pictures, but also more complicated, such as the Draw-A-Secret scheme by Jermyn et al. [1]. The Draw-A-Secret scheme has the user draw a pattern on a screen with an N × N grid. When the pattern is drawn, the order in which the pattern enters the cells of the grid is stored as the password. Dunphy et al. [2] improved the Draw-A-Secret scheme by introducing background images, which made users choose more complex passwords. Another similar idea was introduced by Tao [3]. Instead of drawing freehand on a screen, in Tao's Pass-Go scheme the user connects intersections of a grid on the screen. Google's Android Pattern Lock is based on Tao's idea, but uses a smaller 3 × 3 grid.

The reason graphical passwords are preferable to PIN codes is the pictorial superiority effect: we learn to remember graphical patterns more easily than strings of numbers and letters [4, 5].
Like all other passwords, there are ways for malicious users to get hold of graphical passwords and access a phone without permission. Graphical passwords are more susceptible to shoulder-surfing attacks (where an attacker looks at your device while you input your password) than PIN codes due to the pictorial superiority effect. Ye et al. [6] used recordings of password inputs to find what pattern was entered. From the footage, they could recover the basic motion of the fingers on the screen and then find candidates for the pattern. Using those candidates, they could guess the password with 95 % accuracy within the first five attempts. Their method of attack was also better at finding large patterns, as a more complex motion reduced the number of candidate patterns. The footage they used was captured with all kinds of cameras, including smartphone cameras, and the camera angles varied and could be so extreme that the screen was not visible. Another vector of attack is the smudge attack. When using a touch screen, your fingers leave oily residues called smudges. If an attacker gets hold of your phone, they can use these smudges to deduce the pattern. Aviv et al. [7] tested how strong these kinds of attacks can be. They showed that a picture of the smudges can be enough to figure out the password. They also tested the resilience of the smudges and found that they can persist after using the screen for a while. The only way to make sure that a smudge attack could not be used was to wipe the screen clean.
Due to these security issues, extra verification is necessary. One way to impose extra security is to use behavioural biometrics, the discipline of measuring an individual's behaviour. One way this is used is to identify users with keyboard dynamics [8]. If such biometric information can be used to verify the correct user during log-in, a second verification step can be implemented invisibly. As users spend over a minute a day logging in to their devices [9], this verification needs to be fast and non-intrusive.
1.1.1 Android pattern lock

A valid Android Pattern Lock (APL) pattern must follow these rules:

• The lines must go through at least 4 dots (ensuring at least one change of direction).
• A dot can only be used once.
• If a line passes through a previously unused dot, that dot will be used.
There is a total of 389 112 possible patterns, which is more than the number of PIN codes of length 5 [7]. Figure 1.1 presents an example of a valid APL password.
Despite the fact that there are more possible APL patterns than there are PIN codes of length 5, the APL is not necessarily more secure. In a large study, Uellenbeck et al. [10] tested how people use the APL. First, participants answered where their personal APL pattern started, and then they got to input a pattern that they thought was secure. The study revealed that 38 % of users started their pattern in the top left corner, and that the corners started 75 % of all patterns. Using the inputted data, they estimated that real patterns have an entropy slightly lower than that of a 3-digit PIN code. They concluded that even though the APL is in theory safer than a 5-digit PIN, in practice it is not.

Figure 1.1: Android Lock Pattern example. The sequence of this password is 1-5-7-4-2-8-9-6-3.
1.2 Problem description and limitations

The research questions addressed in this thesis are:

• To what extent can password patterns be verified with machine learning methods?
• Which of the typically available sensors are useful for that purpose?
• How many times must a typical user input a pattern so that the pattern data is consistent enough to be used with machine learning methods?
1.2.1 Limitations

The work in this thesis focuses on finding methods that can recognize the unique way a user uses their device and detect when malicious users try to enter the device. An Android phone contains many different sensors. Here I focus on the XY-position of fingers on the screen, the accelerometer data and the gyroscope data. Other sensors, such as touched area and finger pressure, do not work the same on all phones, and I therefore do not consider them. The goal of this thesis is to develop the methods for two-step verification; creating an application using these methods is beyond the scope of this thesis.
1.3 Related Work

1.3.1 Research on touch screen usage

Due to the ubiquity of smartphones there is a lot of research on how we use touch screens. Weir et al. [12] used gaussian process regression to improve the accuracy of button presses on a Nokia N9. The regression model was trained on each test user to provide individual improvements. Buschek et al. [13] go further with this idea and use screen data with features akin to keystroke dynamics [8] to provide an extra verification step for normal passwords on smartphones. They also provided a framework for inputting the password with different hand postures. The same group has also published research on how the target shape and size affect touch screen accuracy [14]. An interesting contribution was the construction of an index of individuality: they used a gaussian regression model to classify users and calculated the decrease in entropy, which they used as an index of individuality. As mentioned earlier, the APL is just one kind of graphical password. Li [15] created a free-form gesture recognizer called Protractor, which uses screen XY-positions and downsamples this information to only 16 samples. Despite the heavy downsampling, Li gets results comparable to the more computationally expensive $1 free-form gesture recognizer [16].
1.4 Outline
This thesis is organized as follows: Chapter 2 contains the theory for the methods used in this thesis. Chapter 3 describes the data and the pre-processing steps taken. Chapter 4 describes the methods used for analysis and testing. Chapter 5 contains the results of the analysis and testing. Lastly, Chapter 6 contains the discussion, my thoughts on future work and my conclusions.
CHAPTER 2

THEORY
In this chapter I introduce the theory required to understand the methods described in Chapter 4. Section 2.1 describes what time series are, the difference between non-elastic and elastic measures, and the dynamic time warping algorithm in detail. Section 2.2 describes the one-class support vector machine. Section 2.3 describes three different distance-based anomaly detectors: the barycentric anomaly detector, the local outlier factor and the normal density estimator. Lastly, Section 2.4 covers the receiver operating characteristic, a standard anomaly detection performance measure.
2.1 Elastic and non-elastic measures of time series

D(q, s) = Σ_{i=1}^{n} ‖q_i − s_i‖.   (2.1)
The norm in this case can be the euclidean (p = 2), manhattan (p = 1) or any ‖·‖_p norm. In practical tests of time series classification using a 1-nearest-neighbour algorithm, the euclidean norm performed better than all other non-elastic measures [21]. Non-elastic measures work well when every time series is sampled in exactly the same way, but if the sample timings are inconsistent, the compared time series are no longer the same, which can decrease classification performance. Also, non-elastic measures require that each signal has the same number of samples, which need not be the case.
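As a concrete illustration (my own sketch, not code from the thesis), Equation (2.1) for multivariate samples can be expressed as:

```python
def lp_norm(v, p=2):
    """L_p norm of a vector v."""
    return sum(abs(x) ** p for x in v) ** (1 / p)

def non_elastic_distance(q, s, p=2):
    """Non-elastic distance of Equation (2.1): the sum of the L_p norms
    of the element-wise differences. Requires equal-length series."""
    if len(q) != len(s):
        raise ValueError("non-elastic measures require equal-length series")
    return sum(lp_norm([qi - si for qi, si in zip(qe, se)], p)
               for qe, se in zip(q, s))
```

Note that the rigid pairing of sample i with sample i is exactly what makes the measure non-elastic.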
To compare time series with timing inconsistencies or different lengths, elastic measures have been developed. Elastic measures try to find the samples in two time series that best fit each other and then measure the distances between them. Examples of such algorithms are dynamic time warping (DTW) [22], edit distance with real penalty [23] and more [21]. Dynamic time warping was one of the earliest conceived elastic measures and, despite many attempts to create better algorithms, Wang et al. have shown that dynamic time warping is in most cases as good as or better than more recent algorithms [21].
1 = π_1 ≤ π_i ≤ π_{i+1} ≤ π_N = N.   (2.2)
Figure 2.1: Visual representation of the alignment of two time series. The aligned elements are (1,1), (1,2), (2,2), (3,3), (3,4) and (4,5). (A) The matrix representation of the alignment. (B) The alignment between the series, shown by thick black lines.
The alignment distance between s_i and q_j is the euclidean distance between the points: D_{i,j} = |s_i − q_j|. The matrix D is called the distance matrix.
The DTW alignment problem can now be defined through the accumulated distance matrix E, built with the recurrence

E_{i+1,j+1} = D_{i+1,j+1} + min(E_{i,j}, E_{i,j+1}, E_{i+1,j}).   (2.3)

Even though I use the term "DTW-distance", it is not a distance metric in the normal sense, as it does not satisfy the triangle inequality [25, Section 4.1]. The algorithm used to calculate the DTW distance is outlined as follows: first calculate the distance matrix D(i, j), then create a new accumulated distance matrix E(i, j) and set its elements according to the recurrence above, returning E_{n,m} as the DTW distance. A visual interpretation of this process is shown in Figure 2.2.

Figure 2.2: Visual explanation of how the DTW algorithm accumulates distance (Equation (2.3)). For each pairing (i + 1, j + 1) it adds the distance from the distance matrix and the smallest accumulated distance from the left, lower-left or lower pairing.
Since DTW compares each element of q with each element of s, it has a computational complexity of O(mn). This is worse than the euclidean distance, whose complexity increases linearly with the size of the time series. To decrease the computational complexity, the Sakoe-Chiba band was introduced. The Sakoe-Chiba band restricts which elements are considered during the calculation: it only allows elements that are at most T steps away from the diagonal to be used (Figure 2.3). Since each element of the time series needs to have an alignment, the minimum width of the Sakoe-Chiba band is |m − n|. If this criterion is not met, there will be some elements of the longer time series that are not aligned. The Sakoe-Chiba band decreases the complexity to linear time (O((T + 1) · min(n, m))) and can also help with classifier performance [26]. If T is equal to one, the DTW reduces to the euclidean distance.
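The recurrence and the Sakoe-Chiba band can be sketched in Python as follows. This is a minimal illustration of my own, not the thesis's Algorithm 1; the absolute difference is used as the local distance and the band width T is optional:

```python
import math

def dtw_distance(s, q, band=None):
    """DTW distance via the accumulated-distance recurrence: each cell
    adds the local distance |s[i] - q[j]| to the smallest of the
    lower-left, lower and left accumulated cells. `band` is an optional
    Sakoe-Chiba band width T; band >= |len(s) - len(q)| is required
    for a full alignment to exist."""
    n, m = len(s), len(q)
    E = [[math.inf] * (m + 1) for _ in range(n + 1)]
    E[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell outside the Sakoe-Chiba band
            d = abs(s[i - 1] - q[j - 1])
            E[i][j] = d + min(E[i - 1][j - 1], E[i - 1][j], E[i][j - 1])
    return E[n][m]
```

The nested loops make the O(mn) cost explicit; the band check skips all cells further than T from the diagonal, which is what brings the cost down to O((T + 1) · min(n, m)).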
In the case of multi-dimensional time series, the basic DTW algorithm described above will not work, since it only takes one-dimensional time series into account. The multi-dimensional DTW (MD-DTW) [27, Chapter 4] algorithm works in the same way in every regard apart from how the distance matrix D is calculated. Each element of D is the sum of differences over the K dimensions of the time series:

D_{i,j}(s, q) = Σ_{k=1}^{K} |s_i^k − q_j^k|.   (2.6)
Figure 2.3: Sakoe-Chiba band visualization for T = 1. The gray elements will not be considered when performing DTW calculations.
Euclidean averaging requires the time series to be of equal length. Also, euclidean averaging only has one possible alignment, (s_i, q_i) for all i. DTW, on the other hand, lets you measure distance between time series of different lengths. This implies that an average sequence c can have a length different from any time series in the set it is averaging. Also, the optimal alignment between time series in DTW is not known before calculating the distance, see Figure 2.4. One approach for averaging time series using DTW is DTW barycentric averaging.
Figure 2.4: The problems with averaging time series with DTW. (A) An average sequence with respect to DTW-distance (c_1 and c_2, for example) can have arbitrary length, as the DTW-distance is an elastic measure. (B) All possible alignments between element c_i and the elements of s and q. It is not known beforehand which alignments are part of the optimal alignment.
Figure 2.5: DTW barycentric averaging algorithm. (A) Initiate c. (B) Find the optimal alignments of c with all time series. (C) Set c'_i as the average of all elements that c_i aligns to. (D) Replace c with c' and repeat steps 2 and 3 until the end conditions are met.
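The iteration in Figure 2.5 can be sketched as follows. This is my own minimal illustration: the helper names are mine, the initial average is simply the first series, and a fixed iteration count stands in for the end conditions:

```python
import math

def _dtw_alignment(c, s):
    """Optimal DTW alignment between series c and s (absolute-difference
    local cost), returned as a list of index pairs (i, j)."""
    n, m = len(c), len(s)
    E = [[math.inf] * (m + 1) for _ in range(n + 1)]
    E[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = abs(c[i - 1] - s[j - 1]) + min(
                E[i - 1][j - 1], E[i - 1][j], E[i][j - 1])
    # backtrack from (n, m) to (1, 1) along the cheapest predecessors
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((E[i - 1][j - 1], i - 1, j - 1),
                   (E[i - 1][j], i - 1, j),
                   (E[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    return list(reversed(path))

def dba_average(series, iterations=4):
    """DTW barycentric averaging (Figure 2.5): start from some series,
    then repeatedly replace each element of the average with the mean
    of all elements it aligns to."""
    c = list(series[0])                        # (A) initiate c
    for _ in range(iterations):
        buckets = [[] for _ in c]              # elements aligned to each c_i
        for s in series:
            for i, j in _dtw_alignment(c, s):  # (B) optimal alignments
                buckets[i].append(s[j])
        c = [sum(b) / len(b) for b in buckets]  # (C) + (D) update c
    return c
```

Because every DTW path covers every index of c, each bucket is guaranteed to be non-empty.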
2.2 One-class support vector machines

The one-class support vector machine (OCSVM) fits the smallest possible sphere, with centre a and radius R, that encloses as much of the training data as possible:

min R² + C Σ_t ξ_t   (2.7)
s.t. ‖x_t − a‖² ≤ R² + ξ_t and ξ_t ≥ 0, ∀t,   (2.8)
Figure 2.6: Example of a one-class SVM with a gaussian kernel. The red region predicts anomalies.
where ξ_t is called the slack variable for the t-th training sample and C is a regularization factor.

The slack variable is the "enclose as much as possible" part of the definition. This problem can be cast as a convex optimization problem by maximizing the Lagrangian L_d:
max L_d = Σ_t α_t (x_t)ᵀ x_t − Σ_t Σ_s α_t α_s (x_t)ᵀ x_s   (2.9)
s.t. 0 ≤ α_t ≤ C and Σ_t α_t = 1.   (2.10)
Here, α_t is a Lagrange multiplier for the t-th training sample [29, Section 13.12].
In Equation (2.9), we can see that the optimization problem does not rely on individual training points, but rather on the inner products between training samples. The inner product belongs to a class of functions called positive-definite kernels, often just called kernels. A kernel is a function K : ℝⁿ × ℝⁿ → ℝ that is symmetric, K(x, y) = K(y, x), and positive semi-definite. For one-class support vector machines, the kernel can be seen as a coordinate transform. It is used by simply exchanging the inner product in Equation (2.9) for the kernel:

max L_d = Σ_t α_t K(x_t, x_t) − Σ_t Σ_s α_t α_s K(x_t, x_s).   (2.11)

This allows the OCSVM to use arbitrary smooth shapes (depending on what kernel is used), instead of only circles.
A commonly used kernel is the gaussian kernel

K(x, y) = exp(−‖x − y‖² / γ²),   (2.12)

where γ is the length-scale of the kernel. This kernel gives higher values for points that are closer together, meaning it emphasizes the smoothness of the data [30, Section 4.2]. An example of the decision regions of an OCSVM using a gaussian kernel can be found in Figure 2.6.
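Equation (2.12) is straightforward to express in code. The sketch below covers only the kernel itself; in practice, the full dual problem of Equation (2.11) would be solved with an off-the-shelf OCSVM implementation rather than by hand:

```python
import math

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel of Equation (2.12): exp(-||x - y||^2 / gamma^2).
    Nearby points give values close to 1, distant points close to 0;
    gamma is the length-scale."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / gamma ** 2)
```

Note the γ convention matches the thesis (γ² in the denominator); some libraries instead parameterize the kernel as exp(−γ‖x − y‖²).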
2.3 Distance-based anomaly detection

A distance-based anomaly detector labels a test sample x* as an anomaly when f(x*, X) > α. Here f is any function with codomain [0, ∞) of the test sample and the training set, and α is the discriminating distance, or threshold.
Let m_d(X) be the mean of X with respect to some distance d(x, y). The barycentric anomaly detector is then defined by taking f(x*, X) = d(x*, m_d(X)), the distance from the test sample to the training-set mean. Examples of distances that can be used are the euclidean distance and the (MD-)DTW distance [28]. When used with those distances, the corresponding anomaly detectors are called euclidean BA and (MD-)DBA.
The local outlier factor (LOF) compares the density around a test sample to the density around its k nearest neighbors. It is formally defined as

LOF(x*, X) = d_k(x*) / ( Σ_{s∈N(x*)} d_k(s) / |N(x*)| ),   (2.15)

where d_k(y) is the euclidean distance from y to its k-th nearest neighbour and N(x*) is the set of the test sample's k nearest neighbors. Intuitively, the LOF tests whether the distances from the test sample to its neighbors are similar to the distances among the neighbors themselves. If the LOF is close to 1, they are probably in the same cluster, but if it is larger than 1 they are probably not [29, Section 8.7].
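Equation (2.15) can be implemented directly by brute force. A minimal sketch of my own (helper names are mine; production code would use a spatial index instead of sorting all distances):

```python
def _euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def _kth_nn_distance(y, points, k):
    """d_k(y): euclidean distance from y to its k-th nearest neighbour
    among `points` (the caller excludes y itself)."""
    return sorted(_euclid(y, p) for p in points)[k - 1]

def lof(x_star, X, k):
    """Local outlier factor of Equation (2.15): d_k(x*) divided by the
    mean d_k over the k nearest neighbours of x*. Values near 1 suggest
    x* lies in the same cluster as its neighbours."""
    d_star = _kth_nn_distance(x_star, X, k)
    neighbours = sorted(X, key=lambda p: _euclid(x_star, p))[:k]
    mean_dk = sum(
        _kth_nn_distance(s, [p for p in X if p is not s], k)
        for s in neighbours) / k
    return d_star / mean_dk
```

A test sample deep inside the training cluster yields a ratio near (or below) 1, while a distant sample yields a ratio far above 1.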
Figure 2.7: Illustration of barycentric anomaly detection. The black dots are the training samples and the cross is their average. The green disc and red triangle are test samples. All test samples outside the threshold (gray dashed circle) are labeled anomalies.
Figure 2.8: Illustration of local outlier factor anomaly detection. The neighborhood of the square, the green circle, is approximately the same size as the neighborhoods in the training set, the black circles. The neighborhood of the triangle, the red dashed circle, is much larger than the neighborhoods in the training set. Thus the square is not an anomaly while the triangle is.
With normal density estimation, each feature of the data is modeled as an independent, normally distributed stochastic variable ξ. The density function is

f_ξ(x) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²)),   (2.16)
where µ is the mean of the training data and σ is the standard deviation of the training data. For one-dimensional data, a test sample is said to be an anomaly when f_ξ(x*) < β. To save the step of calculating the exponential function, the log-likelihood can be used instead. The log-likelihood is the natural logarithm of the density function:

L(x, X) = log(f(x, X)) = −(x − µ)² / (2σ²) − (1/2) log(2πσ²).   (2.17)
2σ 2
The last term of the sum is constant with respect to x. This means that it can be dis-
carded when comparing different test values and the resulting normal distribution
equation becomes
( x ∗ − µ )2
ND( x∗ , X ) = , (2.18)
2σ2
2.4 Receiver-operating characteristic

ROC = TPR / FPR,   (2.19)

where TPR is the true positive rate and FPR is the false positive rate. The strength of the ROC for measuring anomaly detector performance shows when changing the hyperparameters of the model.
If the ROC is saved each time a model is trained with different hyperparameters, these values define a ROC-plot. Figure 2.9 shows an example of a ROC-plot. Each black dot represents the performance of a classifier with different hyperparameter values. The dotted line from (0, 0) to (1, 1) represents the randomness line. If a classifier lies close to this line, the true positive rate is equal to the false positive rate and the anomaly detector is essentially random. If it lies above the randomness line, the performance is better than random guessing. If it lies below the randomness line, the performance is worse than random guessing, but the method can be inverted to move it above the line.
Figure 2.9: Example of a ROC-curve. The upper green region represents better than random
performance. The red region means worse than random performance. The dashed gray diagonal
line is the EER line.
The dashed line from (1, 0) to (0, 1) is the line of equal error rate (EER), where the risk of misclassifying both true and anomalous data is equal [31].

In training, the objective has been to find the hyperparameters that produce an ROC-point as close as possible to the EER-line. In testing, these hyperparameters are used.
CHAPTER 3

DATA
This chapter describes the data used in this thesis and the steps taken to make it useful.
Section 3.1 describes the application used to collect the data and the different data sets
used. It also describes the patterns used. Section 3.2 describes the pre-processing steps
applied to make the data suitable for analysis with the machine learning methods used
in this thesis. Lastly, Section 3.3 describes the issues with some of the data and why it
was removed.
Figure 3.1: The patterns used in the data collection: Z, Snake, Nunchuck, L and Spiral.
In total, 14 different phones were used. The users were asked to perform the patterns Snake and Spiral 50 times each, see Figure 3.1. In total, 101 users participated in this study. Compared to other studies, this is a high number of users [17, 18]. Also, a larger variety of phones is used. Overall, the advantage of this data set is that it includes numerous users. The large number of phones and patterns means it is best suited for training because the variability of the data is high. The variability makes it hard to test for the effects of overfitting with respect to phone model and pattern.
3.1.3 Inhouse data set #1

A second data set was collected at BehavioSec. This data set is based on seven participants who input a single pattern 30 times each. This was done with seven different phones. This data set was collected to test the effects of using different phones in the outsourced data set.
3.1.4 Inhouse data set #2

Lastly, a third data set was collected at BehavioSec. This data set is based on 27 participants who input all the patterns in Figure 3.1. Three different phones were used. Not all users input all patterns on all phones. This data set was collected to test how the choice of pattern affects the results.
Figure 3.2: Demonstration of the unevenness of the sampling. Each dot corresponds to a sample from one of the three sensors (XY, ACC, GYRO). The X-axis is the time of sampling.
During data collection, the phone should not be placed on a table. The data collection application itself did not take this into account, so these users had to be sorted out after the data collection. First, I wrote a script identifying all users with a mean-square acceleration and rotation below a threshold value. The acceleration and rotation of these users were then plotted, and those I considered too flat were discarded.
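The filtering step can be sketched as follows. The data structure, names and threshold here are assumptions of my own, as the actual script is not reproduced in this thesis:

```python
def mean_square_magnitude(samples):
    """Mean squared magnitude of a list of (x, y, z) sensor samples."""
    return sum(x * x + y * y + z * z for x, y, z in samples) / len(samples)

def flag_phone_on_table(users, threshold):
    """Return the ids of users whose mean-square acceleration and
    rotation both fall below `threshold` -- candidates for the manual
    inspection and removal described above. `users` maps a user id to
    a dict with 'acc' and 'gyro' sample lists (assumed structure)."""
    flagged = []
    for user_id, sensors in users.items():
        if (mean_square_magnitude(sensors["acc"]) < threshold
                and mean_square_magnitude(sensors["gyro"]) < threshold):
            flagged.append(user_id)
    return flagged
```

The automatic threshold only produces candidates; the final decision is still made by inspecting the plotted signals, as described above.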
CHAPTER 4

METHODS

This chapter describes the methods used in this thesis, based on the theory presented earlier. Section 4.1 presents the methods used to measure how the users' input behaviour converges over time. Section 4.2 presents the machine learning methods used for identification. Section 4.3 presents the online learning strategy used to train the methods, and Section 4.4 presents how the different data sets were used to train the methods.
4.1.2 DTW-similarity

Using the DTW-distance it is possible to calculate the similarity of inputs. For each pattern, the pairwise DTW-distances between inputs n − 1, n, and n + 1 were calculated. Doing this for all n ∈ [2, 49] produces a vector of 48 values that shows the local similarity of inputs. As for the time similarity, these DTW similarities can be normalized with respect to the first similarity. Subsequently, an average over all users and patterns can be calculated. If the users' input behaviour converges on average, the similarity vector will consist of decreasing values.
The machine learning methods chosen for this investigation can be found in Table 4.1. The methods can roughly be divided into two categories: those regarding the data as a time series (DBA and euclidean BA, marked in gray) and those that do not. The BA methods and the OCSVM are used with the resampled time series data. For the BA methods, each input time series is considered as a single input. For the OCSVM, each sample in time is considered as an input vector. This means that the OCSVM will test all timings against each other as if they were taken at the same time. When testing, the OCSVM produces a prediction for each time step in the time series; if it predicts that half of the samples of a test input are anomalous, that input is also said to be anomalous. The LOF and density methods are applied to the extracted features. The LOF produces a single prediction and the density method produces one prediction per feature. The density method predicts an input as anomalous if more than 1/4 of the features are anomalous. This value was chosen through testing.
Typically in machine learning, an assumption made is that the samples are independently and identically distributed (iid). In my data this assumption can be made at the user level: users make their inputs independently of other users, and they aim to produce the same pattern. Each input a user makes, however, depends on how many times the user has produced the pattern before, due to the development of muscle memory [11]. This means that the model should change over time, in order to adapt to the developing muscle memory. To do this, an online training protocol was used for each user, where the n latest patterns classified as correct were used to train the model. The first training set for each user was set as patterns m − n to m. A flowchart of this training protocol is found in Figure 4.1. The scores saved for each user were the number of true positives, false positives, false negatives and true negatives, from which the true positive rate, false negative rate and ROC-curve were determined. From the average ROC-curve, the equal error rate (EER) was calculated. The EER is the point where the false positive rate and false negative rate are equal.
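The sliding-window protocol of Figure 4.1 can be sketched as follows. The `train` and `is_genuine` interfaces are hypothetical stand-ins for the anomaly detectors of Table 4.1, and the window initialisation mirrors the {2, 3, 4} example in the figure:

```python
def online_protocol(patterns, train, is_genuine, n=3, m=4):
    """Online training loop sketched in Figure 4.1: keep the n most
    recent patterns classified as genuine as the training set, and
    retrain after every accepted pattern. `train` builds a model from
    a list of patterns and `is_genuine(model, pattern)` is the model's
    verdict; both interfaces are assumptions of this sketch."""
    window = patterns[m - n:m]          # initial training set
    predictions = []
    for pattern in patterns[m:]:
        model = train(window)
        accepted = is_genuine(model, pattern)
        predictions.append(accepted)
        if accepted:                    # slide the window forward
            window = window[1:] + [pattern]
    return predictions
```

Rejected patterns leave the window untouched, so an impostor's inputs never contaminate the training set.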
Method           Hyperparameters
DBA              τ, α, iter
Euclidean BA     α
Gaussian OCSVM   ν, γ
LOF              k, α
Density          α

Table 4.1: Methods considered for testing, with the corresponding hyperparameters.
Figure 4.1: Training regime illustrated. First the training set is initialized to patterns {2,3,4}. The model is then tested on consecutive patterns until it finds a positive pattern, j. Now the training set is updated to patterns {3,4,j} and the model is retrained. This process is repeated until the user has no more patterns to train on.
By testing with the hyperparameters chosen in the training phase, the generalization capacity of the models can be evaluated.
CHAPTER 5
This chapter presents the analysis and results of the methods presented in Chapter
4. Section 5.1 presents the analysis of user input behaviour over time. Section 5.2
presents the results from training and the hyperparameters chosen in that stage. It also
presents analysis of the BA methods used on different sensors. Section 5.3 presents
tests of overfitting with respect to phones and Section 5.4 shows the results for different
patterns.
[Figure 5.1: (A) normalized input time and (B) normalized DTW distance for the Xy, Acc, Gyro and Full feature sets, plotted against input number.]

5.2 Outsourced data set
Table 5.1: Equal error rate and chosen hyperparameters for the methods used. The EERs for the mean values of EBA and DBA (marked with asterisks) were found at different α's.
TRAINING EER
Algorithm   Hyperparameters             Mean     Median   Std
EBA         α = 5.6                     *0.088   0.072    0.091
DBA         α = 5.6, τ = 1, iter = 4    *0.096   0.057    0.087
OCSVM       ν = 0.1, γ = 0.25           0.19     0.18     0.17
LOF         α = 1.6, k = 4              0.19     0.16     0.14
ND          α = 4.3                     *0.18    0.15     0.14