1 Introduction
In today's world, passwords are one of the most commonly used means of authentication and security: they are easy to implement, convenient for users, and easy to maintain and remember.
Password-based attacks fall into two main types: online attacks and offline attacks.
Examples of offline password attacks include eavesdropping on the communication channel and recording or manipulating the conversations taking place on it. Offline attacks are generally faster than online attacks: no online authentication server is involved, and the password-cracking mechanism runs offline. This is usually the case in a depth-first attack, where a single user is targeted at a time.
Online password attacks can be broadly classified into two types: targeted (depth-first) and untargeted (breadth-first) attacks. In a targeted attack, most guesses are directed at a small number of accounts of particular interest to the attacker; the email or social-networking accounts of celebrities, for example, are obvious targets. In an untargeted attack, guesses are spread across a huge number of accounts. Here the attacker's interest is not in a particular individual so much as in an information gain or an account response from any user. Accordingly, malicious or spam emails tend to be sent to accounts that are famous or in the limelight rather than to those that are obscure or securely filtered.
2 Abstract
This paper shows how to accurately estimate the odds that an observation x indicates that a request is malicious. Our main assumptions are that successful malicious logins are a small fraction of the total, and that the distribution of x in the admissible traffic is static, stationary, or very slowly changing. From these assumptions it shows how we can estimate the ratio of bad-to-good traffic among any set of requests. This raises the questions of how to find the subset of traffic that is admissible, and how to estimate the distribution of x from the least-attacked subsets of the admissible data; the odds ratio is then calculated from the estimate obtained.
A sensitivity analysis then shows that, even if we cannot find a subset with little attack traffic, our estimate of the odds is quite robust.
Considering the above challenges, in this paper we make only one assumption about attack traffic:
Assumption: Malicious logins are a small fraction of total logins. This should be inherently true of guessing attacks. First, any service for which it is not true (i.e., for which admissible logins do not greatly outnumber malicious logins) would seem to offer little utility to its users. Second, suppose in the attacker's best case a per-guess success rate of about 0.1%; if admissible requests succeed 95% of the time, and traffic is a 50/50 mix of benign and malicious, then malicious logins would be outnumbered by benign ones by about 950:1.
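The 950:1 figure follows directly from the stated rates; a quick check (numbers taken from the text above):

```python
# Rates quoted in the text above.
attacker_success = 0.001   # attacker's per-guess success rate (0.1%)
benign_success = 0.95      # admissible requests succeed 95% of the time
mix = 0.5                  # traffic is a 50/50 mix of benign and malicious

benign_logins = mix * benign_success        # successful benign logins
malicious_logins = mix * attacker_success   # successful malicious logins
print(round(benign_logins / malicious_logins))  # 950
```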
Ψ = P(mal)/P(¬mal) = (1 − α)/α

where α is the fraction of the traffic that is benign.
Using this deduction we can identify subsets of the data where the proportion of malicious traffic is very small and the traffic is mostly clean, i.e., α ≈ 1. In such subsets the observed distribution closely approximates that of the benign traffic: P(x) ≈ P(x | ¬mal).
where ˆ denotes an estimated value. Note that, to simplify the presentation, we describe our approach for a single categorical feature, and we assume that the distribution of features in the benign traffic is independent (A2) and slowly varying.
Returning to equation (2), we write the two component ratios of the
likelihood ratio test as Ψ and Θ(x). Ψ is the ratio of bad-to-good traffic (often
called the pre-observation odds); Θ(x) is the ratio of how common x is in the
malicious traffic over how common it is in legitimate traffic (often called the
likelihood ratio). The product, Θ(x) · Ψ, is often called the post-observation
odds.
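The post-observation odds are simple enough to state in code; a minimal sketch (the function and example values are ours, not from the paper):

```python
def post_observation_odds(psi, theta_x):
    """Post-observation odds = Theta(x) * Psi.

    psi     -- pre-observation odds, P(mal)/P(not-mal)
    theta_x -- likelihood ratio, P(x | mal)/P(x | not-mal)
    """
    return theta_x * psi

# Example: attacks are rare (psi = 0.01), but x is 50x more common in
# malicious than in legitimate traffic, so the odds rise to 0.5.
print(post_observation_odds(psi=0.01, theta_x=50.0))  # 0.5
```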
In addition, we assume that malicious logins are a small fraction of total logins (A1), which can be observed from the fact that legitimate users succeed at least 95% of the time while attackers fail at least 99% of the time. Moreover, we assume that the rate of login failures from benign users is independent of feature categories (A3): the benign failure rate should be approximately constant when users are grouped by features such as browser version, IP address pool, etc. This assumption does not hold for malicious traffic.
L = Lb + Lm
F = Fb + Fm

where L and F are the total successful logins and failures, and the subscripts b and m denote the benign and malicious components.
To estimate the last term in the likelihood ratio test, the ratio of bad traffic to good is:

Ψ(k) = P(mal)(k)/P(¬mal)(k) = (1 − α(k))/α(k) = (Lm + Fm)/(Lb + Fb)
Consider the ratio of fails to logins. Under our assumption,

F/L ≈ Fb/Lb + Fm/Lb
So if legitimate requests succeed 95% of the time, the attacker succeeds on 0.1% of guesses, and traffic is a 50/50 mix of malicious and benign, then L/Lb = 1 + Lm/Lb = 1.0011 ≈ 1.
Benign traffic is the aggregate of a large number of users (hundreds of millions) acting independently, so for a large enough number of requests, Fb/Lb ≈ const. Login failures from legitimate users happen at some rate p through typos, misremembered passwords or usernames, etc., which we assume occur independently across the population. So a total of Lb + Fb benign requests should result in p · (Lb + Fb) = Fb failures. The ratio of benign fails to logins should therefore be fairly constant over large enough collections of authentication requests. Hence, if we divide our data into large-enough subsets of requests, we get for any given subset:
F(k)/L(k) ≈ c + Fm(k)/Lb(k)
where c = p/(1 − p) and k is just an index over subsets. The subsets can
be any grouping of the login request data so long as they have enough entries
to ensure that the confidence interval is small relative to p. For example, the
subsets can have request data from particular time slots, requests that come
from particular IP-address ranges, requests with particular user-agent strings,
or requests against particular sets of accounts. Combinations of those things are
also possible, e.g., we could form a subset of request data during a particular
time slot, from a particular range of IP addresses against a certain collection of
accounts. We’ll take subsets that are grouped by time intervals (e.g., hour or
day) as well as by features such as browser and IP address; to avoid unnecessarily
encumbering the notation we don’t explicitly add an index for time.
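One way such a grouping might be implemented, as a sketch (the record fields and the /16 prefix choice are assumptions, not from the paper):

```python
import datetime
from collections import defaultdict

def make_subsets(requests):
    """Group login requests into subsets keyed by (day, IP /16 prefix, browser),
    accumulating successful logins L(k) and failures F(k) per subset."""
    subsets = defaultdict(lambda: {"L": 0, "F": 0})
    for r in requests:
        ip_prefix = ".".join(r["ip"].split(".")[:2])       # /16 address range
        key = (r["timestamp"].date(), ip_prefix, r["user_agent"])
        subsets[key]["L" if r["success"] else "F"] += 1
    return subsets

# Tiny illustration: two requests from the same day, range, and browser.
reqs = [
    {"timestamp": datetime.datetime(2020, 1, 1, 10), "ip": "1.2.3.4",
     "user_agent": "BrowserA", "success": True},
    {"timestamp": datetime.datetime(2020, 1, 1, 11), "ip": "1.2.9.9",
     "user_agent": "BrowserA", "success": False},
]
print(make_subsets(reqs)[(datetime.date(2020, 1, 1), "1.2", "BrowserA")])
# {'L': 1, 'F': 1}
```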
We are assuming that the benign failure rate p is independent across features;
that is, Fb (k)/Lb (k) ≈ const.
We expect the benign traffic from one particular range of IP addresses to fail at
the same rate as others. We are not assuming the same of the malicious traffic,
and we are not assuming the same of the observed traffic. Any excess in the observed ratio F(k)/L(k) above the baseline is due to malicious traffic. That is, if a subset has no malicious traffic, Fm(k) = 0 and then F(k)/L(k) = c = p/(1 − p). Further, since Fm(k)
must be positive, malicious traffic can cause increases in F(k)/L(k) above the baseline value c, but cannot cause decreases. Given the value c, we can estimate the malicious traffic observed in a particular subset.
Fm(k) = Lb(k) · (F(k)/L(k) − c) ≈ F(k) − c · L(k)
P(mal)(k)/P(¬mal)(k) ≈ Fm(k)/(Lb(k) + Fb(k))    (4)
= (F(k) − L(k) · c)/(L(k)(1 + c))
This involves only quantities that we observe directly (i.e., F(k) and L(k)) and
p = c/(1 + c) (to be estimated).
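Since equation (4) uses only observed counts, it is straightforward to compute once c is known; a minimal sketch (the example counts are assumptions):

```python
def estimate_fm(F, L, c):
    """Estimated malicious failures in a subset: Fm(k) ~ F(k) - c * L(k)."""
    return max(F - c * L, 0.0)   # Fm(k) cannot be negative

def estimate_odds(F, L, c):
    """Equation (4): P(mal)(k)/P(not-mal)(k) ~ (F(k) - c*L(k)) / (L(k)*(1 + c))."""
    return max(F - c * L, 0.0) / (L * (1 + c))

# Assumed counts: 1,000 failures, 10,000 logins, baseline c = 0.05.
print(estimate_fm(1000, 10000, 0.05))              # 500.0 malicious failures
print(round(estimate_odds(1000, 10000, 0.05), 5))  # 0.04762
```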
To get an accurate estimate of the constant p/(1 − p), we must estimate the rate at which users enter their password or username incorrectly.
Various Observations –
• Finally, users may be more reluctant to allow a browser to auto-fill credentials for a bank site than for a video-streaming service or online newspaper; for the latter sites this makes password submission more likely to be automated and thus less subject to error.
For all of these reasons it seems preferable to estimate p on a per-site basis and
to update this estimate periodically. Malicious traffic can increase F (k)/L(k)
above c but not decrease it. Thus, a simple upper-bound estimate for our
constant is:
ĉ = min_k F(k)/L(k) ≥ c    (5)
Here the minimization is taken over all subsets large enough to satisfy our assumption that the confidence interval is small enough to ensure that Fb/Lb is approximately constant.
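Equation (5) amounts to a minimization over the large-enough subsets; a sketch (the size threshold and counts are assumptions):

```python
def estimate_c(subsets, min_requests=5000):
    """Upper-bound estimate (5): c_hat = min_k F(k)/L(k), taken only over
    subsets with enough requests for F(k)/L(k) to be a stable estimate."""
    return min(F / L for F, L in subsets if F + L >= min_requests and L > 0)

# Assumed per-subset (failures, logins) counts.
subsets = [(900, 9500), (1200, 9000), (500, 9800)]
c_hat = estimate_c(subsets)
p_hat = c_hat / (1 + c_hat)  # estimated benign failure rate p
print(round(c_hat, 4), round(p_hat, 4))  # 0.051 0.0485
```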
Once we have this we can estimate p̂ = ĉ/(1 + ĉ). Call k0 the index of the subset of our data that achieves the minimum; k0 is the subset where Fm(k0)/Lb(k0) is minimized. It is easy to show that Fm(k)/Lb(k) is monotonically decreasing with α(k), and thus this subset contains the least attack traffic among all the subsets. Hence, in finding our estimate ĉ we have also found an estimate for the distribution of x over the benign traffic:

P̂(x | ¬mal) ≈ P(x)(k0)    (6)
Now, we can drop the k index for P(x | ¬mal), since it is constant across subsets.
On further simplifying the above equation, we can solve for α(k):

P(mal)(k)/P(¬mal)(k) = (1 − α(k))/α(k)    (7)
So we can now calculate α(k) for any subset, assuming we have first estimated P(mal)(k)/P(¬mal)(k). The malicious portion of the traffic is then 1 − α(k).
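Solving (7) for α(k) gives α(k) = 1/(1 + Ψ(k)), and the malicious portion is 1 − α(k); a minimal sketch:

```python
def alpha(psi):
    """Solve (7): psi = (1 - alpha)/alpha  =>  alpha = 1/(1 + psi)."""
    return 1.0 / (1.0 + psi)

print(alpha(0.0))   # 1.0 -> subset is entirely benign
print(alpha(0.25))  # 0.8 -> malicious portion is 1 - 0.8 = 0.2
```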
4 Overall Algorithm
First, we estimate the benign traffic, since it is mostly stable and does not need to be re-estimated often. From the stream of incoming requests, we select a time interval and divide the incoming data stream into S subsets using specific strategies. The overall algorithm can be summarized as:
1. In the first step of the algorithm, we calculate the estimated ratio of benign failures to benign logins: ĉ = min_k F(k)/L(k).
2. Using ĉ, we calculate the estimate p̂ = ĉ/(1 + ĉ), which is the benign failure rate.
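The steps above, together with the per-subset odds from equation (4) and α(k) from (7), can be sketched end to end; the input format, a list of (F(k), L(k)) pairs, is an assumption for illustration:

```python
def analyze(subsets):
    """Run the overall algorithm on per-subset (failures, logins) counts."""
    c_hat = min(F / L for F, L in subsets)   # step 1: c_hat = min_k F(k)/L(k)
    p_hat = c_hat / (1 + c_hat)              # step 2: benign failure rate
    results = []
    for F, L in subsets:
        # Ratio of bad-to-good traffic in subset k, equation (4).
        psi = max(F - c_hat * L, 0.0) / (L * (1 + c_hat))
        a = 1.0 / (1.0 + psi)                # alpha(k) from equation (7)
        results.append({"psi": psi, "alpha": a, "malicious": 1.0 - a})
    return c_hat, p_hat, results

c_hat, p_hat, res = analyze([(500, 9800), (1200, 9000)])
# The minimizing subset comes out (numerically) clean: psi ~ 0, alpha ~ 1.
```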
5 Conclusion
The problem addressed in this paper is to protect an authentication server
from online password attacks mainly focusing on password guessing attacks.
The algorithm proposed in this paper helps to estimate the ratio of malicious-
to-benign traffic and hence determine the odds that an observed request is
malicious. The scope of the algorithm can be judged from the fact that it makes almost no assumptions about the properties of the attacker and avoids base-rate neglect. Following these principles, we identify the subsets of the request data with the least attack traffic, which help us estimate the distribution of the observation values over legitimate data and calculate the odds ratio.
6 Suggestions
• As with any statistical technique, it is important that we use features that distinguish well between attack and benign traffic.
• A main source of error is the possibility that we estimate c from a subset
that is not entirely benign, and contains some attack traffic. We can see
how this affects the value that we get for ĉ, and then examine how that inaccuracy propagates through the other quantities.
• The model is based entirely on taking a large dataset and estimating parameter values from it, using assumptions rather than prediction. We could therefore apply machine-learning methods to pre-process the data based on various output values before performing the analysis and estimation of the model.