A Study On Security Analysis of The Unlinkability in Tor
Abstract—Tor is an important tool for anonymous communication. However, due to its popularity, Tor also attracts the attention of censors and other malicious attackers. A large body of existing work examines Tor's susceptibility to flow correlation attacks, which are a form of deanonymization attack. Currently, a variety of traffic obfuscation plugs are deployed on Tor to defeat flow correlation attacks, but whether plug-based defense can effectively resist those attacks has not been tested and verified. This paper focuses on how to effectively defeat the threat of flow correlation attacks on Tor. We first conduct experiments to illustrate the actual defense effect of obfuscation plugs against flow correlation attacks, and the results show that their effectiveness is limited. The obfuscation-based defense also faces many other problems, such as censorship. We therefore explore techniques from differential privacy (FPA_k and d*-private) to apply a tiny perturbation to Tor traffic. Our findings suggest that the perturbations generated by the two mechanisms can successfully defeat the existing flow correlation attacks on Tor.

Index Terms—Differential privacy, Tor, traffic obfuscation, flow correlation attacks.

I. INTRODUCTION

Anonymous systems offer an important and effective method to protect the privacy of users. Unlinkability is an important property that all anonymous systems strive to achieve. A system is deemed to have unlinkability if an adversary who is able to scan any number of network flows cannot determine whether the egress and ingress segments are from the same connection [1]. As one of the most popular anonymous systems, Tor provides unlinkability through many mechanisms, such as onion circuits and anonymous domain generation [2]. In spite of these efforts, Tor is still threatened by flow correlation attacks [1], [3], [4]. An adversary who observes the two ends of a target user's Tor connections can carry out flow correlation attacks to acquire the relationship information of senders and receivers.

Most flow correlation techniques are based on statistical correlation metrics, such as cosine similarity [5] and Spearman's rank correlation coefficient [6]. By calculating the similarity or statistical dependence of traffic characteristics (such as packet timings and packet sizes), the attackers can correlate traffic with a reasonable true positive (TP) rate and false positive (FP) rate [7]. With the recent development of deep learning (DL), an adversary can carry out even more powerful attacks [1]. A DL model automatically captures the dynamic, complex nature of noise in Tor, which makes flow correlation attacks more efficient.

To remove the threat of these attacks, many protocol obfuscation tools have been deployed on Tor, such as FTE [8], Meek [9], and Obfs4 [10]. However, most deployed obfuscation plugs only obfuscate packet contents, not traffic features [9], [11], [12], so whether such plugs can resist the existing flow correlation attacks is still an open question. Besides, the defense effect of obfuscation plugs in the face of statistical correlation metrics-based attacks is a relatively understudied aspect.

In this work, we comprehensively analyze the defense effect of obfuscation tools. We demonstrate that the performance of plug-based defense is relatively poor when the obfuscation plug does not obfuscate packet features (using Obfs4 with IAT mode "off" as the example). Even when the plug does obfuscate packet features (using Obfs4 with IAT mode "on" as the example), attackers with continuously enhanced learning capacity can still invalidate the defense by expanding the obfuscated-traffic training dataset [1].

To fill this gap, our goal is to find a generic and effective approach that protects Tor (and similar anonymity systems) from flow correlation attacks. Defending against flow correlation attacks is essentially a matter of perturbing the flow correlation classifier. Both statistical correlation metrics-based and DL-based attacks learn the natural statistical features of Tor traffic, so we consider how to perturb the statistical characteristics of Tor traffic to make the classifiers misclassify.

We make a close reading of the research on how to craft perturbations that mislead classifiers [13]–[16]. We find that perturbations generated by differential privacy (DP) algorithms can guarantee that certain classes cannot be distinguished by any classifier [15], [16]. A DP algorithm returns a "privacy preserving" sequence which protects the correct class of the flow pairs. Zhang et al. [15] demonstrate the effectiveness, security, and performance of a DP algorithm used to counter traffic analysis on encrypted video streaming packets. Inspired by that work, we adjust the original DP algorithms to fit the constraints of Tor traffic and explore their adaptation as a defense against flow correlation attacks.

We explore two DP mechanisms, FPA_k [17] and d*-private [16], to apply differential perturbations to Tor traffic.
The Fourier Perturbation Algorithm (FPA_k) provides privacy protection for correlated time-series data through the Discrete Fourier Transform (DFT) in a differentially private manner. The d*-private mechanism makes the data meet the requirements of differential privacy through a series of operations [16]. Our experimental results show that both FPA_k and d*-private can push the TP and FP rates down to the baseline result, that is, random guessing. With properly selected parameters, even when the classifiers are retrained and reinforced with noised data, FPA_k and d*-private can still successfully defeat the existing flow correlation attacks.

Contribution. We highlight the following three main contributions:
• We reveal the real threat of flow correlation attacks on Tor by conducting state-of-the-art attacks.
• We reveal that most obfuscation plugs are flawed in the face of flow correlation attacks and that they also face many other threats.
• We are the first to apply differential privacy to defend against flow correlation attacks. The proposed approach can greatly mitigate the attacks even when the DL model is retrained with the noised data.

Organization. We recall the background knowledge in Sec. II. Sec. III illustrates our motivation by conducting flow correlation attacks on Tor. Sec. IV describes the plug-based defense and demonstrates its limitations. Sec. V shows the defense effect of our approach. Lastly, we conclude our work in Sec. VI.

II. BACKGROUND

A. Flow correlation attacks on Tor

Flow correlation attacks on Tor link the egress and ingress flows that come from the same Tor connection by comparing flow characteristics. The attacks can be divided into two categories. The first is based on statistical correlation metrics and is performed by measuring the similarity or the statistical dependence of two random variables. A similarity measure is a function that quantifies the similarity between two objects; the most commonly used similarity measure for real-valued vectors is cosine similarity, which is used in multiple correlation systems [5], [7]. Statistical dependence can be used to indicate the relevance of two flows; for example, the Pearson correlation coefficient [5] reflects the degree of linear correlation between two variables.

The second category is based on deep learning. Deep learning can learn the inherent laws and representation levels of samples, eliminating the need for manual feature construction. One of the most attractive deep learning structures is the CNN, and researchers have recently begun to apply CNNs to analyze Tor traffic. A recent study by Nasr et al. [1] proposes DeepCorr, a CNN-based system with excellent performance in correlating Tor flow pairs. Compared with statistical correlation metrics-based attacks, DeepCorr is more efficient: it achieves an extremely high TP rate at a low FP rate (e.g., a TP rate of 0.8 when the FP rate is 10^-3).

B. Differential privacy

Differential privacy (DP) is concerned with whether a small change in a database can cause privacy leakage. In order to make it difficult for observers to detect changes by observing the output of a computation over the database, random noise is added to the computation results. A randomized algorithm A gives (ε, δ)-differential privacy if, for any set of outputs Ω and for any neighbouring datasets D and D', A satisfies

  Pr[A(D) ∈ Ω] ≤ exp(ε) · Pr[A(D') ∈ Ω] + δ        (1)

The parameter ε is the privacy budget, which is negatively correlated with the intensity of the noise.

Another formal definition of DP is (d, ε)-privacy, proposed by Chatzikokolakis et al. [18]. Specifically, a mechanism A satisfies (d, ε)-privacy if

  Pr(A(D) ∈ Ω) ≤ exp(ε × d(x, x')) × Pr(A(D') ∈ Ω)        (2)

Here d(x, x') is a distance function that satisfies d(x, x) = 0, d(x, x') = d(x', x), and d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ Ω. The d*-private mechanism will be introduced in Sec. V.

In our context, we explore whether the perturbations generated by an algorithm A can defeat the flow correlation attacks. Our approach relies on two observations. The first is that the generation of perturbations can be viewed as a privacy protection problem in a specific domain [15]. The second is that DP can successfully resist most privacy attacks and provides a provable privacy guarantee [19]. So we use DP to generate a more privacy-protected representation of the input.

III. FLOW CORRELATION ATTACKS ON TOR

In this section, we illustrate our motivation with an example of how to infer user association relationships by conducting flow correlation attacks on Tor.

A. Datasets

We use the publicly available dataset of DeepCorr [1] (we pick 25,000 random flows) and the CAIDA 2019 anonymized Internet traces¹ (we pick 25,000 random flows, each at least 1000 packets long). The former dataset provides the timing features (represented by inter-packet delays) and the packet-size sequences of the ingress and egress flows. The latter dataset consists of raw pcap files, from which we extract the Inter-Packet Delay (IPD) sequence and the packet-size sequence, and we use the network jitter generation method of [5], [20], [21] to simulate network jitter. So the network flow F_i is represented as follows:

  F_i = [T_i ; S_i]        (3)

¹The CAIDA dataset, https://www.caida.org/data/passive/passive_dataset.xml
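To make Eq. (3) and the statistical metrics of Sec. II-A concrete, the sketch below (our illustration in plain numpy, not the paper's code; the sequence length n = 300 and all names are assumptions) builds F_i = [T_i; S_i] from raw timestamps and packet sizes and computes the three statistical correlation metrics used later in Sec. III-C.

```python
import numpy as np

def flow_representation(timestamps, sizes, n=300):
    """F_i = [T_i; S_i]: IPD sequence and packet-size sequence, truncated/zero-padded to n."""
    ipd = np.diff(np.asarray(timestamps, dtype=float))  # inter-packet delays
    T = np.zeros(n)
    S = np.zeros(n)
    T[:min(n, ipd.size)] = ipd[:n]
    s = np.asarray(sizes, dtype=float)
    S[:min(n, s.size)] = s[:n]
    return T, S

def cosine(a, b):
    """Cosine similarity between two real-valued vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation coefficient (degree of linear correlation)."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    return pearson(ra, rb)
```

For an associated pair (the egress flow equals the ingress flow plus small jitter), all three metrics land near 1, while unrelated pairs land near 0; the attacks of Sec. III threshold exactly this gap.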
MILCOM 2021 Track 3 - Cyber Security and Trusted Computing
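The guarantee of Eq. (1) in Sec. II-B is typically realized with the Laplace mechanism, a standard DP building block (shown here as general background, not as this paper's mechanism): adding Lap(Δ/ε) noise to a query of sensitivity Δ gives ε-differential privacy, i.e., (ε, 0)-DP.

```python
import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon, rng):
    """Release query_result + Lap(b) noise with scale b = sensitivity / epsilon.

    A smaller privacy budget epsilon gives a larger noise scale and thus
    stronger privacy, matching the negative correlation noted in Sec. II-B.
    """
    scale = sensitivity / epsilon
    return query_result + rng.laplace(loc=0.0, scale=scale)
```

For two neighbouring databases whose query answers differ by at most the sensitivity, the output densities differ by at most a factor of e^ε, which is exactly the bound of Eq. (1) with δ = 0.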
B. Threat model

We use the threat model of flow correlation attacks adopted in previous work [1], [6]. The attackers intercept egress and ingress network flows by controlling malicious Tor relays or cooperating with malicious ISPs, and try to link associated flow pairs by computing statistical correlation metrics over traffic characteristics or by using a DL model. The attackers need to determine which of the following two hypotheses is true:
• Correlated (H1): F_i and F_j are correlated, i.e., F_i and F_j are from the same Tor connection and F_j is a noisy version of F_i naturally perturbed by the Tor network.
• Non-correlated (H0): F_i and F_j are not correlated, i.e., F_j is not a noisy version of F_i.
We have that:

  H1: T_j = T_i + Δt ; S_j = S_i + Δs
  H0: T_j = T* + Δt ; S_j = S* + Δs        (4)

where T_i is the IPD sequence of F_i and S_i is the packet-size sequence of F_i. T* and S* are the traffic characteristics of an arbitrary flow not related to F_j, and Δt and Δs are the perturbations naturally generated by the Tor network.

C. Attacks

For the statistical correlation metrics-based attacks, we choose three correlation algorithms: Pearson correlation [5], cosine similarity correlation [5], and Spearman rank correlation [6]. For the deep learning-based attack, we use the structure proposed by Nasr et al. [1], called DeepCorr.

D. Metrics

As in previous work [1], [5], [6], we use two metrics to evaluate the performance of flow correlation attacks, namely the TP and FP rates:

  TPR = TP / (TP + FN) ;  FPR = FP / (TN + FP)        (5)

The TPR represents the proportion of associated flow pairs which are correctly predicted to be associated. The FPR represents the proportion of non-correlated flow pairs that are erroneously identified as correlated. Note that the value of the detection threshold, η, trades off FPR against TPR, so we use the ROC curve to illustrate the effect of flow correlation attacks.
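The quantities in Eq. (5) follow directly from the four confusion-matrix counts; a minimal sketch (ours, with hypothetical scores and labels) at a single threshold η:

```python
def tpr_fpr(scores, labels, eta):
    """Compute (TPR, FPR) of Eq. (5) for one detection threshold eta.

    scores: one correlation score per flow pair; labels: True for associated
    pairs (H1), False for non-associated pairs (H0). A pair is predicted
    correlated when its score exceeds eta.
    """
    tp = sum(s > eta and l for s, l in zip(scores, labels))
    fn = sum(s <= eta and l for s, l in zip(scores, labels))
    fp = sum(s > eta and not l for s, l in zip(scores, labels))
    tn = sum(s <= eta and not l for s, l in zip(scores, labels))
    return tp / (tp + fn), fp / (tn + fp)
```

Sweeping eta across the sorted scores yields one (FPR, TPR) point per threshold, which is how an ROC curve such as those in this paper's figures is traced.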
E. Correlation results

We use the four attacks to correlate Tor flow pairs. As shown in Fig. 1, the TP rates of the four attacks are about 0.8 when the FP rates are about 0.01 (the x axis is log10(FP)). The experimental results illustrate the actual threat of flow correlation attacks on Tor, which is precisely the motivation of this work.

Fig. 1. Correlation results with the four attacks.

IV. PLUG-BASED DEFENSE

A. Verification for plug-based defense

Tor has developed a variety of obfuscation plugs, which fall into three categories: randomization, protocol mimicry, and tunneling. Randomization refers to the use of encryption, random padding, and other methods to randomize the characteristics of Tor traffic, as in Obfs3 [11], Obfs4 [10], and ScrambleSuit [22]. Protocol mimicry imitates or masquerades as popular whitelisted protocols that are rarely suspected by adversaries. For example, SkypeMorph [23] embeds traffic between Tor clients and Tor bridges into Skype traffic, and FTE (format-transforming encryption) transforms the format of arbitrary packet contents into specified formats. Tunneling is one extreme of the mimicry logic, and a typical tunneling plug on Tor is Meek [9].

At present, the most effective and commonly used obfuscation plug is Obfs4 [24]. We therefore use the publicly available Obfs4 dataset of DeepCorr, which consists of 500 flows over Obfs4 with its two modes, IAT mode "on" and "off". IAT mode "on" means the plug obfuscates traffic features (the value of IAT is set to 1); IAT mode "off" means it does not (the value of IAT is set to 0). Since the experiment of Obfs4 against DeepCorr has already been conducted in [1], in this paper we only show the effect of Obfs4 against the statistical correlation metrics-based attacks. The result is shown in Fig. 2: the obfuscation technique (Obfs4 with IAT mode "on") can indeed mitigate the statistical correlation metrics-based attacks to a certain extent. However, the defense performance is relatively poor when the obfuscation plug does not obfuscate traffic features. For example, when the FPR is 0.001, the TPR of Spearman correlation reaches over 0.6.

B. Drawback

In addition to the poor performance when the IAT mode is "off", the plug-based defense has the following drawbacks.

Expanding the training dataset. Although obfuscation plugs can weaken the attack effect of DeepCorr, its training dataset contains only 400 flows, and Nasr et al. [1] point out that the accuracy of DeepCorr will be much higher for a real-world adversary who collects more training flows and achieves adequate training.

Censorship. Tor has attracted the attention of censors, and Tor nodes all over the network have begun to be blocked [3].
At present, almost nobody uses the FTE protocol. Obfs2 and Obfs3 were announced to be out of service as early as 2014 [11], [25]. ScrambleSuit can also be detected by various means, such as active detection technology [24], and is therefore easily blocked by censors. In addition, Zhang et al. [3] find that Meek faces the threat that coalescing egress traffic at cloud service providers increases vulnerability to flow correlation attacks.

High latency. In order to avoid slowing down connections, Tor relays refrain from obfuscating traffic features [1], and the majority of Tor bridges run Obfs4 with IAT mode "off" [26], which means they solely obfuscate packet contents, not traffic features. It can be seen from Fig. 2 that Obfs4 without traffic obfuscation (IAT = 0) is powerless against flow correlation attacks.

Therefore, considering the above aspects, Tor needs a new defense mechanism to defeat flow correlation attacks and ensure the anonymity of its users.

V. DIFFERENTIAL PRIVACY-BASED DEFENSE

We adapt two DP mechanisms, FPA_k and d*-private, which meet the two DP definitions of Sec. II-B respectively. Fig. 3 shows an overview of the differential privacy-based defense: the IPD and packet-size sequences of the flow pairs F_i and F_j are perturbed by FPA_k or d*-private, DeepCorr is first trained with clean data and then retrained with noised data, and the noised sequences are fed both to the deep learning-based attack and to the statistical correlation metrics-based attacks (Pearson correlation, cosine similarity correlation, and Spearman correlation), each of which outputs H1 or H0.

Fig. 3. Differential privacy-based defense.

A. Incorporating Tor traffic constraints

TABLE I
RESTRICTIONS AND ADJUSTMENT

Feature     | Constraint        | Adjustment
IPD         | Non-negative      | Set negative values to the minimum.
            | Delay sensitive   | Control the noise intensities by ε in FPA_k and d*-private.
Packet size | Non-negative      | Set negative values to the minimum.
            | Fixed-length cell | Set the values of special Tor cell lengths to 512 bytes.
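The adjustments of Table I amount to a projection back onto valid Tor traffic, applied after the noise (post-processing does not weaken a DP guarantee). A minimal sketch, ours, where the 512-byte cell length comes from Table I but `cell_tol` and the function name are assumptions:

```python
import numpy as np

TOR_CELL = 512.0  # fixed Tor cell length from Table I

def enforce_tor_constraints(ipd_noised, size_noised, ipd_min, size_min, cell_tol=32.0):
    """Project noised sequences back into the space of valid Tor traffic.

    - IPDs and packet sizes must be non-negative: clamp negative values to
      the sequence minimum (Table I: "set negative values to minimum").
    - Sizes near the fixed-length Tor cell are snapped back to 512 bytes;
      cell_tol is an assumed tolerance, not a value from the paper.
    """
    ipd = np.asarray(ipd_noised, dtype=float).copy()
    size = np.asarray(size_noised, dtype=float).copy()
    ipd[ipd < 0] = ipd_min
    size[size < 0] = size_min
    size[np.abs(size - TOR_CELL) <= cell_tol] = TOR_CELL
    return ipd, size
```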
B. Fourier Perturbation Algorithm (FPA_k)

The Fourier Perturbation Algorithm (FPA_k) is a DP algorithm for time-series data that relies on the Discrete Fourier Transform (DFT). The DFT transforms an n-dimensional time series from the original domain R to the frequency domain F. Specifically, the j-th element of the frequency-domain sequence F = (F[1], ..., F[n]) is given as

  F(j) = DFT(R)_j = Σ_{i=1..n} e^{(2π√-1/n)·ji} · R_i        (6)

Similarly, the Inverse Discrete Fourier Transform (IDFT) transforms the frequency domain back to the original domain (usually the time domain). The j-th element of the original sequence R = (R[1], ..., R[n]) is given as

  R(j) = IDFT(F)_j = (1/n) · Σ_{i=1..n} e^{-(2π√-1/n)·ji} · F_i        (7)

Algorithm 1 The FPA_k algorithm
Input: the original sequence R; the scale λ of the Laplace distribution; the parameter k in the DFT.
Output: the noised sequence R̃.
1: Compute F^k = DFT^k(R), the first k Fourier coefficients of R.
2: Compute F̃[i] = F^k[i] + Lap(λ) for i = 1, ..., k.
3: Return R̃ = IDFT(F̃^k), with F̃^k padded to length n with zeros.
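Algorithm 1 can be sketched with numpy's FFT (an illustrative implementation, not the authors' code; numpy's FFT uses the opposite exponent sign from Eq. (6), which is immaterial for the mechanism, and adding independent Laplace noise to the real and imaginary parts is our reading of step 2):

```python
import numpy as np

def fpa_k(R, lam, k, rng):
    """FPA_k (Algorithm 1): perturb the first k Fourier coefficients.

    1. F^k = first k coefficients of DFT(R).
    2. Add Lap(lam) noise to each kept coefficient (here independently to
       the real and imaginary parts, one illustrative choice).
    3. Zero-pad back to length n and return the real part of IDFT(F~).
    """
    n = len(R)
    Fk = np.fft.fft(R)[:k]
    noise = rng.laplace(0.0, lam, k) + 1j * rng.laplace(0.0, lam, k)
    F_tilde = np.zeros(n, dtype=complex)
    F_tilde[:k] = Fk + noise
    return np.real(np.fft.ifft(F_tilde))
```

With λ = √k·Δ₂(Q)/ε, as used later in Sec. V-D, FPA_k satisfies ε-differential privacy [17]; keeping only k = 10 coefficients preserves the coarse shape of the IPD sequence while destroying the fine-grained timing that correlation attacks rely on.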
C. d*-private mechanism

d*-private chooses the following metric d* for enforcing privacy, where x and x' denote two sequences:

  d*(x, x') = Σ_{i≥1} |(x[i] - x[i-1]) - (x'[i] - x'[i-1])|        (8)

d*-private [27] is an algorithm which implements d*-privacy. It uses noise drawn from a Laplace distribution to ensure privacy. The perturbation r_i added to x[i] differs by position, i.e.,

  r_i ~ Lap(1/ε)             if i = D(i)
        Lap(⌊log2 i⌋/ε)      otherwise        (9)

where Lap(b) is a Laplace distribution with scale b and location µ = 0, and D(i) is the largest power of two that divides i.

D. Perturbation results

We evaluate the performance of the perturbations using the ROC curve. We assume two kinds of attackers: an ordinary attacker, considered in this section, and a more powerful attacker who can retrain and reinforce the DL model, considered in the next section. The ordinary attacker does not consider whether the data is disturbed, so in this section we train DeepCorr with clean data and test with noised data.

For the two defense mechanisms, we use the parameter ε as input, which represents the strength of privacy protection. The smaller the ε, the higher the data confidentiality; the larger the ε, the higher the data availability. In order to protect data privacy, this parameter is usually set to a small value.

For FPA_k, we follow the setting in [15] and set k to 10, so the first 10 Fourier coefficients are kept. Besides, we calculate the L2 sensitivity Δ₂(Q) of the IPD and packet-size sequences respectively and generate the Laplace distribution scale λ = √k·Δ₂(Q)/ε. The ROC curves for ε = [0.01, 0.1, 1, 10] are shown in Fig. 4. For d*-private, the noise is added directly to the original sequence [15], so the required ε is smaller than for FPA_k and is set from 1e-7 to 1e-4. Fig. 8 shows the performance of d*-private. It can be seen from the figures that, under appropriately selected parameters ε, the two mechanisms can reduce the attacker's correlation ability to close to random guessing, indicating that the noised sequences generated by FPA_k and d*-private can effectively resist flow correlation attacks on Tor.

E. Retrain and reinforcement

In order to comprehensively analyze the defense effect of the DP algorithms, we assume a more powerful adversary who can obtain the noised data and retrain the DL model with it. We retrain the DeepCorr classifier with the noised sequences generated by FPA_k and d*-private.
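The per-position noise of Eq. (9) can be sampled as follows (our illustrative reading of the sampler: indices are 1-based as in the paper, and i = D(i) exactly when i is a power of two):

```python
import math
import numpy as np

def largest_pow2_divisor(i):
    """D(i): the largest power of two dividing i (i & -i for positive i)."""
    return i & -i

def dstar_noise(n, epsilon, rng):
    """Per-position Laplace perturbations r_i of Eq. (9) for i = 1..n."""
    scales = []
    for i in range(1, n + 1):
        if i == largest_pow2_divisor(i):  # i is a power of two, so i = D(i)
            scales.append(1.0 / epsilon)
        else:
            scales.append(math.floor(math.log2(i)) / epsilon)
    return rng.laplace(0.0, scales)
```

Adding this noise to an IPD or packet-size sequence, followed by the Table I adjustments, produces noised sequences of the kind evaluated in Secs. V-D and V-E.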