Proceedings of the
26th USENIX Security Symposium
External Reviewers
Hadi Abdullah Dov Gordon Mary Maller Josh Tan
Martin Albrecht Paul Grubbs Srdjan Matic Yuan Tian
Joey Allen Eric Gustafson William Melicher Erkam Uzun
Aslan Askarov Florian Hahn Abner Mendoza Luke Valenta
Sarah Azouvi Per Hallgren Kristopher Micinski Steven Van Acker
Musard Balliu Daniel Hausknecht Payman Mohassel Luis Vargas
Erick Bauman Grant Hernandez Magnus Myreen Giorgos Vasiliadis
Sruti Bhagavatula Sanghyun Hong Adwait Nadkarni Fish Wang
Antonio Bianchi Kevin Hong Claudio Orlandi Weiren Wang
Logan Blue Yuval Ishai Xiaorui Pan Xueqiang Wang
Kevin Borgolte Kai Jansen Kyuhong Park Wenhao Wang
Jasmine Bowers Yang Ji Andre Pawlowski Haopei Wang
Fraser Brown Jeff Johns Christian Peeters Zachary Weinberg
Swarup Chandra Lachlan Kang Giancarlo Pellegrino Lei Xu
Shang-Tse Chen Vishal Karande Chenxiong Qian Carter Yagemann
Simon Chung Yigitcan Kaya Willard Rafnsson Guangliang Yang
David Clark Marcel Keller Bradley Reaves Cong Zhang
Moritz Contag David Kohlbrenner David Rupprecht Nan Zhang
Vijay D’Silva Katharina Kohls Theodor Schnitzler Zhe Zhou
Philip Daian Benjamin Kollenda Daniel Schoepe Ziyun Zhu
Anupam Das Philipp Koppe Will Scott
Alex Davidson BumJun Kwon Mahmood Sharif
Martin Degeling Sangho Lee Yan Shoshitaishvili
Ruian Duan Yeonjoon Lee Alexander Sjösten
Mattia Fazzini Tongxin Li Kyle Soska
Yanick Fratantonio Xiapu Luo Octavian Suciu
Daniel Genkin Aravind Machiry Janos Szurdi
Message from the
26th USENIX Security Symposium
Program Co-Chairs
Welcome to the USENIX Security Symposium in Vancouver! The symposium, now in its 26th year, is a premier
venue for security and privacy research, bringing together researchers from industry and academia to discuss the
latest advances in computer security. The program consists of research papers selected during a peer-review
process, invited talks, and professional and social events.
This was the biggest year ever for our peer-review submissions process. Like last year, the symposium had two
chairs. We were supported by an excellent program committee (PC) consisting of 77 members. Of those, 35 were
remote PC members whereas 42 also attended an in-person PC meeting in Boston that lasted a day and a half. The
PC members did an incredible amount of work, and we can’t thank them enough for their efforts!
The double-blind review process this year worked as follows: We received 572 submissions—the most ever for the
symposium—by the submission deadline of February 16, 2017. Of these, 19 were desk-rejected for various violations
of the call for papers. Reviewing proceeded in two rounds. In the first round, every paper was assigned to
two reviewers, and 199 papers were marked for rejection after the first round (due to unanimously low scores by
two confident reviewers). We allowed a special appeals process in case of reviewer error; none of the appealed
papers were revived after inspection of the authors’ concerns. The remaining papers passed on to a second round,
and received one or more additional reviews.
In an online discussion phase, the committee attempted to accept or reject a large number of papers before the
in-person meeting. This was to ensure that there was sufficient time at the meeting to discuss the most contentious
papers. We pre-accepted 49 papers during online discussions and discussed about 80 papers during the meeting.
Each paper was allowed up to eight minutes of discussion, with few exceptions. The vast majority were easily
decided upon within that time. Unlike last year, we did not unblind the authors during the PC meeting and believe
the benefits (in terms of discouraging negative or positive bias) outweighed the few situations that arose in which the
PC members would have benefited from knowing the authorship. The chairs also decided to keep the PC members
somewhat in the dark about the number of papers being accepted over the course of the meeting in the hope that
discussions could focus on the merit of individual papers rather than on the need to “fill a program.”
At the meeting, the committee had an extensive discussion about whether more theory-oriented papers are suitably
in-scope for the symposium. Over the last few years, the symposium has seen an increasing number of papers that
only very tangentially target computer security problems faced in practice, most notably in the context of papers
developing or improving cryptographic protocols not yet addressing specific security applications. After extensive
discussion, the committee decided that the broader community would be better served by encouraging such papers
to seek other venues. This was not a decision taken lightly, and it resulted in a number of very good papers being
rejected for reason of fit.
Ultimately, the committee decided to accept 85 papers, a record number for the symposium. Of these, 38 were
conditionally accepted and shepherded to ensure that the camera-ready version of the paper reflected reviewer
feedback. In the end, this resulted in a 15% acceptance rate, meaning that the conference continues to be
exceptionally competitive. We believe the final set of accepted papers represents excellent work, and the authors
should be congratulated for their notable achievement!
The conference would not be possible without the help of a huge number of people. The staff at USENIX ensure
that everything runs smoothly behind the scenes, and Casey Henderson and Michele Nelson specifically helped us
in innumerable ways. The PC, of course, did a tremendous amount of work, with each member reviewing about 20
papers, for a total of over 1600 reviews and over 2200 comments across the entire process. We also thank the
external reviewers who were brought in due to their particular expertise to review a smaller number of papers. We would
also like to thank for their hard work: the invited talks committee (Michael Bailey, Casey Henderson, David Molnar,
and Franziska Roesner); the Test-of-Time award committee (Matt Blaze, Dan Boneh, Kevin Fu, and David Wagner);
the poster-session chair Nick Nikiforakis; and the lightning talks chairs Kevin Butler and Deian Stefan. Finally, we
thank all of the authors of the 572 submitted papers for participating in the 26th USENIX Security Symposium.
Engin Kirda, Northeastern University
Thomas Ristenpart, Cornell Tech
USENIX Security ’17 Program Co-Chairs
USENIX Security ’17:
26th USENIX Security Symposium
Bug Finding I
How Double-Fetch Situations turn into Double-Fetch Vulnerabilities: A Study of Double Fetches
in the Linux Kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Pengfei Wang, National University of Defense Technology; Jens Krinke, University College London; Kai Lu
and Gen Li, National University of Defense Technology; Steve Dodier-Lazaro, University College London
Postmortem Program Analysis with Hardware-Enhanced Post-Crash Artifacts. . . . . . . . . . . . . . . . . . . . . . . . 17
Jun Xu, The Pennsylvania State University; Dongliang Mu, Nanjing University; Xinyu Xing, Peng Liu,
and Ping Chen, The Pennsylvania State University; Bing Mao, Nanjing University
Ninja: Towards Transparent Tracing and Debugging on ARM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Zhenyu Ning and Fengwei Zhang, Wayne State University
Side-Channel Attacks I
Prime+Abort: A Timer-Free High-Precision L3 Cache Attack using Intel TSX. . . . . . . . . . . . . . . . . . . . . . . . . 51
Craig Disselkoen, David Kohlbrenner, Leo Porter, and Dean Tullsen, University of California, San Diego
On the effectiveness of mitigations against floating-point timing channels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
David Kohlbrenner and Hovav Shacham, UC San Diego
Constant-Time Callees with Variable-Time Callers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Cesar Pereida García and Billy Bob Brumley, Tampere University of Technology
Systems Security I
Neural Nets Can Learn Function Type Signatures From Binaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang, National University of Singapore
CAn’t Touch This: Software-only Mitigation against Rowhammer Attacks targeting Kernel Memory. . . . . 117
Ferdinand Brasser, Technische Universität Darmstadt; Lucas Davi, University of Duisburg-Essen; David Gens,
Christopher Liebchen, and Ahmad-Reza Sadeghi, Technische Universität Darmstadt
Efficient Protection of Path-Sensitive Control Security. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Ren Ding and Chenxiong Qian, Georgia Tech; Chengyu Song, UC Riverside; Bill Harris, Taesoo Kim, and
Wenke Lee, Georgia Tech
Bug Finding II
Digtool: A Virtualization-Based Framework for Detecting Kernel Vulnerabilities . . . . . . . . . . . . . . . . . . . . . 149
Jianfeng Pan, Guanglu Yan, and Xiaocao Fan, IceSword Lab, 360 Internet Security Center
kAFL: Hardware-Assisted Feedback Fuzzing for OS Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Sergej Schumilo, Cornelius Aschermann, and Robert Gawlik, Ruhr-Universität Bochum; Sebastian Schinzel,
Münster University of Applied Sciences; Thorsten Holz, Ruhr-Universität Bochum
Venerable Variadic Vulnerabilities Vanquished . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Priyam Biswas, Purdue University; Alessandro Di Federico, Politecnico di Milano; Scott A. Carr, Purdue
University; Prabhu Rajasekaran, Stijn Volckaert, Yeoul Na, and Michael Franz, University of California, Irvine;
Mathias Payer, Purdue University
Censorship
Global Measurement of DNS Manipulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Paul Pearce, UC Berkeley; Ben Jones, Princeton; Frank Li, UC Berkeley; Roya Ensafi and Nick Feamster,
Princeton; Nick Weaver, ICSI; Vern Paxson, UC Berkeley
Characterizing the Nature and Dynamics of Tor Exit Blocking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Rachee Singh, University of Massachusetts – Amherst; Rishab Nithyanand, Stony Brook University; Sadia Afroz,
University of California, Berkeley and International Computer Science Institute; Paul Pearce, UC Berkeley;
Michael Carl Tschantz, International Computer Science Institute; Phillipa Gill, University of Massachusetts –
Amherst; Vern Paxson, University of California, Berkeley and International Computer Science Institute
DeTor: Provably Avoiding Geographic Regions in Tor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Zhihao Li, Stephen Herwig, and Dave Levin, University of Maryland
Embedded Systems
SmartAuth: User-Centered Authorization for the Internet of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
Yuan Tian, Carnegie Mellon University; Nan Zhang, Indiana University, Bloomington; Yueh-Hsun Lin,
Samsung; Xiaofeng Wang, Indiana University, Bloomington; Blase Ur, University of Chicago; Xianzheng Guo
and Patrick Tague, Carnegie Mellon University
AWare: Preventing Abuse of Privacy-Sensitive Sensors via Operation Bindings . . . . . . . . . . . . . . . . . . . . . . . 379
Giuseppe Petracca, The Pennsylvania State University, US; Ahmad-Atamli Reineh, University of Oxford, UK;
Yuqiong Sun, The Pennsylvania State University, US; Jens Grossklags, Technical University of Munich, DE;
Trent Jaeger, The Pennsylvania State University, US
6thSense: A Context-aware Sensor-based Attack Detector for Smart Devices . . . . . . . . . . . . . . . . . . . . . . . . . 397
Amit Kumar Sikder, Hidayet Aksu, and A. Selcuk Uluagac, Florida International University
Networking Security
Identifier Binding Attacks and Defenses in Software-Defined Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Samuel Jero, Purdue University; William Koch, Boston University; Richard Skowyra and Hamed Okhravi, MIT
Lincoln Laboratory; Cristina Nita-Rotaru, Northeastern University; David Bigelow, MIT Lincoln Laboratory
HELP: Helper-Enabled In-Band Device Pairing Resistant Against Signal Cancellation. . . . . . . . . . . . . . . . . 433
Nirnimesh Ghose, Loukas Lazos, and Ming Li, Electrical and Computer Engineering, University of Arizona,
Tucson, AZ
Attacking the Brain: Races in the SDN Control Plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
Lei Xu, Jeff Huang, and Sungmin Hong, Texas A&M University; Jialong Zhang, IBM Research; Guofei Gu, Texas
A&M University
Targeted Attacks
Detecting Credential Spearphishing Attacks in Enterprise Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
Grant Ho, UC Berkeley; Aashish Sharma, The Lawrence Berkeley National Laboratory; Mobin Javed, UC
Berkeley; Vern Paxson, UC Berkeley and ICSI; David Wagner, UC Berkeley
SLEUTH: Real-time Attack Scenario Reconstruction from COTS Audit Data . . . . . . . . . . . . . . . . . . . . . . . . 487
Md Nahid Hossain, Stony Brook University; Sadegh M. Milajerdi, University of Illinois at Chicago; Junao Wang,
Stony Brook University; Birhanu Eshete and Rigel Gjomemo, University of Illinois at Chicago; R. Sekar and
Scott Stoller, Stony Brook University; V.N. Venkatakrishnan, University of Illinois at Chicago
When the Weakest Link is Strong: Secure Collaboration in the Case of the Panama Papers. . . . . . . . . . . . . 505
Susan E. McGregor, Columbia Journalism School; Elizabeth Anne Watkins, Columbia University; Mahdi
Nasrullah Al-Ameen and Kelly Caine, Clemson University; Franziska Roesner, University of Washington
Trusted Hardware
Hacking in Darkness: Return-oriented Programming against Secure Enclaves. . . . . . . . . . . . . . . . . . . . . . . . 523
Jaehyuk Lee and Jinsoo Jang, KAIST; Yeongjin Jang, Georgia Institute of Technology; Nohyun Kwak, Yeseul
Choi, and Changho Choi, KAIST; Taesoo Kim, Georgia Institute of Technology; Marcus Peinado, Microsoft
Research; Brent Byunghoon Kang, KAIST
vTZ: Virtualizing ARM TrustZone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
Zhichao Hua, Jinyu Gu, Yubin Xia, and Haibo Chen, Institute of Parallel and Distributed Systems, Shanghai
Jiao Tong University; Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong
University; Binyu Zang, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University; Haibing
Guan, Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University
Inferring Fine-grained Control Flow Inside SGX Enclaves with Branch Shadowing. . . . . . . . . . . . . . . . . . . . 557
Sangho Lee, Ming-Wei Shih, Prasun Gera, Taesoo Kim, and Hyesoon Kim, Georgia Institute of Technology;
Marcus Peinado, Microsoft Research
Authentication
AuthentiCall: Efficient Identity and Content Authentication for Phone Calls. . . . . . . . . . . . . . . . . . . . . . . . . . 575
Bradley Reaves, North Carolina State University; Logan Blue, Hadi Abdullah, Luis Vargas, Patrick Traynor, and
Thomas Shrimpton, University of Florida
Picking Up My Tab: Understanding and Mitigating Synchronized Token Lifting and Spending
in Mobile Payment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
Xiaolong Bai, Tsinghua University; Zhe Zhou, The Chinese University of Hong Kong; XiaoFeng Wang,
Indiana University Bloomington; Zhou Li, IEEE Member; Xianghang Mi and Nan Zhang, Indiana University
Bloomington; Tongxin Li, Peking University; Shi-Min Hu, Tsinghua University; Kehuan Zhang, The Chinese
University of Hong Kong
Web Security I
Extension Breakdown: Security Analysis of Browsers Extension Resources Control Policies. . . . . . . . . . . . . 679
Iskander Sanchez-Rola and Igor Santos, DeustoTech, University of Deusto; Davide Balzarotti, Eurecom
CCSP: Controlled Relaxation of Content Security Policies by Runtime Policy Composition. . . . . . . . . . . . . . 695
Stefano Calzavara, Alvise Rabitti, and Michele Bugliesi, Università Ca’ Foscari Venezia
Same-Origin Policy: Evaluation in Modern Browsers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
Jörg Schwenk, Marcus Niemietz, and Christian Mainka, Horst Görtz Institute for IT Security, Chair for Network
and Data Security, Ruhr-University Bochum
Privacy
Locally Differentially Private Protocols for Frequency Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
Tianhao Wang, Jeremiah Blocki, and Ninghui Li, Purdue University; Somesh Jha, University of Wisconsin
Madison
BLENDER: Enabling Local Search with a Hybrid Differential Privacy Model. . . . . . . . . . . . . . . . . . . . . . . . 747
Brendan Avent and Aleksandra Korolova, University of Southern California; David Zeber and Torgeir Hovden,
Mozilla; Benjamin Livshits, Imperial College London
Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA,
Privacy Leaks, and More. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
Peter Ney, Karl Koscher, Lee Organick, Luis Ceze, and Tadayoshi Kohno, University of Washington
Systems Security II
BootStomp: On the Security of Bootloaders in Mobile Devices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 781
Nilo Redini, Aravind Machiry, Dipanjan Das, Yanick Fratantonio, Antonio Bianchi, Eric Gustafson, Yan
Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna, UC Santa Barbara
Seeing Through The Same Lens: Introspecting Guest Address Space At Native Speed. . . . . . . . . . . . . . . . . . 799
Siqi Zhao and Xuhua Ding, Singapore Management University; Wen Xu, Georgia Institute of Technology;
Dawu Gu, Shanghai JiaoTong University
Oscar: A Practical Page-Permissions-Based Scheme for Thwarting Dangling Pointers . . . . . . . . . . . . . . . . . 815
Thurston H.Y. Dang, University of California, Berkeley; Petros Maniatis, Google Brain; David Wagner,
University of California, Berkeley
Web Security II
PDF Mirage: Content Masking Attack Against Information-Based Online Services. . . . . . . . . . . . . . . . . . . . 833
Ian Markwood, Dakun Shen, Yao Liu, and Zhuo Lu, University of South Florida
Loophole: Timing Attacks on Shared Event Loops in Chrome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 849
Pepe Vila, IMDEA Software Institute & Technical University of Madrid (UPM); Boris Köpf, IMDEA Software
Institute
Game of Registrars: An Empirical Analysis of Post-Expiration Domain Name Takeovers. . . . . . . . . . . . . . . 865
Tobias Lauinger, Northeastern University; Abdelberi Chaabane, Nokia Bell Labs; Ahmet Salih Buyukkayhan,
Northeastern University; Kaan Onarlioglu, www.onarlioglu.com; William Robertson, Northeastern University
Applied Cryptography
Speeding up detection of SHA-1 collision attacks using unavoidable attack conditions. . . . . . . . . . . . . . . . . . 881
Marc Stevens, CWI; Daniel Shumow, Microsoft Research
Phoenix: Rebirth of a Cryptographic Password-Hardening Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .899
Russell W. F. Lai, Friedrich-Alexander-University Erlangen-Nürnberg, Chinese University of Hong Kong;
Christoph Egger and Dominique Schröder, Friedrich-Alexander-University Erlangen-Nürnberg; Sherman S. M.
Chow, Chinese University of Hong Kong
Vale: Verifying High-Performance Cryptographic Assembly Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
Barry Bond and Chris Hawblitzel, Microsoft Research; Manos Kapritsos, University of Michigan; K. Rustan
M. Leino and Jacob R. Lorch, Microsoft Research; Bryan Parno, Carnegie Mellon University; Ashay Rane, The
University of Texas at Austin; Srinath Setty, Microsoft Research; Laure Thompson, Cornell University
Software Security
Towards Efficient Heap Overflow Discovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 989
Xiangkun Jia, TCA/SKLCS, Institute of Software, Chinese Academy of Sciences; Chao Zhang, Institute for
Network Science and Cyberspace, Tsinghua University; Purui Su, Yi Yang, Huafeng Huang, and Dengguo Feng,
TCA/SKLCS, Institute of Software, Chinese Academy of Sciences
DR. CHECKER: A Soundy Analysis for Linux Kernel Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1007
Aravind Machiry, Chad Spensky, Jake Corina, Nick Stephens, Christopher Kruegel, and Giovanni Vigna, UC
Santa Barbara
Dead Store Elimination (Still) Considered Harmful. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025
Zhaomo Yang and Brian Johannesmeyer, University of California, San Diego; Anders Trier Olesen, Aalborg
University; Sorin Lerner and Kirill Levchenko, University of California, San Diego
Side-Channel Attacks II
Telling Your Secrets Without Page Faults: Stealthy Page Table-Based Attacks on Enclaved Execution . . . 1041
Jo Van Bulck, imec-DistriNet, KU Leuven; Nico Weichbrodt and Rüdiger Kapitza, IBR DS, TU Braunschweig;
Frank Piessens and Raoul Strackx, imec-DistriNet, KU Leuven
Understanding Attacks
Understanding the Mirai Botnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1093
Manos Antonakakis, Georgia Institute of Technology; Tim April, Akamai; Michael Bailey, University of
Illinois, Urbana-Champaign; Matt Bernhard, University of Michigan, Ann Arbor; Elie Bursztein, Google;
Jaime Cochran, Cloudflare; Zakir Durumeric and J. Alex Halderman, University of Michigan, Ann Arbor; Luca
Invernizzi, Google; Michalis Kallitsis, Merit Network, Inc.; Deepak Kumar, University of Illinois, Urbana-
Champaign; Chaz Lever, Georgia Institute of Technology; Zane Ma and Joshua Mason, University of Illinois,
Urbana-Champaign; Damian Menscher, Google; Chad Seaman, Akamai; Nick Sullivan, Cloudflare; Kurt
Thomas, Google; Yi Zhou, University of Illinois, Urbana-Champaign
MPI: Multiple Perspective Attack Investigation with Semantics Aware Execution Partitioning. . . . . . . . . . 1111
Shiqing Ma, Purdue University; Juan Zhai, Nanjing University; Fei Wang, Purdue University; Kyu Hyung Lee,
University of Georgia; Xiangyu Zhang and Dongyan Xu, Purdue University
Detecting Android Root Exploits by Learning from Root Providers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1129
Ioannis Gasparis, Zhiyun Qian, Chengyu Song, and Srikanth V. Krishnamurthy, University of California,
Riverside
Hardware Security
USB Snooping Made Easy: Crosstalk Leakage Attacks on USB Hubs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1145
Yang Su, Auto-ID Lab, The School of Computer Science, The University of Adelaide; Daniel Genkin, University
of Pennsylvania and University of Maryland; Damith Ranasinghe, Auto-ID Lab, The School of Computer
Science, The University of Adelaide; Yuval Yarom, The University of Adelaide and Data61, CSIRO
Reverse Engineering x86 Processor Microcode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1163
Philipp Koppe, Benjamin Kollenda, Marc Fyrbiak, Christian Kison, Robert Gawlik, Christof Paar, and Thorsten
Holz, Ruhr-University Bochum
See No Evil, Hear No Evil, Feel No Evil, Print No Evil? Malicious Fill Patterns Detection in
Additive Manufacturing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1181
Christian Bayens, Georgia Institute of Technology; Tuan Le and Luis Garcia, Rutgers University; Raheem Beyah,
Georgia Institute of Technology; Mehdi Javanmard and Saman Zonouz, Rutgers University
Crypto Deployment
A Longitudinal, End-to-End View of the DNSSEC Ecosystem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1307
Taejoong Chung, Northeastern University; Roland van Rijswijk-Deij, University of Twente and SURFnet bv;
Balakrishnan Chandrasekaran, TU Berlin; David Choffnes, Northeastern University; Dave Levin, University
of Maryland; Bruce M. Maggs, Duke University and Akamai Technologies; Alan Mislove and Christo Wilson,
Northeastern University
Measuring HTTPS Adoption on the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1323
Adrienne Porter Felt, Google; Richard Barnes, Cisco; April King, Mozilla; Chris Palmer, Chris Bentzel, and
Parisa Tabriz, Google
“I Have No Idea What I’m Doing” - On the Usability of Deploying HTTPS. . . . . . . . . . . . . . . . . . . . . . . . . . 1339
Katharina Krombholz, Wilfried Mayer, Martin Schmiedecker, and Edgar Weippl, SBA Research
Blockchains
SmartPool: Practical Decentralized Pooled Mining. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1409
Loi Luu, National University of Singapore; Yaron Velner, The Hebrew University of Jerusalem; Jason Teutsch,
TrueBit Foundation; Prateek Saxena, National University of Singapore
REM: Resource-Efficient Mining for Blockchains. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1427
Fan Zhang, Ittay Eyal, and Robert Escriva, Cornell University; Ari Juels, Cornell Tech; Robbert van Renesse,
Cornell University
Databases
Ensuring Authorized Updates in Multi-user Database-Backed Applications . . . . . . . . . . . . . . . . . . . . . . . . . 1445
Kevin Eykholt, Atul Prakash, and Barzan Mozafari, University of Michigan Ann Arbor
Qapla: Policy compliance for database-backed systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1463
Aastha Mehta and Eslam Elnikety, Max Planck Institute for Software Systems (MPI-SWS); Katura Harvey,
University of Maryland, College Park and Max Planck Institute for Software Systems (MPI-SWS); Deepak Garg
and Peter Druschel, Max Planck Institute for Software Systems (MPI-SWS)
How Double-Fetch Situations turn into Double-Fetch Vulnerabilities:
A Study of Double Fetches in the Linux Kernel

[Figure: user space vs. kernel space memory layout, and an overview of the two-phase Coccinelle-based analysis. Phase 1 (Basic Pattern): source files are processed by the Coccinelle matching engine, yielding candidate files for manual analysis. Phase 2 (Refined Pattern): Rule 0: basic pattern; Rule 1: no pointer change; Rule 2: pointer aliasing; Rule 3: explicit type conversion; Rule 4: combination of element fetch and pointer fetch; Rule 5: loop involvement. The output is a set of double-fetch files annotated with trigger & consequence, context information, bug details, and categorization.]

…the kernel code executes and where its internal data is … bug in Linux requires a first fetch that copies the data, usually followed by a first check or use of the copied data, then a second fetch that copies the same data again, and a second use of the same data. Although the double fetch can be located by matching the patterns of fetch operations, the use of the fetched data varies a lot. For example, in addition to being used for validation, the first …
4 Evaluation

In this section, we present the evaluation of our study, which includes two parts: the statistics of the manual analysis, and the results of the refined approach when applied to three open source kernels: Linux, Android, and FreeBSD. We obtained the most up-to-date versions available at the time of the analysis.

4.1 Statistics and Analysis

In Linux 4.5, there are 52,881 files in total and 39,906 of them are source files (with a file extension of .c or .h), which are our analysis targets (other files are ignored). 17,532 source files belong to drivers (44%). After the basic pattern matching of the source files and the manual inspection to remove false positives, we obtained 90 double-fetch candidate files for further inspection. We categorized the candidates into the three double-fetch scenarios Size Checking, Type Selection, and Shallow Copy. They are the most common cases of how a double fetch occurs while user space data is copied to the kernel space and how the data is then used in the kernel. We have discussed these scenarios in detail with real double-fetch bug examples in the previous section. As shown in Table 1, of the 90 candidates we found, 30 were related to the size checking scenario, 11 were related to the type selection scenario, and 31 were related to the shallow copy scenario, accounting for 33%, 12%, and 34% respectively. 18 candidates did not fit into one of the three scenarios.

Furthermore, 57 out of the 90 candidates were part of Linux drivers and among them, 22 were size checking related, 9 were type selection related, and 19 were shallow copy related.

Most importantly, we found five previously unknown double-fetch bugs, which include four size checking scenarios and one shallow copy scenario which also belongs to the size checking scenario. Three of them are exploitable vulnerabilities. The five bugs have been reported, and they all have been confirmed by the developers and have meanwhile been fixed. From the statistical result, we can observe the following:

1. 57 out of 90 (63%) of the candidates were driver related, and 22 out of 30 (73%) of the size checking cases, 9 out of 11 (82%) of the type selection cases, and 19 out of 31 (61%) of the shallow copy cases occur in drivers.

2. 4 out of 5 (80%) of the double-fetch bugs we found are inside drivers and belong to the size checking category.

Overall, this leads to the conclusion that most double fetches do not cause double-fetch bugs and that double fetches are more likely to occur in drivers. However, as soon as a double fetch is due to size checking, developers have to be careful: Four out of 22 size checking scenarios in drivers turned out to be double-fetch bugs.

4.2 Analysis of Three Open Source Kernels

Based on the double fetch basic pattern matching and manual analysis, we refined our double fetch pattern and developed a new double-fetch bug detection analysis based on the Coccinelle engine. In order to fully evaluate our approach, we analyzed three popular open source kernels, namely Linux, Android, and FreeBSD. Results are shown in Table 2.
fied 3 out of the above discussed 6 double-fetch bugs (the 5.3 Double-Fetch Bug Prevention
other 3 bugs we found are in files that were not present
Even though we provide an analysis to detect double-
in Linux 3.5.0).
fetch bugs, developers must still be aware of how they
occur and preemptively prevent double-fetch bugs. Hu-
It is likely that Bochspwn could not find these bugs man mistakes are to be expected in driver development
because they were present in drivers. Indeed, dynamic when dealing with variable messages leading to new
approaches cannot support drivers without correspond- double-fetch situations.
ing hardware or simulations of hardware. Bochspwn re- (1) Don’t Copy the Header Twice. Double-fetch situa-
ported an instruction coverage of only 28% for the ker- tions can be completely avoided if the second fetch only
nel, while our approach statically analyses the complete copies the message body and not the complete message
source code. which copies the header a second time. For example, the
double-fetch vulnerability in Android 6.0.1 (Linux 3.18)
As for efficiency, our approach takes only a few min- is resolved in Linux 4.1 by only copying the body in the
utes to conduct a path-sensitive exploration of the source second fetch.
code of the whole Linux kernel. In contrast, Bochspwn (2) Use the Same Value. A double-fetch situation turns
introduces a severe runtime overhead. For instance, their into a bug when there is a use of the “same” data from
simulator needs 15 hours to boot the Windows kernel. both fetch operations because a (malicious) user can
change the data between the two fetches. If develop-
While it only took a few days to investigate the 90 ers only use the data from one of the fetches, problems
double-fetch situations, Jurczyk and Coldwind did not are avoided. According to our investigation, most of the
report the time they needed to investigate the 200KB of double-fetch situations are benign because they only use
double fetch logs generated by their simulator. the first fetched value.
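Prevention pattern (1) above ("don't copy the header twice") can be sketched in user space as follows. This is a minimal illustration, not code from any of the analyzed kernels: memcpy stands in for copy_from_user, and the message layout, constants, and function names are our own assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variable-length message: a header carrying the body
 * length, followed by the body. `ubuf` plays the role of the
 * attacker-controlled user-space buffer. */
struct msg_header { uint32_t body_len; };
enum { MAX_BODY = 64 };
struct msg { struct msg_header hdr; char body[MAX_BODY]; };

static int fetch_message(const char *ubuf, struct msg *out)
{
    /* First fetch: header only, then validate it. */
    memcpy(&out->hdr, ubuf, sizeof(out->hdr));
    if (out->hdr.body_len > MAX_BODY)
        return -1;

    /* Second fetch: body only. Re-copying the whole message here
     * would fetch the header a second time, re-opening the window
     * in which a concurrent user thread can change body_len. */
    memcpy(out->body, ubuf + sizeof(out->hdr), out->hdr.body_len);
    return 0;
}
```

Because the validated header is fetched exactly once, all later decisions are based on the same value, which is also pattern (2).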
Jun Xu†, Dongliang Mu‡†, Xinyu Xing†, Peng Liu†, Ping Chen†, and Bing Mao‡

†College of Information Sciences and Technology, The Pennsylvania State University
‡State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University

{jxx13,dzm77,xxing,pliu,pzc10}@ist.psu.edu, {maobing}@nju.edu.cn
Abstract

While a core dump carries a large amount of information, it barely serves as an informative debugging aid in locating software faults because it carries information that indicates only a partial chronology of how the program reached a crash site. Recently, this situation has been significantly improved. With the emergence of hardware-assisted processor tracing, software developers and security analysts can trace program execution and integrate it into a core dump. In comparison with an ordinary core dump, the new post-crash artifact provides software developers and security analysts with more clues as to a program crash. To use it for failure diagnosis, however, it still requires strenuous manual efforts.

In this work, we propose POMP, an automated tool to facilitate the analysis of post-crash artifacts. More specifically, POMP introduces a new reverse execution mechanism to construct the data flow that a program followed prior to its crash. By using the data flow, POMP then performs backward taint analysis and highlights those program statements that actually contribute to the crash.

To demonstrate its effectiveness in pinpointing program statements truly pertaining to a program crash, we have implemented POMP for Linux on the x86-32 platform, and tested it against various program crashes resulting from 31 distinct real-world security vulnerabilities. We show that POMP can accurately and efficiently pinpoint program statements that truly pertain to the crashes, making failure diagnosis significantly more convenient.

1 Introduction

Despite the best efforts of software developers, software inevitably contains defects. When they are triggered, a program typically crashes or otherwise terminates abnormally. To track down the root cause of a software crash, software developers and security analysts need to identify those program statements pertaining to the crash, analyze these statements and eventually figure out why a bad value (such as an invalid pointer) was passed to the crash site. In general, this procedure can be significantly facilitated (and even automated) if both control and data flows are given. As such, the research on postmortem program analysis primarily focuses on finding out the control and data flows of crashing programs. Of all techniques on postmortem program analysis, record-and-replay (e.g., [10, 12, 14]) and core dump analysis (e.g., [16, 26, 36]) are most common.

Record-and-replay is a technique that typically instruments a program so that one can automatically log non-deterministic events (i.e., the input to a program as well as the memory access interleavings of the threads) and later utilize the log to replay the program deterministically. In theory, this technique would significantly benefit root cause diagnosis of crashing programs because developers and security analysts can fully reconstruct the control and data flows prior to a crash. In practice, however, it is not widely adopted due to the requirement of program instrumentation and the high overhead it introduces during normal operations.

In comparison with record-and-replay, core dump analysis is a lightweight technique for the diagnosis of program crashes. It does not require program instrumentation, nor rely upon the log of program execution. Rather, it facilitates program failure diagnosis by using more generic information, i.e., the core dump that an operating system automatically captures every time a process has crashed. However, a core dump provides only a snapshot of the failure, from which core dump analysis techniques can infer only partial control and data flows pertaining to program crashes. Presumably as such, they have not been treated as the first choice for software debugging.

Recently, the advance in hardware-assisted processor tracing has significantly ameliorated this situation. With the emergence of Intel PT [6] – a brand new hardware feature in Intel CPUs – software developers and security analysts can trace instructions executed and save them in a
A5:  lea eax, [ebp-0x10]
A6:  push eax           ;argument of &var
A7:  call child
A8:  push ebp
A9:  mov ebp, esp
A10: mov eax, [ebp+0x8]
A11: mov [eax], 0x1     ;a[0]=1
A12: mov eax, [ebp+0x8]
A13: add eax, 0x4
A14: mov [eax], 0x2     ;a[1]=2
A15: mov eax, 0x0
A16: pop ebp
A17: ret
A18: add esp, 0x4
A19: mov eax, [ebp-0xc]
A20: call eax           ;crash site

Figure 1: A post-crash artifact along with the memory footprints recovered by reversely executing the trace enclosed in the artifact. Note that, for simplicity, all the memory addresses and the values in registers are trimmed and represented with two hex digits. Note that A18 and test indicate the addresses at which the instruction and function are stored.
flows, this is not sufficient for completing program failure diagnosis. Again, take for example the execution trace shown in Figure 1. When backward analysis passes through instruction A15 and reaches instruction A14, through forward analysis, a security analyst can quickly discover that the value in register eax after the execution of A14 is dependent upon both instructions A12 and A13. As a result, an instinctive reaction is to retrieve the value stored in the memory region specified by [ebp+0x8] shown in instruction A12. However, the memory indicated by [ebp+0x8] and [eax] shown in instruction A14 might be aliases of each other. Without an approach to resolve memory aliases, one cannot determine if the definition in instruction A14 interrupts the data flow from instructions A12 and A13. Thus, program failure diagnosis has to discontinue without an outcome.

3 Overview

In this section, we first describe the objective of this research. Then, we discuss our design principle, followed by the basic idea of how POMP performs postmortem program analysis.

3.1 Objective

The goal of software failure diagnosis is to identify the root cause of a failure from the instructions enclosed in an execution trace. Given a post-crash artifact containing an execution trace carrying a large amount of instructions that a program has executed prior to its crash, however, any instruction in the trace can be potentially attributable to the crash. As we have shown in the section above, it is tedious and tough for software developers (or security analysts) to dig through the trace and pinpoint the root cause of a program crash. Therefore, the objective of this work is to identify only those instructions that truly contribute to the crash. In other words, given a post-crash artifact, our goal is to highlight and present to software developers (or security analysts) the minimum set of instructions that contribute to a program crash. Here, our hypothesis is that the achievement of this goal can significantly reduce the manual efforts of finding out the root cause of a software failure.

3.2 Design Principle

To accomplish the aforementioned objective, we design POMP to perform postmortem analysis on binaries – though in principle this can be done at the source code level – in that this design principle can provide software developers and security analysts with the following benefits. Without having POMP tied to a set of programs written in a particular programming language, our design principle first allows software developers to employ a single tool to analyze the crashes of programs written in various languages (e.g., assembly code, C/C++ or JavaScript). Second, our design choice eliminates the complication introduced by the translation between source code and binaries, in that a post-crash artifact carries an execution trace in binaries which can be directly consumed by analysis at the binary level. Third, with the choice of our design, POMP can be generally applied to software failure triage or categorization in which a post-crash artifact is
Figure 2: A use-define chain before and after appending new relations derived from instruction A18. Each node is partitioned into three cells. From left to right, the cells carry instruction ID, definition (or use) specification and the value of the variable. Note that symbol ?? indicates the value of that variable is unknown.

ing data flow and thus pinpoint instructions that truly contribute to a crash. In our work, POMP automates this procedure by using backward taint analysis. To illustrate this, we continue the aforementioned example and take the memory footprints shown in Figure 1. As is described earlier, in this case, the bad value in register eax was passed through instruction A19, which copies the bad value from memory [ebp-0xC] to register eax. By examining the memory footprints restored, POMP can easily find out that the memory indicated by [ebp-0xC] shares the same address with that indicated by [eax] in instruction A14. This implies that the bad value is actually propagated from instruction A14. As such, POMP highlights instructions A19 and A14, and deems they are truly attributable to the crash. We elaborate on the backward taint analysis in Section 4.

4 Design

Looking closely into the example above, we refine an algorithm to perform reverse execution and memory footprint recovery. In the following, we elaborate on this algorithm, followed by the design detail of our backward taint analysis.

4.1 Reverse Execution

Here, we describe the algorithm that POMP follows when performing reverse execution. In particular, our algorithm follows two steps – use-define chain construction and memory alias verification. In the following, we elaborate on them in turn.

In the first step, the algorithm parses an execution trace reversely. For each instruction in the trace, it extracts uses and definitions of the corresponding variables based on the semantics of that instruction and then links them to the use-define chain previously constructed. For example, given an initial use-define chain derived from instructions A20 and A19 shown in Figure 1, POMP extracts the use and definition from instruction A18 and links them to the head of the chain (see Figure 2).

As we can observe from the figure, a definition (or use) includes three elements – instruction ID, use (or definition) specification and the value of the variable. In addition, we can observe that a use-define relation includes not only the relations between operands but also those between operands and the base and index registers enclosed (see the use and definition for instruction A19 shown in Figure 2).

Every time it appends a use (or definition), our algorithm examines the reachability of the corresponding variable and attempts to resolve the variables on the chain. More specifically, it checks each use and definition on the chain and determines if the value of the corresponding variable can be resolved. By resolving, we mean the variable satisfies one of the following conditions – ① the definition (or use) of that variable could reach the end of the chain without any other intervening definitions; ② it could reach its consecutive use in which the value of the corresponding variable is available; ③ a corresponding resolved definition at the front can reach the use of that variable; ④ the value of that variable can be directly derived from the semantics of that instruction (e.g., variable eax is equal to 0x00 for instruction mov eax, 0x00).

To illustrate this, we take the example shown in Figure 2. After our algorithm concatenates definition def:esp=esp+4 to the chain, where most variables have already been resolved, reachability examination indicates this definition can reach the end of the chain. Thus, the algorithm retrieves the value from the post-crash artifact and assigns it to esp (see the value in the circle). After this assignment, our algorithm further propagates this updated definition through the chain, and attempts to use the update to resolve variables whose values have not yet been assigned. In this case, none of the definitions and uses on the chain can benefit from this propagation. After the completion of this propagation, our algorithm further appends use use:esp and repeats this process. Slightly different from the process for definition def:esp=esp+4, for this use, variable esp is not resolvable through the aforementioned reachability examination. Therefore, our algorithm derives the value of esp from the semantics of instruction A18
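The first resolving condition described above (a definition whose variable is not redefined between it and the end of the chain can take the value observed in the post-crash artifact) can be sketched as follows. This is our own minimal illustration of the idea, not POMP's implementation; the node layout and function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum kind { USE, DEF };

/* One node on the use-define chain. `next` points toward the end of
 * the chain, i.e., toward the crash site. */
struct node {
    enum kind kind;
    const char *insn;   /* instruction ID, e.g. "A18" */
    const char *var;    /* variable name, e.g. "esp"  */
    long value;
    int resolved;
    struct node *next;
};

/* True if no later definition on the chain clobbers n's variable,
 * so n's value at the crash site equals its value at n. */
static int reaches_end_unclobbered(const struct node *n)
{
    for (const struct node *p = n->next; p; p = p->next)
        if (p->kind == DEF && strcmp(p->var, n->var) == 0)
            return 0;
    return 1;
}

/* Resolve a definition from the value recorded in the post-crash
 * artifact when the reachability check above succeeds. */
static void resolve_from_artifact(struct node *n, long artifact_value)
{
    if (n->kind == DEF && !n->resolved && reaches_end_unclobbered(n)) {
        n->value = artifact_value;
        n->resolved = 1;
    }
}
```

In the Figure 2 example, the definition def:esp=esp+4 from A18 reaches the end of the chain unclobbered, so it would be assigned the esp value found in the artifact.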
sis, the walk-through discovers that a register carries an invalid address and that invalid value should have the crashing program terminate at a site ahead of its actual crash site.
Table 2: The list of program crashes resulting from various vulnerabilities. CVE-ID specifies the ID of the CVEs. Trace length indicates the lines of instructions that POMP reversely executed. Size of mem shows the size of memory used by the crashed program (with code sections excluded). # of taint and Ground truth describe the lines of instructions automatically pinpointed and manually identified, respectively. Mem addr unknown illustrates the number of memory locations whose addresses are unresolvable.

crash. We took our manual analysis as ground truth and compared it with the output of POMP. In this way, we validated the effectiveness of POMP in facilitating failure diagnosis. More specifically, we compared the instructions identified manually with those pinpointed by POMP. The focuses of this comparison include ① examining whether the root cause of that crash is enclosed in the instruction set POMP automatically identified, ② investigating whether the output of POMP covers the minimum instruction set that we manually tracked down, and ③ exploring if POMP could significantly prune the execution trace that software developers (or security analysts) have to manually examine.

In order to evaluate the efficiency of POMP, we recorded the time it took to spot the instructions that truly pertain to each program crash. For each test case, we also logged the instructions that POMP reversely executed, as this allows us to study the relation between efficiency and the amount of instructions reversely executed.

Considering that pinpointing a root cause does not require reversely executing the entire trace recorded by Intel PT, it is worth noting that we selected and utilized only a partial execution trace for evaluation. In this work, our selection strategy follows an iterative procedure in which we first introduced the instructions of the crashing function to reverse execution. If this partial trace was insufficient for spotting a root cause, we traced back functions previously invoked and included instructions function-by-function until the root cause could be covered by POMP.

6.3 Experimental Results

We show our experimental results in Table 2. Except for test cases 0verkill and aireplay-ng, we observe that every root cause is included in the set of instructions that POMP pinpointed. Through the comparison mentioned above, we also observe each set encloses the corresponding instructions we manually identified (i.e., ground truth). These observations indicate that POMP is effective
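The iterative partial-trace selection described above can be sketched as a loop that grows the analyzed suffix of the trace one calling function at a time. This is a schematic of the procedure only; the per-function instruction counts and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Functions on the call path, most recent first: index 0 is the
 * crashing function. The instruction counts are made up. */
static const size_t insn_count[] = { 120, 300, 80, 500 };
enum { NFUNCS = 4 };

/* Grow the analyzed trace function-by-function until the function
 * holding the root cause (depth 0 = crashing function) is inside
 * the window; return how many instructions had to be fed to
 * reverse execution. */
static size_t trace_len_needed(size_t root_cause_depth)
{
    size_t insns = 0;
    for (size_t i = 0; i < NFUNCS; i++) {
        insns += insn_count[i];   /* include one more caller */
        if (root_cause_depth <= i)
            break;                /* root cause now covered */
    }
    return insns;
}
```

The point of the strategy is that most root causes sit close to the crash, so only a short suffix of the Intel PT trace is usually needed.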
[2] Libelf - free software directory. https://directory.fsf.org/wiki/Libelf.
[3] Linux programmer's manual. http://man7.org/linux/man-pages/man7/signal.7.html.
[4] Offensive security exploit database archive. https://www.exploit-db.com/.
[5] The Z3 theorem prover. https://github.com/Z3Prover/z3.
[6] Processor tracing. https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing, 2013.
[7] T. Akgul and V. J. Mooney III. Instruction-level reverse execution for debugging. In Proceedings of the 2002 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, 2002.
[8] T. Akgul and V. J. Mooney III. Assembly instruction level reverse execution for debugging. ACM Trans. Softw. Eng. Methodol., 2004.
[9] T. Akgul, V. J. Mooney III, and S. Pande. A fast assembly level reverse execution method via dynamic slicing. In Proceedings of the 26th International Conference on Software Engineering, 2004.
[10] S. Artzi, S. Kim, and M. D. Ernst. ReCrash: Making software failures reproducible by preserving object states. In Proceedings of the 22nd European Conference on Object-Oriented Programming, 2008.
[11] G. Balakrishnan and T. Reps. Analyzing memory accesses in x86 executables. In CC, pages 5–23, 2004.
[12] J. Bell, N. Sarda, and G. Kaiser. Chronicler: Lightweight recording to reproduce field failures. In Proceedings of the 2013 International Conference on Software Engineering, 2013.
[13] B. Biswas and R. Mall. Reverse execution of programs. SIGPLAN Not., 1999.
[14] Y. Cao, H. Zhang, and S. Ding. SymCrash: Selective recording for reproducing crashes. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, 2014.
[15] H. Cleve and A. Zeller. Locating causes of program failures. In Proceedings of the 27th International Conference on Software Engineering, 2005.
[16] W. Cui, M. Peinado, S. K. Cha, Y. Fratantonio, and V. P. Kemerlis. RETracer: Triaging crashes by reverse execution from partial memory dumps. In Proceedings of the 38th International Conference on Software Engineering, 2016.
[18] K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols, D. Grant, G. Loihle, and G. Hunt. Debugging in the (very) large: Ten years of implementation and experience. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009.
[19] W. Gu, Z. Kalbarczyk, R. K. Iyer, Z.-Y. Yang, et al. Characterization of Linux kernel behavior under errors. In DSN, volume 3, pages 22–25, 2003.
[20] S. Hangal and M. S. Lam. Tracking down software bugs using automatic anomaly detection. In Proceedings of the 24th International Conference on Software Engineering, 2002.
[21] C. Hou, G. Vulov, D. Quinlan, D. Jefferson, R. Fujimoto, and R. Vuduc. A new method for program inversion. In Proceedings of the 21st International Conference on Compiler Construction, 2012.
[22] B. Kasikci, B. Schubert, C. Pereira, G. Pokam, and G. Candea. Failure sketching: A technique for automated root cause diagnosis of in-production failures. In Proceedings of the 25th Symposium on Operating Systems Principles, 2015.
[23] B. Liblit and A. Aiken. Building a better backtrace: Techniques for postmortem program analysis. Technical report, 2002.
[24] R. Manevich, M. Sridharan, S. Adams, M. Das, and Z. Yang. PSE: Explaining program failures via postmortem static analysis. In Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2004.
[25] D. Molnar, X. C. Li, and D. A. Wagner. Dynamic test generation to find integer bugs in x86 binary Linux programs. In Proceedings of the 18th USENIX Security Symposium, 2009.
[26] P. Ohmann. Making your crashes work for you (doctoral symposium). In Proceedings of the 2015 International Symposium on Software Testing and Analysis, 2015.
[27] F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou. Rx: Treating bugs as allergies—a safe method to survive software failures. In ACM SIGOPS Operating Systems Review, volume 39, pages 235–248. ACM, 2005.
[28] M. Renieris and S. P. Reiss. Fault localization with nearest neighbor queries. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, 2003.
[29] S. K. Sahoo, J. Criswell, C. Geigle, and V. Adve. Using likely invariants for automated software fault localization.
three requirements. Firstly, the Environment must be iso-
3.1 Transparent Malware Analysis on x86

Ether [20] leverages hardware virtualization to build a malware analysis system and achieves high transparency. Spider [19] is also based on hardware virtualization, and it focuses on both applicability and transparency while using memory page instrumentation to gain higher efficiency. Since hardware virtualization has transparency issues, these systems are naturally not transparent. LO-PHI [41] leverages additional hardware sensors to monitor the disk operation and periodically poll memory snapshots, and it achieves higher transparency at the cost of an incomplete view of system states.

MalT [54] increases the transparency by involving System Management Mode (SMM), a special CPU mode in the x86 architecture. It leverages the PMU to monitor the program execution and switch into SMM for analysis. Compared with MalT, NINJA improves in the following aspects: 1) The PMU registers on MalT are accessible by privileged malware, which breaks the transparency by

3.2 Dynamic Analysis Tools on ARM

Emulation-based systems. DroidScope [52] rebuilds the semantic information of both the Android OS and the Dalvik virtual machine based on QEMU. CopperDroid [45] is a VMI-based analysis tool that automatically reconstructs the behavior of Android malware including inter-process communication (IPC) and remote procedure call interaction. DroidScribe [18] uses CopperDroid [45] to collect behavior profiles of Android malware, and automatically classifies them into different families. Since the emulator leaves footprints, these systems are naturally not transparent.

Hardware virtualization. Xen on ARM [50] migrates the hardware-virtualization-based hypervisor Xen to the ARM architecture and makes analysis based on hardware virtualization feasible on mobile devices. KVM/ARM [17] uses standard Linux components to improve the performance of the hypervisor. Although the hardware-virtualization-based solution is considered to
... ...
CodeItem In GIC, each interrupt is assigned to Group 0 (secure in-
dex_file_ const/4 v0, 0
... type 0 const/4 v1, 1 terrupts) or Group 1 (non-secure interrupts) by a group
type 1
...
add-int, v0, v0, v1
return v0
of 32-bit GICD IGROUPR registers. Each bit in each
type n
Class
...
GICD IGROUPR register represents the group information
...
method 0
of a single interrupt, and value 0 indicates Group 0 while
dex_cache_
dex_type_idx_ method 1
...
value 1 means Group 1. For a given interrupt ID n,
...
method n
the index of the corresponding GICD IGROUPR register
...
is given by n / 32, and the corresponding bit in the reg-
ArtMethod code 0
code 1 ister is n mod 32. Moreover, the GIC maintains a target
... ...
declaring_class_ code n process list in GICD ITARGETSR registers for each inter-
dex_method_index_
... ... rupt. By default, the ATF configures the secure interrupts
to be handled in Cortex-A57 core 0.
Figure 3: Semantics in the Function ExecuteGotoImpl. As mentioned in Section 4.1, N INJA uses secure PMI
and NMI to trigger a reliable switch. As the secure inter-
rupts are handled in Cortex-A57 core 0, we run the tar-
we use function ExecuteGotoImpl as an example to get application on the same core to reduce the overhead
explain our approach. In Android, the bytecode of a Java of the communication between cores. In Juno board,
method is organized as a 16-bit array, and ART passes the interrupt ID for PMI in Cortex-A57 core 0 is 34.
the bytecode array to the function ExecuteGotoImpl Thus, we clear the bit 2 of the register GICD IGROUPR1
together with the current execution status such as the (34 mod 32 = 2, 34 / 32 = 1) to mark the interrupt 34 as
current thread, caller and callee methods, and the call secure. Similarly, we configure the interrupt 195, which
frame stack that stores the call stack and parameters. is triggered by pressing a GPIO button, to be secure by
Then, the function ExecuteGotoImpl interprets the bytecode in the array following the control flows, and a local variable dex_pc indicates the index of the currently interpreted bytecode in the array. By manually checking the decompiled result of the function, we find that the pointer to the bytecode array is stored in register X27 while the variable dex_pc is kept in register X21, and the call frame stack is maintained in register X19. Figure 3 shows the semantics in the function ExecuteGotoImpl. By combining registers X21 and X27, we can locate the currently executing bytecode. Moreover, a single frame in the call frame stack is represented by an instance of StackFrame, with the variable link_ pointing to the previous frame. The variable method_ indicates the currently executing Java method, which is represented by an instance of ArtMethod. Next, we fetch the declaring class of the Java method following the pointer declaring_class_. The pointer dex_cache_ in the declaring class points to an instance of DexCache, which is used to maintain a cache for the DEX file, and the variable dex_file_ in the DexCache finally points to the instance of DexFile, which contains all information of a DEX file. Detailed descriptions, such as the name of the method, can be fetched via the index of the method (i.e., dex_method_index_) in the method array maintained by the DexFile. Note that both the ExecuteGotoImpl and ExecuteSwitchImpl functions have four different

clearing the bit 3 of the register GICD_IGROUPR6.

5.3 The Trace Subsystem

5.3.1 Instruction Tracing

NINJA uses the ETM embedded in the CPU to trace the executed instructions. Figure 4 shows the ETM and related components in the Juno board. The funnels shown in the figure are used to filter the output of the ETM, and each of them is controlled by a group of CoreSight Trace Funnel (CSTF) registers [9]. The filtered result is then output to the Embedded Trace FIFO (ETF), which is controlled by Trace Memory Controller (TMC) registers [10].

In our case, as we only need the trace result from core 0 in the Cortex-A57 cluster, we set the EnS0 bit in the CSTF Control Register of funnel 0 and funnel 2, and clear the other slave bits. To enable the ETF, we set the TraceCaptEn bit of the TMC CTL register.

The ETM is controlled by a group of trace registers. As the target application is always executed in non-secure EL0 or non-secure EL1, we make the ETM trace only these states by setting all EXLEVEL_S bits and clearing all EXLEVEL_NS bits of the TRCVICTLR register. Then, NINJA sets the EN bit of the TRCPRGCTLR register to start the instruction trace. To stop the trace, we first clear the EN bit of the TRCPRGCTLR register to disable
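The pointer chain just described (StackFrame to ArtMethod to declaring class to DexCache to DexFile) can be sketched as a plain data-structure walk. This is an illustrative model only: the class and field names mirror the ones named in the text, but the real ART objects are raw memory that NINJA reads by following the registers, not Python objects.

```python
# Illustrative model of the ART pointer chain described in the text:
# StackFrame -> ArtMethod -> declaring class -> DexCache -> DexFile.
# This is NOT ART's real memory layout; it only mirrors the traversal.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DexFile:
    method_names: list            # indexed by dex_method_index_

@dataclass
class DexCache:
    dex_file: DexFile

@dataclass
class DeclaringClass:
    dex_cache: DexCache

@dataclass
class ArtMethod:
    declaring_class: DeclaringClass
    dex_method_index: int

@dataclass
class StackFrame:
    link: Optional["StackFrame"]  # previous frame (None for bottom frame)
    method: ArtMethod

def current_method_name(frame: StackFrame) -> str:
    """Resolve the currently executing Java method's name by following
    the same pointer chain NINJA walks from the call frame stack."""
    m = frame.method
    dex = m.declaring_class.dex_cache.dex_file
    return dex.method_names[m.dex_method_index]

def current_bytecode(x27_bytecode_array, x21_dex_pc):
    """Registers X27 (bytecode array) and X21 (dex_pc) together locate
    the currently interpreted bytecode."""
    return x27_bytecode_array[x21_dex_pc]
```

In this model, combining the two "registers" is just an index into the bytecode array, and the method name falls out of the final DexFile lookup, exactly as in the traversal above.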
implement the breakpoint based on the instruction step-

[Figure 6: Protect the PMCR_EL0 Register via Traps. Residue from the figure: the trapped instruction MRS X0, PMCR_EL0 in the normal domain; the instructions MOV X1, #31, MOV X0, #0x41013000, and AND X0, X1, X1 LSR #10; and the trap / exception return path between the domains.]

lating the reading to the PMCR_EL0 register and returning the default value of the register. Before the MRS instruction is executed, a trap is triggered to switch to the secure domain. NINJA then analyzes the instruction that triggered the trap and learns that the return value of PMCR_EL0 is stored to the general-purpose register X0. Thus, we put the default value 0x41013000 into the general-purpose register X0 and resume to the normal domain. Note that the PC register of the normal domain must also be modified to skip the MRS instruction. We protect both the registers that we modify (e.g., PMCR_EL0, PMCNTENSET_EL0) and the registers modified by the hardware as a result of our usage (e.g., PMINTENCLR_EL1, PMOVSCLR_EL0).

The skid problem is a well-known problem [42, 49] and affects NINJA since the currently executing instruction is not the one that triggered the PMI when the PMI arrives at the processor. Thus, the DS may not step the execution of the processor exactly. Although the skid problem cannot be completely eliminated, the side effect of the skid does not affect our system significantly, and we provide a detailed analysis and evaluation in Section 7.5.
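The simulation step can be sketched as follows. The only architectural fact used is that in the AArch64 encoding of MRS the destination register Rt occupies bits [4:0] of the instruction word; the register-file layout and function name are our simplifications, not NINJA's actual EL3 handler.

```python
# Minimal sketch of the trap-handler logic described above: on a trapped
# MRS read of PMCR_EL0, place the default value in the destination
# register and skip the trapped instruction. The register file is a
# plain list; only the Rt field (bits [4:0] of an AArch64 MRS encoding)
# is decoded. This is a simplification, not NINJA's EL3 code.
PMCR_EL0_DEFAULT = 0x41013000

def handle_trapped_mrs(insn_word: int, regs: list, pc: int) -> int:
    rt = insn_word & 0x1F           # Rt = bits [4:0] in AArch64 MRS
    regs[rt] = PMCR_EL0_DEFAULT     # return the spoofed default value
    return pc + 4                   # skip the 4-byte MRS instruction
```

Returning pc + 4 models the requirement, noted in the text, that the normal domain's PC be advanced past the MRS instruction before resuming.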
6.1.2 Memory Mapped Interface
[Table: Comparison between NINJA and related systems (TaintDroid [22], TaintART [44], DroidTrace [56], CrowDroid [15], DroidScope [52], CopperDroid [45], NDroid [38]) on three criteria (no VM/emulator, no ptrace/strace, no modification to Android; the original table marks which systems satisfy each) and on SLOC of TCB (K): 27; 56,355; 56,355; 12,723; 12,723; 489; 489; 489.]
storage usage of the ETM, we can use real-time continuous export via either a dedicated trace port capable of sustaining the bandwidth of the trace or an existing interface on the SoC (e.g., a USB or other high-speed port) [11].

7.2 Tracing and Debugging Samples

We pick up two samples, ActivityLifecycle1 and PrivateDataLeak3, from the DroidBench [21] project and use NINJA to analyze them. We choose these two specific samples since they exhibit representative malicious behavior like leaking sensitive information via local file, text message, and network connection.

Analyzing ActivityLifecycle1. To get an overview of the sample, we first enable the Android API tracing feature to inspect the APIs that read sensitive information (source) and the APIs that leak information (sink), and find a suspicious API call sequence. In the sequence, the method TelephonyManager.getDeviceId and the method HttpURLConnection.connect are invoked in turn, which indicates a potential flow that sends the IMEI to a remote server. As we know the network packets are sent via the system call sys_sendto, we attempt to intercept the system call and analyze its parameters. In Android, the system calls are invoked by corresponding functions in libc.so, and we get the address of the function for the system call sys_sendto by disassembling libc.so. Thus, we use NINJA to set a breakpoint at that address, and the second parameter of the system call, which is stored in register X1, shows that the sample sends a 181-byte buffer to a remote server. Then, we output the memory content of the buffer and find that it is an HTTP GET request to host www.google.de with path /search?q=353626078711780. Note that the digits in the path are exactly the IMEI of the device.

Analyzing PrivateDataLeak3. Similar to the previous analysis, the Android API tracing helps us find a suspicious API call sequence consisting of the methods TelephonyManager.getDeviceId, Context.openFileOutput, and SmsManager.sendTextMessage. As Android uses the system calls sys_openat to open a file and sys_write to write a file, we set breakpoints at the addresses of these calls. Note that the second parameter of sys_openat represents the full path of the target file, and the second parameter of sys_write points to a buffer being written to a file. Thus, after the breakpoints are hit, we see that the sample writes the IMEI 353626078711780 to the file /data/data/de.ecspride/files/out.txt. The API SmsManager.sendTextMessage uses binder to achieve IPC with the lower-layer SmsService in the Android system, and the semantics of the IPC is described in CopperDroid [45]. By intercepting the system call sys_ioctl and following the semantics, we finally find the target of the text message ("+49") and the content of the message (353626078711780).

7.3 Transparency Experiments

7.3.1 Accessing System Instruction Interface

To evaluate the protection mechanism of the system instruction interface, we write an Android application that reads the PMCR_EL0 and PMCNTENSET_EL0 registers via the MRS instruction. The values of these two registers represent whether a performance counter is enabled. We first use the application to read the registers with NINJA disabled, and the result is shown in the upper rectangle of Figure 7a. The last bit of the PMCR_EL0 register and the value of the PMCNTENSET_EL0 register are 0, which means that all the performance counters are disabled. Then we press a GPIO button to enable the Android API tracing feature of NINJA and read the registers again. From the console output shown in Figure 7b, we see that the access to the registers is successfully trapped into EL3, and the output shows that the real values of the PMCR_EL0 and PMCNTENSET_EL0 registers are 0x41013011 and 0x20, respectively, which indicates that the counter PMEVCNTR5_EL0 is enabled. However, the lower rectangle in Figure 7a shows that the values of the registers fetched by the application remain unchanged.
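The register values reported above can be decoded mechanically: PMCR_EL0.E (bit 0) is the global counter enable, and bit n of PMCNTENSET_EL0 enables counter PMEVCNTRn_EL0. A small helper (ours, for illustration):

```python
def counters_enabled(pmcr_el0: int, pmcntenset_el0: int) -> list:
    """Return the enabled event-counter indices, per the ARMv8
    definitions: PMCR_EL0.E is bit 0, and bit n of PMCNTENSET_EL0
    enables counter PMEVCNTRn_EL0."""
    if pmcr_el0 & 1 == 0:       # global enable clear: nothing counts
        return []
    return [n for n in range(31) if (pmcntenset_el0 >> n) & 1]
```

Applied to the values trapped in EL3 (0x41013011 and 0x20), this returns [5], matching the observation that the counter PMEVCNTR5_EL0 is enabled.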
2.1 Cache attacks

Cache attacks [35, 7, 11, 21, 25, 29, 33, 34, 43] are a well-known class of side-channel attacks which seek to gain information about which memory locations are accessed by some victim program, and at what times. In an excellent survey, Ge et al. [4] group such attacks into three broad categories: PRIME+PROBE, FLUSH+RELOAD, and EVICT+TIME. Since EVICT+TIME is only capable of monitoring memory accesses at the program granularity (whether a given memory location was accessed during execution or not), in this paper we focus on PRIME+PROBE and FLUSH+RELOAD, which are much higher resolution and have received more attention in the literature. Cache attacks have been shown to be effective for successfully recovering AES [25], ElGamal [29], and RSA [43] keys, performing keylogging [8], and spying on messages encrypted with TLS [23].

Figure 1 outlines all of the attacks which we will consider. At a high level, each attack consists of a pre-attack portion, in which important architecture- or runtime-specific information is gathered; and then an active portion which uses that information to monitor memory accesses of a victim process. The active portion of existing state-of-the-art attacks itself consists of three phases: an "initialization" phase, a "waiting" phase, and a "measurement" phase. The initialization phase prepares the cache in some way; the waiting phase gives the victim process an opportunity to access the target address; and then the measurement phase performs a timed operation to determine whether the cache state has changed in a way that implies an access to the target address has taken place.

Specifics of the initialization and measurement phases vary by cache attack (discussed below). Some cache attack implementations make a tradeoff in the length of the waiting phase between accuracy and resolution: shorter waiting phases give more precise information about the timing of victim memory accesses, but may increase the relative overhead of the initialization and measurement phases, which may make it more likely that a victim access could be "missed" by occurring outside of one of the measured intervals. In our testing, not all cache attack implementations and targets exhibited obvious experimental tradeoffs for the waiting phase dura-

2.1.1 PRIME+PROBE

PRIME+PROBE [35, 21, 25, 34, 29] is the oldest and largest family of cache attacks, and also the most general. PRIME+PROBE does not rely on shared memory, unlike most other cache attacks (including FLUSH+RELOAD and its variants, described below). The original form of PRIME+PROBE [35, 34] targets the L1 cache, but recent work [21, 25, 29] extends it to target the L3 cache in Intel processors, enabling PRIME+PROBE to work across cores and without relying on hyperthreading (Simultaneous Multithreading [38]). Like all L3 cache attacks, L3 PRIME+PROBE can detect accesses to either instructions or data; in addition, L3 PRIME+PROBE trivially works across VMs.

PRIME+PROBE targets a single cache set, detecting accesses by any other program (or the operating system) to any address in that cache set. In its active portion's initialization phase (called "prime"), the attacker accesses enough cache lines from the cache set so as to completely fill the cache set with its own data. Later, in the measurement phase (called "probe"), the attacker reloads the same data it accessed previously, this time carefully observing how much time this operation took. If the victim did not access data in the targeted cache set, this operation will proceed quickly, finding its data in the cache. However, if the victim accessed data in the targeted cache set, the access will evict a portion of the attacker's primed data, causing the reload to be slower due to additional cache misses. Thus, a slow measurement phase implies the victim accessed data in the targeted cache set during the waiting phase. Note that this "probe" phase can also serve as the "prime" phase for the next repetition, if the monitoring is to continue.

Two different kinds of initial one-time setup are required for the pre-attack portion of this attack. The first is to establish a timing threshold above which the measurement phase is considered "slow" (i.e. likely suffering from extra cache misses). The second is to determine a set of addresses, called an "eviction set", which all map to the same (targeted) cache set (and which reside in distinct cache lines). Finding an eviction set is much easier for an attack targeting the L1 cache than for an attack targeting the L3 cache, due to the interaction between cache addressing and the virtual memory system, and also due to the "slicing" in Intel L3 caches (discussed further in Sections 2.2.1 and 2.2.2).
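The prime/wait/probe cycle can be illustrated with a toy model of a single W-way cache set under LRU replacement. The model and its cost constants are ours, purely for illustration; a real attack needs native code, an eviction set, and a cycle-accurate timer.

```python
# Toy model of PRIME+PROBE on one W-way cache set with LRU replacement.
# A probe "costs" 1 unit per hit and MISS_COST per miss, standing in
# for the timing difference a real attacker measures with a cycle
# counter. Illustration only; not a real attack.
MISS_COST = 10

class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = []                      # LRU order: index 0 = oldest

    def access(self, tag):
        hit = tag in self.lines
        if hit:
            self.lines.remove(tag)
        elif len(self.lines) == self.ways:
            self.lines.pop(0)                # evict the LRU line
        self.lines.append(tag)
        return hit

def prime_probe(victim_accesses, ways=8):
    s = CacheSet(ways)
    attacker = [("attacker", i) for i in range(ways)]
    for line in attacker:                    # "prime": fill the set
        s.access(line)
    for v in victim_accesses:                # waiting phase
        s.access(("victim", v))
    # "probe": re-access the primed lines; a total above the all-hit
    # baseline (ways * 1) implies a victim access during the wait.
    return sum(1 if s.access(line) else MISS_COST for line in attacker)
```

Note that in this naive forward-order probe a single victim access cascades into many misses under LRU; this thrashing is one reason practical implementations probe the eviction set in reverse or zig-zag order.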
[Figure 1: Comparison of the operation of various cache attacks, including our novel attacks. For each attack, the figure shows the pre-attack step of establishing a timing threshold (not needed for the transactional attacks), the initialization, waiting, and measurement phases (starting a timer, or starting a transaction and bypassing "measurement"), and the detection criterion: time above or below the threshold, or an abort status indicating Cause #6, #7, or #8.]
2.1.2 FLUSH+RELOAD

The other major class of cache attacks is FLUSH+RELOAD [7, 11, 43]. FLUSH+RELOAD targets a specific address, detecting an access by any other program (or the operating system) to that exact address (or another address in the same cache line). This makes FLUSH+RELOAD a much more precise attack than PRIME+PROBE, which targets an entire cache set and is thus more prone to noise and false positives. FLUSH+RELOAD also naturally works across cores because of shared, inclusive, L3 caches (as explained in Section 2.2.1). Again, like all L3 cache attacks, FLUSH+RELOAD can detect accesses to either instructions or data. Additionally, FLUSH+RELOAD can work across VMs via the page deduplication exploit [43].

The pre-attack of FLUSH+RELOAD, like that of PRIME+PROBE, involves determining a timing threshold, but is limited to a single line instead of an entire "prime" phase. However, FLUSH+RELOAD does not require determining an eviction set. Instead, it requires the attacker to identify an exact target address; namely, an address in the attacker's virtual address space which maps to the physical address the attacker wants to monitor. Yarom and Falkner [43] present two ways to do this, both of which necessarily involve shared memory; one exploits shared libraries, and the other exploits page deduplication, which is how FLUSH+RELOAD can work across VMs. Nonetheless, this step's reliance on shared memory is a critical weakness in FLUSH+RELOAD, limiting it to only be able to monitor targets in shared memory.

In FLUSH+RELOAD's initialization phase, the attacker "flushes" the target address out of the cache using Intel's CLFLUSH instruction. Later, in the measurement phase, the attacker "reloads" the target address (by accessing it), carefully observing the time for the access. If the access was "fast", the attacker may conclude that another program accessed the address, causing it to be reloaded into the cache.

An improved variant of FLUSH+RELOAD, FLUSH+FLUSH [7], exploits timing variation in the CLFLUSH instruction itself; this enables the attack to combine its measurement and initialization phases, much like PRIME+PROBE. A different variant, EVICT+RELOAD [8], performs the initialization phase by evicting the cacheline with PRIME+PROBE's "prime" phase, allowing the attack to work without the CLFLUSH instruction at all, e.g., when the instruction has been disabled, as in Google Chrome's NaCl [6].
[Figure 2: "Double coverage" of prototype groups generated by PRIME+ABORT- and PRIME+PROBE-based versions of Algorithm 1: for each attack, the percentage of lines whose second-highest detection rate is at least a given value. With PRIME+PROBE, some tested cachelines are reliably detected by more than one prototype eviction set. In contrast, with PRIME+ABORT each tested cacheline is reliably detected by only one prototype eviction set.]
cess) in each transaction. Its high speed and low overheads allow it to catch each victim access in a separate transaction. Figure 4 shows the performance of unmodified

[Footnote, continued:] Mastik: every probe step, we actually perform multiple probes, "counting" only the first one. In our case we perform five probes at a time, still alternating between forwards and backwards probes. All of the results which we present for the "unmodified" implementation include this slight modification.

Figure 4: Access detection rates for unmodified PRIME+PROBE in the "control" and "treatment" conditions. Data were collected over 100 trials, each involving several different victim access speeds. Shaded regions indicate the range of the middle 75% of the data; lines indicate the medians. The y = x line is added for reference and indicates perfect performance for the "treatment" condition (all events detected but no false positives or oversampling).
during every probe, but this approach will observe only a single streak and indicate a single event occurred.

Observing the two competing implementations of PRIME+PROBE on our hardware reveals an interesting tradeoff. The original implementation has good high-frequency performance, but suffers from both oversampling and a high number of false positives. In contrast, the modified implementation has poor high-frequency performance, but does not suffer from oversampling and exhibits fewer false positives. For the remainder of this paper we consider the modified implementation of PRIME+PROBE only, as we expect that its improved accuracy and fewer false positives will make it more desirable for most applications. Finally, we note that PRIME+ABORT combines the desirable characteristics of both PRIME+PROBE implementations, as it exhibits the fewest false positives, does not suffer from oversampling, and has good high-frequency performance, with a top speed around one million events per second.

4.4 Attacks on AES

In this section we evaluate the performance of PRIME+ABORT in an actual attack by replicating the attack on OpenSSL's T-table implementation of AES, as conducted by Gruss et al. [7]. As those authors acknowledge, this implementation is no longer enabled by default due to its susceptibility to these kinds of attacks. However, as with their work, we use it for the purpose of comparing the speed and accuracy of competing attacks. Gruss et al. compared PRIME+PROBE, FLUSH+RELOAD, and FLUSH+FLUSH [7]; we have chosen to compare PRIME+PROBE and PRIME+ABORT, as these attacks do not rely on shared memory. Following their methods, rather than using previously published results directly, we rerun previous attacks alongside ours to ensure fairness, including the same hardware setup. Figures 6 and 7 provide the results of this experiment. In this chosen-plaintext attack, we listen for accesses to the first cacheline of the first T-Table (Te0) while running encryptions. We expect that when the first four bits of our plaintext match the first four bits of the key, the algorithm will access this cacheline one extra time over the course of each encryption compared to when the bits do not match. This will manifest as causing more events to be detected by PRIME+ABORT or PRIME+PROBE respectively, allowing the attacker to predict the four key bits. The attack can then be continued for each byte of plaintext (monitoring a different cacheline of Te0 in each case) to reveal the top four bits of each key byte.

In our experiments, we used a key whose first four bits were arbitrarily chosen to be 1110, and for each method we performed one million encryptions with each possible 4-bit plaintext prefix (a total of sixteen million encryptions for PRIME+ABORT and sixteen million for PRIME+PROBE). As shown in Figures 6 and 7, both methods correctly predict the first four key bits to be 1110, although the signal is arguably cleaner and stronger when using PRIME+ABORT.

5 Potential Countermeasures

Many countermeasures against side-channel attacks have already been proposed; Ge et al. [4] again provide an
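The final inference step of the attack is an argmax over per-prefix event counts: the 4-bit plaintext prefix that produced the most detected events is predicted to equal the top four key bits. The counts below are synthetic, invented only to illustrate the shape of the data.

```python
# Sketch of the inference step: pick the 4-bit plaintext prefix with
# the most detected cache events; it is predicted to equal the top
# four key bits. The event counts are synthetic, for illustration.
def predict_key_prefix(events_by_prefix: dict) -> int:
    return max(events_by_prefix, key=events_by_prefix.get)

# Synthetic counts: prefix 0b1110 triggers the extra Te0 access.
counts = {p: 1000 for p in range(16)}
counts[0b1110] = 1400
```

Repeating this for each plaintext byte position (monitoring a different Te0 cacheline each time) yields the top four bits of every key byte, as described above.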
[Figure 6: PRIME+ABORT attack against AES. Shown is, for each condition (first four bits of plaintext, 0000 through 1111), the percentage of additional events that were observed compared to the condition yielding the fewest events.]

[Figure 7: PRIME+PROBE attack against AES. Shown is, for each condition (first four bits of plaintext, 0000 through 1111), the percentage of additional events that were observed compared to the condition yielding the fewest events.]
excellent survey. Examining various proposed defenses in the context of PRIME+ABORT reveals that some are effective against a wide variety of attacks including PRIME+ABORT, whereas others are impractical or ineffective against PRIME+ABORT. This leads us to advocate for the prioritization and further development of certain approaches over others.

We first examine classes of side-channel countermeasures that are impractical or ineffective against PRIME+ABORT and then move toward countermeasures which are more effective and practical.

Timer-Based Countermeasures: A broad class of countermeasures ineffective against PRIME+ABORT are approaches that seek to limit the availability of precise timers, either by injecting noise into timers to make them less precise, or by restricting access to timers in general. There are a wide variety of proposals in this vein, including [15], [27], [31], [39], and various approaches which Ge et al. classify as "Virtual Time" or "Black-Box Mitigation". PRIME+ABORT should be completely immune to all timing-related countermeasures.

Partitioning Time: Another class of countermeasures that seems impractical against PRIME+ABORT is the class Ge et al. refer to as Partitioning Time. These countermeasures propose some form of "time-sliced exclusive access" to shared hardware resources. This would technically be effective against PRIME+ABORT, because the attack is entirely dependent on running simultaneously with its victim process; any context switch causes a transactional abort, so the PRIME+ABORT process must be active in order to glean any information. However, since PRIME+ABORT targets the LLC and can monitor across cores, implementing this countermeasure against PRIME+ABORT would require providing each user process time-sliced exclusive access to the LLC. This would mean that processes from different users could never run simultaneously, even on different cores, which seems impractical.

Disabling TSX: A countermeasure which would ostensibly target PRIME+ABORT's workings in particular would be to disable TSX entirely, similarly to how hyperthreading has been disabled entirely in cloud environments such as Microsoft Azure [30]. While this is technically feasible (in fact, due to a hardware bug, Intel already disabled TSX in many Haswell CPUs through a microcode update [17]), TSX's growing prevalence (Table 2), as well as its adoption by applications such as glibc (pthreads) and the JVM [24], indicates its importance and usefulness to the community. System administrators are probably unlikely to take such a drastic step.

Auditing: More practical but still not ideal is the class of countermeasures Ge et al. refer to as Auditing, which is based on behavioral analysis of running processes. Hardware performance counters in the target systems can be used to monitor LLC cache misses or miss rates, and thus detect when a PRIME+PROBE- or FLUSH+RELOAD-style attack is being conducted [1, 7, 46] (as any attack from those families will introduce a large number of cache misses, at least in the victim process). As a PRIME+PROBE-style attack, PRIME+ABORT would be just as vulnerable to these countermeasures as other cache attacks are. However, any behavioral auditing scheme is necessarily imperfect and subject to misclas-
[Figure 3: Cross-Origin SVG Filter Pixel Stealing Attack in Firefox, reproduced from [2] with permission. The labeled stages include (1) the <iframe> of the target page, (2) the target pixel (shown in red), (3) the pixel multiplication <div>, (5) the filtered rendering, and (6) the classification of the target pixel as white or black.]
on the <iframe> using the scrollTop and scrollLeft properties.

3. The target pixel is duplicated into a larger container <div> using the -moz-element CSS property. This creates a <div> that is arbitrarily sized and consists only of copies of the target pixel.

4. The SVG filter that runs in variable time (feConvolveMatrix) is applied to the pixel duplication <div>.

5. The rendering time of the filter is measured using requestAnimationFrame to get a callback when the next frame is completed and performance.now for high resolution timing.

6. The rendering time is compared to the threshold determined during the learning phase and categorized as white or black.

Since the targeted <iframe> and the attacker page are on different origins, the attacking page should not be able to learn any information about the <iframe>'s content. However, since the rendering time of the SVG filter is visible to the attacker page, and the rendering time is dependent on the <iframe> content, the attacking page is able to violate this policy and learn pixel information.

3 New floating point timing observations

Andrysco et al. [2] presented a number of timing variations in floating point computation based on subnormal and special value arguments. We expand this category and note that any value with a zero significand or exponent exhibits different timing behavior on most Intel CPUs. Figure 9 shows a summary of our findings for our primary test platform running an Intel i5-4460 CPU. Unsurprisingly, double precision floating point numbers show more types of, and larger amounts of, variation than single precision floats.

Figures 4, 5, 6, and 7 are crosstables showing average cycle counts for division and multiplication on double and single precision floats on the Intel i5-4460. We refer to the type of operation (add, subtract, divide, etc.) as the operation, and the specific combination of operands and operation as the computation. Cells highlighted in blue indicate computations that averaged 1 cycle higher than the mode across all computations for that operation. Cells in orange indicate the same for 1 cycle less than the mode. Bold face indicates a computation that had a standard deviation of > 1 cycle (none of the tests on the Intel i5-4460 had standard deviations above 1 cycle). All other crosstables in this paper follow this format unless otherwise noted.

We run each computation (operation and argument pair) in a tight loop for 40,000,000 iterations, take the total number of CPU cycles during the execution, remove loop overheads, and find the average cycles per computation. This process is repeated for each operation and argument pair and stored. Finally, we run the entire testing apparatus 10 times and store all the results. Thus, we execute each computation 400,000,000 times, split into 10 distinct samples. This apparatus measures the steady-state execution time of each computation.

The entirety of our data across multiple generations of Intel and AMD CPUs, as well as tools and instructions for generating this data, are available at https://cseweb.ucsd.edu/~dkohlbre/floats.

It is important to note that Andrysco et al. [2] focused on the performance difference between subnormal and normal operands, while we observe that there are additional classes of values worth examining. The specific differences on powers of two are more difficult to detect with a naive analysis, as they cause a slight speedup when compared to the massive slowdown of subnormals.

4 Fixed point defenses in Firefox

In version 28, Firefox switched to a new set of SVG filter implementations that caused the attack presented by Andrysco et al. [2] to stop functioning. Many of these implementations no longer used floating point math, instead using their own fixed point arithmetic.

As the feConvolveMatrix implementation now consists entirely of integer operations, we cannot use floating point timing side channels to exploit it. We instead examined a number of the other SVG filter implementations and found that several had not yet been ported to the new fixed point implementation, such as the lighting filters.
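The measurement methodology (tight loop, subtract loop overhead, average, repeat) can be approximated in user space. This sketch substitutes Python's perf_counter_ns for the paper's cycle counter, so it demonstrates the procedure rather than reproducing cycle-accurate numbers; interpreter overhead will usually swamp effects as small as a few cycles.

```python
# Sketch of the paper's measurement methodology: time a computation in
# a tight loop, subtract an empty-loop baseline, and average. Uses
# perf_counter_ns, not a cycle counter, so it illustrates the method
# only; interpreter overhead hides few-cycle differences.
import time

def avg_time_ns(op, a, b, iters=100_000) -> float:
    t0 = time.perf_counter_ns()
    for _ in range(iters):
        pass                         # measure the loop overhead alone
    overhead = time.perf_counter_ns() - t0
    t0 = time.perf_counter_ns()
    for _ in range(iters):
        op(a, b)
    total = time.perf_counter_ns() - t0
    return (total - overhead) / iters
```

Repeating the whole harness several times and keeping every sample, as the paper does, separates steady-state behavior from warm-up and scheduling noise.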
Value           Exponent    Significand
Zero            All Zeros   Zero
Infinity        All Ones    Zero
Not-a-Number    All Ones    Non-zero
Subnormal       All Zeros   Non-zero

Figure 8: IEEE-754 Special Value Encoding (Reproduced with permission from [2])

can generate intermediate values requiring the upper 32 bits. Thus, none of the filters we examined using fixed point had any instruction data timing based side channels. Handling the full range of floating point functionality in a fixed point and constant time way is expensive and complex, as seen in [2]. A side effect of a simple implementation is that it cannot handle more complex operations that could induce NaNs or infinities and must process them.
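The encodings in Figure 8 can be checked directly against a double's bit pattern, using the standard binary64 split of an 11-bit exponent and 52-bit significand:

```python
# Classify a double by the IEEE-754 encodings listed in Figure 8.
import struct

def classify(x: float) -> str:
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF        # 11-bit exponent field
    significand = bits & ((1 << 52) - 1)   # 52-bit significand field
    if exponent == 0:
        return "zero" if significand == 0 else "subnormal"
    if exponent == 0x7FF:                  # all ones
        return "infinity" if significand == 0 else "nan"
    return "normal"
```

For example, 5e-324 (the smallest positive double) classifies as "subnormal", the operand class whose slowdown drives the attacks in this paper.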
Operation   Default   FTZ & DAZ   -ffast-math

Single Precision
Add/Sub     –         –           –
Mul         S         –           –
Div         S         –           –
Sqrt        M         Z           –

Double Precision
Add/Sub     –         –           –
Mul         S         –           –
Div         M         Z           Z
Sqrt        M         Z           Z

Figure 9: Observed sources of timing differences under different settings on an Intel i5-4460. –: no variation, S: subnormals are slower, Z: all-zero exponent or significand values are faster, M: mixture of several effects

4.1 Fixed point implementation

The fixed point implementation used in Firefox SVG filters is a simple 32-bit format with no Not-a-Number, Infinity, or other special case handling. Since they make use of the standard add/subtract/multiply operations for 32-bit integers, we know of no timing side channels based on operands for this implementation. Integer division is known to be timing variable based on the upper 32 bits of 64-bit operands, but none of the filters

4.2 Lighting filter attack

Our Firefox SVG timing attack makes use of the feSpecularLighting lighting model with an fePointLight. This particular filter in this configuration is not ported to fixed point, and performs a scaling operation over the input alpha channel. The surfaceScale property in feSpecularLighting controls this scaling operation and can be set to an arbitrary floating point value when creating the filter. With this tool, we perform the following attack, similar to the one in section 2.2. We need only modify step 4 as seen below to enable the use of the new lighting filter attack.

1. Steps 1-3 are the same as section 2.2.

4.1. Apply an feColorMatrix to the pixel multiplier <div> that sets the alpha channel based entirely on the input color values. This sets the alpha channel to 1 for a black pixel input, and 0 for a white pixel input.

4.2. Apply the timing-variable feSpecularLighting filter, with a subnormal surfaceScale and an attached fePointLight, as the timing vulnerable filter.

5. Steps 5 and 6 are the same as section 2.2.
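A fixed point scheme of the kind described in Section 4.1 (32-bit values, plain integer add/subtract/multiply, no NaN or Infinity handling) can be sketched in a few lines. The Q16.16 split is our choice for illustration; the text does not specify Firefox's exact format.

```python
# Minimal Q16.16 fixed point sketch: 32-bit values, 16 fractional bits.
# Integer add/sub/mul have no operand classes with distinct timing
# (nothing like subnormals), which is the property the ported Firefox
# filters rely on. The Q16.16 layout is an assumption for illustration.
FRAC_BITS = 16
MASK32 = (1 << 32) - 1

def to_fx(x: float) -> int:
    return int(x * (1 << FRAC_BITS)) & MASK32

def fx_add(a: int, b: int) -> int:
    return (a + b) & MASK32

def fx_mul(a: int, b: int) -> int:
    return ((a * b) >> FRAC_BITS) & MASK32
```

Note what is missing: there is no representation of NaN, Infinity, or subnormals at all, which is exactly why a filter built only from these operations closes the operand-timing channel.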
Figure 10: HTML and style design for the pixel multiplying structure used in our attacks on Safari and Chrome
Cycle counts; dividend (rows) by divisor (columns):

Dividend   0.0    1.0    1e10   1e+200  1e-300  1e-42  256    257    1e-320
0.0        6.58   6.59   6.58   6.55    6.59    6.54   6.54   6.56   6.56
1.0        6.55   6.55   12.23  12.19   12.22   12.22  6.56   12.25  6.56
1e10       6.58   6.59   12.22  12.22   12.21   12.21  6.59   12.23  6.59
1e+200     6.57   6.59   12.22  12.20   12.17   12.21  6.58   12.17  6.57
1e-300     6.59   6.57   12.18  12.23   12.24   12.22  6.59   12.24  6.57
1e-42      6.58   6.56   12.21  12.25   12.23   12.18  6.56   12.21  6.58
256        6.57   6.60   12.20  12.22   12.24   12.24  6.57   12.23  6.54
257        6.57   6.58   12.22  12.23   12.25   12.20  6.57   12.23  6.58
1e-320     6.57   6.58   6.60   6.51    6.59    6.57   6.58   6.55   6.58
Figure 11: Division timing for double precision floats on Intel i5-4460+FTZ/DAZ
loops over doubles, or tight square root operations over floats. Thus, our attack must circumvent the use of the FTZ and DAZ flags altogether.

Chrome enables the FTZ and DAZ control flags whenever a filter is set to run on the CPU, which disallows our Firefox or Safari attacks from applying directly to Chrome. However, we found that the FTZ and DAZ flags are not set when a filter is going to execute on the GPU. This would normally only be useful for a GPU-based attack, but we can force the feConvolveMatrix filter to abort from GPU acceleration at the last possible moment and fall back to the CPU implementation by having a kernel matrix over the maximum supported GPU size of 36 elements. Chrome does not enable the FTZ and DAZ flags when it executes this fallback, allowing our timing attack to use subnormal values.

We force the target <div> to start on the GPU rendering path by applying a CSS transform:rotateY() to it. This is a well-known trick for causing future animations and filters to be performed on the GPU, and it works consistently. Without this, the feConvolveMatrix GPU implementation would never fire, as it will not choose the GPU over the CPU on its own. It is only because of our ability to force CPU fallback with the FTZ and DAZ flags disabled that our CPU Chrome attack functions.

Note that even if FTZ/DAZ are enabled in all cases, there are still scenarios that show timing variation, as seen in Figures 11 and 9. Chrome's Skia configuration currently uses single precision floats, and thus only needs to avoid sqrt operations as far as we know. However, any use of double precision floats will additionally require avoidance of division. We did not observe any currently vulnerable uses of single precision sqrt, or of double precision floating point operations, in the Skia codebase. We notified Google of this attack and a fix is in progress.

6.2 Frame timing on Chrome

An additional obstacle to our Chrome attack was obtaining accurate frame render times. Unlike on Firefox or Safari, adding a filter to a <div>'s style and then calling getAnimationFrame is insufficient to be sure that the time until the callback occurs will accurately represent the rendering time of the filter. In fact, the frame that the filter is actually rendered on differs by platform and is not consistent on Linux. We instead run Algorithm 1 to get the approximate rendering time of a given frame. Since we only care about the relative rendering time between white and black pixels, the possible extra time included doesn't matter as long as it is moderately consistent. This technique allowed our attack to operate on all tested platforms without modification.

7 Revisiting the effectiveness of Escort

Escort [15] proposes defenses against multiple types of timing side channels, notably a defense using SIMD vector operations to protect against the floating point attack presented by Andrysco et al. in [2].

Single Instruction, Multiple Data (SIMD) instructions are an extension to the x86_64 ISA designed to improve the performance of vector operations. These instructions allow 1-4 independent computations of the same operation (divide, add, subtract, etc.) to be performed at once using large registers. By placing the first set of operands in the top half of the register, and the second set of operands in the bottom half, multiple computations can be easily performed with a single opcode. Intel does not provide significant detail about the execution of these instructions and does not provide guarantees about their performance behavior.

Figure 13: Timing differences observed for libdrag vs default operations on an AMD Phenom II X2 550. – : no variation, S : subnormals are slower, Z : all zero exponent or significand values are faster, M : mixture of several effects.

              Double Precision
    Add/Sub   S   S
    Mul       S   –
    Div       S   –
    Sqrt      S   –

7.1 Escort overview

Escort performs several transforms during compilation designed to remove timing side channels. First, they modify 'elementary operations' (floating point math operations for the purpose of this paper). Second, they perform a number of basic block linearizations, array access changes, and branch removals to transform the control flow of the program to constant time and minimize side effects. We do not evaluate the efficacy of the higher level control flow transforms and instead evaluate only the elementary operations.

Escort's tool is to construct a set of dummy operands (the escort) that are computed at the same time as the secret operands to obscure the running time of the secret operands. Escort places the dummy arguments in one lane of the SIMD instruction, and the sensitive arguments in another lane. Since the instruction only retires when the full set of computations is complete, the running time of the entire operation is hypothesized to be dependent only on the slowest operation. This is true if and only if the different lanes are computed in parallel. To obscure the running time of the sensitive operands, Escort places two subnormal arguments in the dummy lane of all modified operations under the assumption that this will exercise the slowest path through the hardware.

Escort will replace most floating point operations it encounters. However, if it can prove (using the Z3 SMT solver [4]) that the operation will never have subnormal values as operands, it declines to replace the operation. This means that if a function filters out subnormals be-
Figure 14: Division timing for double precision floats on Intel i5-4460+Escort

Runtime (seconds); rows are dividends, columns are divisors:

Dividend    0.0    1.0    1.0e-10  1.0e-323  1.0e-43  1.0e100  256    257
0.0         10.09  10.08  10.08    10.08     10.08    10.08    10.08  10.10
1.0         10.08  10.08  10.55    10.08     10.55    10.55    10.08  10.55
1.0e-10     10.08  10.08  10.55    10.08     10.55    10.55    10.08  10.55
1.0e-323    10.08  10.08  10.08    10.08     10.08    10.08    10.08  10.08
1.0e-43     10.08  10.08  10.55    10.08     10.55    10.55    10.08  10.55
1.0e100     10.08  10.08  10.55    10.08     10.55    10.55    10.08  10.55
256         10.08  10.08  10.55    10.08     10.57    10.55    10.08  10.57
257         10.09  10.08  10.55    10.08     10.57    10.55    10.08  10.55

Figure 15: Division timing for double precision floats on Intel i5-4460 macro-test

Figure 16: Division timing for single precision floats on Nvidia GeForce GT 430
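The lane-pairing construction of Section 7.1 can be sketched with SSE2 intrinsics. This is our own simplified illustration of the escort idea, not code generated by Escort's compiler pass; the dummy value 4.9e-324 is the smallest positive subnormal double.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_div_pd */

/* Divide `secret` by `divisor` while a dummy subnormal division occupies
 * the other SIMD lane.  Both lanes are issued as one divpd; the instruction
 * retires only when the slower (subnormal) lane finishes, which is the
 * property Escort relies on to mask the secret lane's latency. */
static double escorted_div(double secret, double divisor) {
    const double dummy = 4.9e-324;             /* smallest positive subnormal */
    __m128d num = _mm_set_pd(dummy, secret);   /* upper lane = dummy escort   */
    __m128d den = _mm_set_pd(3.0, divisor);    /* lower lane = real operands  */
    __m128d q = _mm_div_pd(num, den);
    double out[2];
    _mm_storeu_pd(out, q);
    return out[0];                             /* result of the secret lane   */
}
```

As the paper notes, this masking holds only if the two lanes actually execute in parallel; the sketch shows the data layout, not a guarantee of constant time.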
and CSS transforms; Safari and Chrome. Currently, Safari only supports a subset of CSS transformations on the GPU, and none of the SVG transforms. Chrome supports a subset of the CSS and SVG filters on the GPU. Firefox intends to port filters to the GPU, but there is currently no support.

8.2 Performance

We performed a series of CUDA benchmarks on an Nvidia GeForce GT 430 to determine the impact of subnormal values on computation time. The results for division are shown in Figure 16. All other results (add, sub, mul) were constant time regardless of the inputs.

As Figure 16 shows, subnormals induce significant slowdowns on division operations for single precision floats. Unfortunately, no SVG filters implemented in Chrome on the GPU perform tight division loops. Thus, extracting timing differences from the occasional division they do perform is extremely difficult.

If a filter were found to perform tight division loops, or a GPU that has timing variation on non-division operations were found, the same attacks as in previous sections could be ported to the GPU-accelerated filters.

We believe that even without a specific attack, the demonstration of timing variation based on operand values in GPUs should invalidate "move to the GPU" as a defensive strategy.

9 Related work

Felten and Schneider were the first to mount timing side-channel attacks against browsers. They observed that resources already present in the browser's cache are loaded faster than ones that must be requested from a server, and that this can be used by malicious JavaScript to learn what pages a user has visited [6]. Felten and Schneider's history sniffing attack was later refined by Zalewski [18]. Because many sites load resources specific to a user's approximate geographic location, cache timing can reveal the user's location, as shown by Jia et al. [10].

JavaScript can also ask the browser to make a cross-origin request and then learn (via callback) how long the response took to arrive and be processed. Timing channels can be introduced by the code that runs on the server to generate the response; by the time it takes the response to be transmitted over the network, which will depend on how many bytes it contains; or by the browser code that attempts to parse the response. These cross-site timing attacks were introduced by Bortz, Boneh, and Nandy [3], who showed they could be used to learn the number of items in a user's shopping cart. Evans [5] and, later, Gelernter and Herzberg [7], showed they could be used to confirm the presence of a specific string in a user's search history or webmail mailbox. Van Goethem, Joosen, and Nikiforakis [17] observed that callbacks introduced to support HTML5 features allow attackers to time individual stages in the browser's response-processing pipeline, thereby learning response size more reliably than with previous approaches.

The interaction of new browser features — TypedArrays, which translate JavaScript variable references to memory accesses more predictably, and nanosecond-resolution clocks — allows attackers to learn whether specific lines have been evicted from the processor's last-level cache. Yossi Oren first showed that such microarchitectural timing channels can be mounted from JavaScript [14], and used them to learn gross system activity. Recently, Gras et al. [8] extended Oren's techniques to learn where pages are mapped in the browser's virtual memory, defeating address-space layout randomization. In response, browsers rounded down the clocks provided to JavaScript to 5 µs granularity. Kohlbrenner and Shacham [12] proposed a browser architecture that degrades the clocks available to JavaScript in a more principled way, drawing on ideas from the "fuzzy time" mitigation [9] in the VAX VMM Security Kernel [11].
[8] B. Gras, K. Razavi, E. Bosman, H. Bos, and C. Giuffrida, "ASLR on the line: Practical cache attacks on the MMU," in Proceedings of NDSS 2017, A. Juels, Ed. Internet Society, Feb. 2017.

[18] M. Zalewski, "Rapid history extraction through non-destructive cache timing," Online: http://lcamtuf.coredump.cx/cachetime/, Dec. 2011.
bd31fb21454609b125ade1ad569ebcc2a2b9b73c

2 https://github.com/openssl/openssl/commit/3e00b4c9db42818c621f609e70569c7d9ae85717

8aed2a7548362e88e84a7feb795a3a97e8395008

4 https://rt.openssl.org/Ticket/Display.html?id=3149&user=guest&pass=guest
[Figure: latency measurements over time]

shift and subtraction operations. To that end, we identify the routines used in the BN_mod_inverse method leaking side-channel information.

The BN_mod_inverse method operates with very
tual addition. Two different routines are called to perform right-shift operations. The BN_rshift1 routine
[22] NGUYEN, P. Q. AND SHPARLINSKI, I. E. 2003. The insecurity of the Elliptic Curve Digital Signature Algorithm with partially known nonces. Des. Codes Cryptography 30, 2, 201–217.

[23] PERCIVAL, C. 2005. Cache missing for fun and profit. In BSDCan 2005, Ottawa, Canada, May 13-14, 2005, Proceedings.

[24] PEREIDA GARCÍA, C., BRUMLEY, B. B., AND YAROM, Y. 2016. "Make sure DSA signing exponentiations really are constant-time". In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, October 24-28, 2016. 1639–1650.

B Supplementary Empirical Data

Table 6 contains the raw data used to produce Figure 6.

Table 6: Empirical number of extracted bits for various sequence lengths. Each sequence length consisted of 226 trials, over which we calculated the mean (with deviation), maximum, and minimum number of recovered LSBs. See Figure 6 for an illustration.
Abstract

Function type signatures are important for binary analysis, but they are not available in COTS binaries. In this paper, we present a new system called EKLAVYA which trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inference. More importantly, EKLAVYA results are "explicable": we find by analyzing its model that it auto-learns relationships between instructions, compiler conventions, stack frame setup instructions, use-before-write patterns, and operations relevant to identifying types directly from binaries. In our evaluation on Linux binaries compiled with clang and gcc, for two different architectures (x86 and x64), EKLAVYA exhibits accuracy of around 84% and 81% for function argument count and type recovery tasks respectively. EKLAVYA generalizes well across the compilers tested on two different instruction sets with various optimization levels, without any specialized prior knowledge of the instruction set, compiler or optimization level.

1 Introduction

Binary analysis of executable code is a classical problem in computer security. Source code is often unavailable for COTS binaries. As the compiler does not preserve a lot of language-level information, such as types, in the process of compilation, reverse engineering is needed to recover the semantic information about the original source code from binaries. Recovering semantics of machine code is important for applications such as code hardening [54, 34, 53, 26, 52], bug-finding [39, 47, 10], clone detection [18, 38], patching/repair [17, 16, 41] and analysis [12, 22, 21]. Binary analysis tasks can vary from reliable disassembly of instructions to recovery of control-flow, data structures or full functional semantics. The higher the level of semantics desired, the more specialized the analysis, requiring more expert knowledge.

Commercial binary analysis tools widely used in the industry rely on domain-specific knowledge of compiler conventions and specialized analysis techniques for binary analysis. Identifying idioms common in binary code and designing analysis procedures, both principled and heuristic-based, have been an area that is reliant on human expertise, often engaging years of specialized binary analysts. Analysis engines need to be continuously updated as compilers evolve or newer architectures are targeted. In this work, we investigate an alternative line of research, which asks whether we can train machines to learn features from binary code directly, without specifying compiler idioms and instruction semantics explicitly. Specifically, we investigate the problem of recovering function types / signatures from binary code — a problem with wide applications to control-flow hardening [54, 34, 53] and data-dependency analysis [31, 40] on binaries — using techniques from deep learning.

The problem of function type recovery has two sub-problems: recovering the number of arguments a function takes / produces and their types. In this work, we are interested in recovering argument counts and C-style primitive data types.1 Our starting point is a list of functions (bodies), disassembled from machine code, which can be obtained using standard commercial tools or using machine learning techniques [7, 43]. Our goal is to perform type recovery without explicitly encoding any semantics specific to the instruction set being analyzed or the conventions of the compiler used to produce the binary. We restrict our study to Linux x86 and x64 applications in this work, though the techniques presented extend naturally to other OS platforms.

Approach. We use a recurrent neural network (RNN) architecture to learn function types from disassembled

∗ Lead authors are alphabetically ordered.
1 int, float, char, pointers, enum, union, struct
Number of Arguments
Arch Task Opt. Metrics Accuracy
0 1 2 3 4 5 6 7 8 9
Data Distribution 0.059 0.380 0.288 0.170 0.057 0.023 0.012 0.004 0.004 0.001
O0 Precision 0.958 0.974 0.920 0.868 0.736 0.773 0.600 0.388 0.231 0.167 0.9113
Recall 0.979 0.953 0.899 0.913 0.829 0.795 0.496 0.562 0.321 0.200
Data Distribution 0.059 0.374 0.290 0.169 0.059 0.026 0.013 0.003 0.004 0.001
O1 Precision 0.726 0.925 0.847 0.819 0.648 0.689 0.569 0.474 0.456 0.118 0.8348
Recall 0.872 0.911 0.836 0.756 0.759 0.703 0.719 0.444 0.758 0.133
Task1
Data Distribution 0.056 0.375 0.266 0.187 0.057 0.032 0.015 0.004 0.005 0.001
O2 Precision 0.692 0.907 0.828 0.758 0.664 0.620 0.606 0.298 0.238 0.250 0.8053
Recall 0.810 0.912 0.801 0.645 0.782 0.730 0.637 0.262 0.357 0.300
Data Distribution 0.045 0.387 0.275 0.184 0.051 0.029 0.016 0.004 0.005 0.002
O3 Precision 0.636 0.935 0.862 0.801 0.570 0.734 0.459 0.243 0.231 0.200 0.8391
Recall 0.760 0.921 0.849 0.724 0.691 0.747 0.637 0.196 0.375 0.167
x86
Data Distribution 0.068 0.307 0.313 0.171 0.070 0.034 0.018 0.009 0.005 0.002
O0 Precision 0.935 0.956 0.910 0.957 0.910 0.789 0.708 0.808 0.429 0.500 0.9270
Recall 0.911 0.975 0.963 0.873 0.856 0.882 0.742 0.568 0.692 0.600
Data Distribution 0.066 0.294 0.320 0.173 0.073 0.034 0.019 0.009 0.005 0.003
O1 Precision 0.725 0.821 0.667 0.692 0.463 0.412 0.380 0.462 0.182 0.000 0.6934
Recall 0.697 0.822 0.795 0.574 0.420 0.466 0.284 0.115 0.167 0.000
Task2
Data Distribution 0.065 0.283 0.326 0.179 0.068 0.036 0.021 0.011 0.005 0.002
O2 Precision 0.721 0.761 0.655 0.639 0.418 0.535 0.484 0.667 0.200 0.000 0.6660
Recall 0.607 0.798 0.792 0.495 0.373 0.434 0.517 0.308 0.286 0.000
Data Distribution 0.051 0.248 0.346 0.188 0.076 0.038 0.023 0.013 0.008 0.003
O3 Precision 0.600 0.788 0.626 0.717 0.297 0.452 0.250 0.200 0.143 0.000 0.6534
Recall 0.682 0.822 0.801 0.509 0.321 0.326 0.190 0.071 0.167 0.000
Data Distribution 0.061 0.385 0.288 0.166 0.056 0.021 0.012 0.004 0.004 0.0
O0 Precision 0.858 0.957 0.914 0.916 0.818 0.891 0.903 0.761 0.875 0.333 0.9203
Recall 0.913 0.941 0.936 0.930 0.719 0.853 0.829 0.944 0.667 0.800
Data Distribution 0.057 0.379 0.283 0.174 0.060 0.022 0.013 0.005 0.004 0.001
O1 Precision 0.734 0.897 0.843 0.884 0.775 0.829 0.882 0.788 0.778 0.500 0.8602
Recall 0.766 0.899 0.901 0.817 0.677 0.815 0.714 0.839 0.359 0.818
Task1
Data Distribution 0.055 0.384 0.260 0.187 0.061 0.027 0.014 0.004 0.006 0.001
O2 Precision 0.624 0.900 0.816 0.842 0.775 0.741 0.866 0.708 0.667 0.545 0.8380
Recall 0.686 0.886 0.863 0.822 0.667 0.764 0.785 0.836 0.519 0.600
Data Distribution 0.044 0.382 0.290 0.173 0.054 0.028 0.018 0.004 0.002 0.002
O3 Precision 0.527 0.908 0.767 0.832 0.654 0.878 0.848 0.613 0.667 0.600 0.8279
Recall 0.680 0.864 0.867 0.794 0.602 0.761 0.857 0.826 0.444 0.600
x64
Data Distribution 0.071 0.309 0.312 0.170 0.068 0.032 0.018 0.009 0.005 0.002
O0 Precision 0.971 0.988 0.986 0.991 0.952 0.962 0.733 0.839 0.714 1.000 0.9748
Recall 0.981 0.992 0.985 0.980 0.972 0.969 0.873 0.565 0.556 0.500
Data Distribution 0.066 0.297 0.319 0.175 0.070 0.034 0.019 0.010 0.005 0.002
O1 Precision 0.625 0.811 0.690 0.891 0.780 0.773 0.531 0.576 0.333 - 0.7624
Recall 0.649 0.833 0.853 0.662 0.697 0.780 0.680 0.487 0.059 0.000
Task4
Data Distribution 0.059 0.272 0.336 0.179 0.071 0.037 0.020 0.012 0.006 0.003
O2 Precision 0.669 0.814 0.733 0.911 0.785 0.761 0.486 0.353 0.333 - 0.7749
Recall 0.658 0.833 0.882 0.697 0.688 0.761 0.548 0.273 0.167 0.000
Data Distribution 0.048 0.213 0.361 0.190 0.086 0.042 0.029 0.013 0.006 0.004
O3 Precision 0.636 0.824 0.775 0.912 0.913 0.720 0.400 0.250 0.000 - 0.7869
Recall 0.875 0.884 0.912 0.722 0.764 0.720 0.429 0.111 0.000 0.000
pared to other instructions.

Further, EKLAVYA seems to correctly place emphasis on the instruction which reads a particular register before writing to it. This matches our intuitive way of finding arguments by identifying "use-before-write" instructions (with liveness analysis). For example, in the function check_sorted (Table 3(d)), the register rcx is used in a number of instructions. The saliency map marks the most important instruction to be the correct one that uses the register before write. Finally, the function EmptyTerminal also shows evidence that EKLAVYA is not blindly memorizing register names (e.g. rcx) universally for all functions. It correctly de-emphasizes that the instruction movq %ecx, %eax is not related to argument passing. In this example, rcx has been clobbered by the instruction movl $0x10, %ecx before reaching the movq instruction, and EKLAVYA accurately recognizes that rcx is not used as an argument here. We have manually verified this finding consistently on 20 random samples we analyzed.

Table 4 (excerpt), per-instruction saliency scores:

(c) "dtotimespec" compiled with gcc-32-O3 (2nd argument: float): subl $0x10, %esp 0.383451; fldl 0x1c(%esp) 1.000000; movl 0x18(%esp), %ecx 0.511937; flds 0x8050a90 0.873672; fxch %st(1) 0.668212; ...

(d) "print_icmp_header" compiled with clang-64-O1 (2nd argument: pointer): movl %ecx, %r15d 0.431204; movq %rdx, %r14 0.399483; movzbl (%rsi), %ebp 1.000000; testb $0x20, 0x20b161(%rip) 0.336855; jne 0x2d 0.254520; movl 0x18(%r14), %eax 0.507721; movq 0x20b15c(%rip), %rcx 0.280275; ...

Argument accesses after local frame creation. In our analyzed samples, EKLAVYA marks the arithmetic instruction that allocates the local stack frame as relatively important. This is because in the compilers we tested, the access to arguments begins after the stack frame pointer has been adjusted to allocate the local frame. EKLAVYA learns this convention and emphasizes its importance in locating instructions that access arguments (see Table 3).

We highlight two other findings we have confirmed manually. First, EKLAVYA correctly identifies arguments passed on the stack as well. This is evident in 20 functions we sampled from the set of functions that accept arguments on the stack, which is a much more common phenomenon in x86 binaries that have fewer registers. Second, the analysis of instructions passing arguments from the body of the caller is nearly as accurate as that of callees. A similar saliency-map-based analysis of the caller's body identifies the right registers, and the setup of stack-based arguments is consistently marked as relatively high in importance. Due to space reasons, we have not shown the saliency maps for these examples here.

Operations to type. With a similar analysis of saliency maps, we find that EKLAVYA learns instruction patterns to identify types. For instance, as shown in the examples of Table 4, the saliency map highlights the relative importance of instructions. One can see that instructions that use byte-wide registers (e.g. dl) are given importance when EKLAVYA predicts the type to be char. This matches our semantic understanding that the char type is one byte and will often be used in operands of the corresponding bit-width. Similarly, we find that in cases where EKLAVYA predicts the type to be a pointer, the instructions marked as important have indirect register base addressing with the right registers carrying the pointer values. Where float is correctly predicted, the instructions highlighted involve XMM registers or floating point instructions. These findings exhibit consistently in our sampled sets, showing that EKLAVYA mirrors our intuitive understanding of the semantics.

5.3.3 Network Mispredictions

We provide a few concrete examples of EKLAVYA mispredictions. These examples show that principled program analysis techniques would likely discern such errors; therefore, EKLAVYA does not mimic a full liveness tracking function yet. To perform this analysis, we inspect a random subset of the mispredictions for each of the tasks using the saliency map. In some cases we can speculate the reasons for mispredictions, though these are best-effort estimates. Our findings are presented in the form of 2 case studies below.

Table 5 (excerpt), per-instruction saliency scores:

(a) "ck_fopen" compiled with clang and O1 (true type of first argument is pointer but predicted as int): pushq %rbx 0.175079; movq %rdi, %rbx 0.392229; callq 0x3fc 1.000000; testq %rax, %rax 0.325375; je 0x1004 0.579551; popq %rbx 0.164043; retq 0.135274; movq %rbx, %rdi 0.365685; callq 0xe6d 0.665486

(b) "prompt" compiled with gcc and O0 (number of arguments is 6 but predicted as 7): pushq %rbx 0.025531; subq $0x100, %rsp 0.163929; movq %rdi, -0xe8(%rbp) 0.314619; movq %rsi, -0xf0(%rbp) 0.235489; movl %edx, %eax 0.308323; movq %rcx, -0x100(%rbp) 0.435364; movl %r8d, -0xf8(%rbp) 0.821577; movq %r9, -0x108(%rbp) 1.000000; movb %al, -0xf4(%rbp) 0.24482; ...

As shown in Table 5, the second argument is mispredicted as an integer in the first example, while in the second case study the second argument is mispredicted as a pointer. From these two examples, it is easy to see how the model has identified instructions which provide hints to what the types are. In both cases, the highlighted instructions suggest possibilities of multiple types, and the misprediction corresponds to one of them. The exact reasons for mispredictions are unclear, but this seems to suggest that the model is not robust against situations where there can be multiple type predictions for different argument positions. We speculate that this is due to the design choice of training a separate sub-network for each specific argument position, which potentially requires the network to infer calling conventions from just type information.

In the same example as above, the first argument is mispredicted as well. The ground truth states that the first argument is a pointer, whereas EKLAVYA predicts an integer. This shows another situation where the model makes a wrong prediction, namely when the usage of the argument within the function body provides insufficient hints for the type usage.

We group all mispredictions we have analyzed into three categories: insufficient information, high argument counts, and off-by-one errors. A typical example of a misprediction due to lack of information is when the function takes in more arguments than it actually uses. The first example in Table 6 shows an example of it. Typically, for functions with high argument counts

6 Related Work

Machine Learning on Binaries. Extensive literature exists on applying machine learning to other binary analysis tasks. Such tasks include malware classification [42, 5, 30, 20, 36, 15] and function identification [37, 7, 43]. The closest related work to ours is by Shin et al. [43], which applies RNNs to the task of function boundary identification. These results have high accuracy, and such techniques can be used to create the inputs for EKLAVYA. At a technical level, our work employs word-embedding techniques and we perform in-depth analysis of the model using dimensionality reduction, analogical reasoning and saliency maps. These analysis techniques have not been used in studying the learnt models for binary analysis tasks. For function identification, Bao et al. [7] utilize weighted prefix trees to improve the efficiency of function identification. Many other works use traditional machine learning techniques such as n-gram analysis [42], SVMs [36], and conditional random fields [37] for binary analysis tasks (different from ours).

Word embedding is a commonly used technique in such tasks, since these tasks require a way to represent words as vectors. These word embeddings can generally be categorized into two approaches, count-based [13, 32] and prediction-based [24, 28]. Neural networks are also frequently used for tasks like language translation [11, 49] and parsing [46, 45].

Function Arguments Recovery. In binary analysis, recovery of function arguments [51, 23, 14] is an important component used in multiple problems. Some examples of the tasks include hot patching [33] and fine-grained control-flow integrity enforcement [51]. To summarize, there are two main approaches used to recover function arguments: liveness analysis and heuristic methods based on calling conventions and idioms. Veen et al. [51] in their work make use of both these methods
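The "use-before-write" intuition that EKLAVYA appears to rediscover can be approximated by a conventional heuristic. The following is our own toy sketch with a made-up instruction encoding, not the paper's model: an argument register counts as an argument if some instruction reads it before any instruction writes it.

```c
#include <string.h>

/* Toy instruction: up to two registers read, one register written. */
struct insn {
    const char *reads[2];
    const char *write;
};

/* Count argument registers that are read before being written, a crude
 * stand-in for the liveness signal the saliency maps highlight. */
static int count_args(const struct insn *body, int n,
                      const char **arg_regs, int n_regs) {
    int count = 0;
    for (int r = 0; r < n_regs; r++) {
        for (int i = 0; i < n; i++) {
            const struct insn *in = &body[i];
            int read = 0;
            for (int j = 0; j < 2; j++)
                if (in->reads[j] && strcmp(in->reads[j], arg_regs[r]) == 0)
                    read = 1;
            if (read) { count++; break; }   /* used before written: argument */
            if (in->write && strcmp(in->write, arg_regs[r]) == 0)
                break;                      /* clobbered first: not an argument */
        }
    }
    return count;
}
```

On a toy body where rdi and rsi are read before any write but rdx is written first, the heuristic reports two arguments, mirroring the clobbered-rcx example discussed above.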
…
rows are contained in the allocated memory. If the operating system maps the attacker's allocated memory to the physical memory containing the aggressor rows, the attacker has satisfied the first condition. Since the attacker has no influence on the mapping between virtual memory and physical memory, she cannot directly influence this step, but she can increase her probability by repeatedly allocating large amounts of memory. Once control over the aggressor rows is achieved, the attacker releases all allocated memory except the parts which contain the aggressor rows. Next, victim data that should be manipulated has to be placed on the victim row. Again, the attacker cannot influence which data is stored in the physical memory and needs to resort to a probabilistic approach. The attacker induces the creation of many copies of the victim data with the goal that one copy of the victim data will be placed in the victim row. The attacker cannot directly verify whether the second step was successful, but can simply execute the rowhammer attack and validate whether the attack was successful. If not, the second step is repeated until the rowhammer successfully executes.

Seaborn et al. [20] successfully implemented this approach to compromise the kernel from an unprivileged user process. They gain control over the aggressor rows and then let the OS create huge amounts of page table entries with the goal of placing one page table entry in the victim row. By flipping a bit in a page table entry, they gained control over a subtree of the page tables, allowing them to manipulate memory access control policies.

3 Threat Model and Assumptions

Our threat model is in line with related work [4, 7, 17, 18, 20, 26]:

• We assume that the operating system kernel is not vulnerable to software attacks. While this is hard to implement in practice, it is a common assumption in the context of rowhammer attacks.

• The attacker controls a low-privileged user mode process, and hence, can execute arbitrary code but has only limited access to other system resources, which are protected by the kernel through mandatory and discretionary access control.

• We assume that the system's RAM is vulnerable to rowhammer attacks. Many commonly used systems (see Table 1) include vulnerable RAM.

[Figure 3: CATT constrains bit flips to the process' security domain. Panel labels: Security Domain, Memory Tracking.]

4 Design of CATT

In this section, we present the high-level idea and design of our practical software-based defense against rowhammer attacks. Our defense, called CATT,2 tackles the malicious effect of rowhammer-induced bit flips by instrumenting the operating system's memory allocator to constrain bit flips to the boundary where the attacker's malicious code executes. CATT is completely transparent to applications, and does not require any hardware changes.

Overview. The general idea of CATT is to tolerate bit flips by confining the attacker to memory that is already under her control. This is fundamentally different from all previously proposed defense approaches, which aimed to prevent bit flips (cf. Section 9). In particular, CATT prevents bit flips from affecting memory belonging to higher-privileged security domains, e.g., the operating system kernel or co-located virtual machines. As discussed in Section 2.2, a rowhammer attack requires the adversary to bypass the CPU cache. Further, the attacker must arrange the physical memory layout such that the targeted data is stored in a row that is physically adjacent to rows that are under the control of the attacker. Hence, CATT ensures that memory between these two entities is physically separated by at least one row.3

To do so, CATT extends the physical memory allocator to partition the physical memory into security domains. Figure 3 illustrates the concept. Without CATT, the attacker is able to craft a memory layout where two ag-

2 CAn't Touch This
3 Kim et al. [11] mention that the rowhammer fault does not only affect memory cells of directly adjacent rows, but also memory cells of rows that are next to the adjacent row. Although we did not encounter such cases in our experiments, CATT supports multiple row separation between adversary and victim data memory. Further detailed discussion can be found in Section 8.
can be expressed as a power of two (this is referred to as the order of the allocation). Hence, the buddy allocator's smallest level of granularity is a single memory page. The buddy allocator already partitions the system

Randomization (KASLR) may slightly modify this address according to a limited offset.
5 The exact location for the split can be chosen at compile time. Hence, the partitioning is not fixed but can be chosen arbitrarily (e.g., 20-80, 50-50, 75-25, etc.).
Table 2: Technical specifications of the vulnerable systems used for our evaluation.

accesses activate a row in the respective DRAM module. This process is repeated 10^6 times.9 Lastly, the potential victim address can be checked for bit flips by comparing its memory content with the fixed pattern bit. The test outputs a list of addresses for which bit flips have been observed, i.e., a list of victim addresses.

Preliminary Tests for Vulnerable Systems. Using the rowhammering testing tool we evaluated our target systems. In particular, we were interested in systems that yield reproducible bit flips, as only those are relevant for practical rowhammer attacks. This is because the attack requires two steps. First, the attacker needs to allocate chunks of memory, and test each chunk to identify vulnerable memory. Second, the attacker needs to exploit the vulnerable memory. Since the attacker cannot force the system to allocate page tables at a certain physical position in RAM, the attacker has to repeatedly spray the memory with page tables to increase the chances of hitting the desired memory location. Both steps rely on reproducible bit flips.

Hence, we configured the rowhammering tool to only report memory addresses where bit flips can be triggered repeatedly. We subsequently confirmed that this list indeed yields reliable bit flips by individually triggering the reported addresses and checking for bit flips within an interval of 10 seconds. Additionally, we tested the bit flips across reboots through random sampling.

The three systems mentioned in Table 1 and Table 2 are highly susceptible to reproducible bit flips. Executing the rowhammer test on these systems three times and rebooting the system after each test run, we found 133 pages with exploitable bit flips for S1, 31 pages for S2, and 23 pages for S3.

To install CATT, we patched the Linux kernel of each system to use our modified memory allocator. Recall that CATT does not aim to prevent bit flips but rather constrain them to a security domain. Hence, executing the rowhammer test on CATT-hardened systems still locates vulnerable pages. However, in the following, we demonstrate based on a real-world exploit that the vulnerable pages are not exploitable.

Table 3: Results of our security evaluation. We found that CATT mitigates rowhammer attacks. We executed the rowhammer test on each system three times and averaged the amount of bit flips.

     Rowhammer Exploit: Success (avg. # of tries)
          Vanilla System    CATT
    S1    ✓ (11)            ✗ (3821)
    S2    ✓ (42)            ✗ (3096)
    S3    ✓ (53)            ✗ (3768)

6.2 Real-world Rowhammer Exploit

To further demonstrate the effectiveness of our mitigation, we tested CATT against a real-world rowhammer exploit. The goal of the exploit is to escalate the privileges of the attacker to kernel privileges (i.e., gain root access). To do so, the exploit leverages rowhammer to manipulate the page tables. Specifically, it aims to manipulate the access permission bits for kernel memory, i.e., reconfigure its access permission policy. A second option is to manipulate page table entries in such a way that they point to attacker-controlled memory, thereby allowing the attacker to install new arbitrary memory mappings.10

To launch the exploit, two conditions need to be satisfied: (1) a page table entry must be present in a vulnerable row, and (2) the enclosing aggressor pages must be allocated in attacker-controlled memory.

Since both conditions are not directly controllable by the attacker, the attack proceeds as follows: the attacker allocates large memory areas. As a result, the operating system needs to create large page tables to maintain the newly allocated memory. This in turn increases the probability to satisfy the aforementioned conditions, i.e., a page table entry will eventually be allocated to a victim

9 This value is the hardcoded default value. Prior research [11, 12] reported similar numbers.
10 The details of this attack option are described by Seaborn et al. [20].
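The victim-detection procedure described above (fill memory with a fixed pattern, hammer, compare, and keep only reproducible flips) can be sketched as follows. The hammer() callback stands in for the native cache-flush-and-read loop, which cannot be expressed portably in Python, so this sketch injects a deterministic flip instead:

```python
# Sketch of the victim-detection logic described above (illustrative:
# hammer() is a stand-in for the real 10^6-iteration row-activation
# loop, which requires native code).
def find_victims(memory, hammer, pattern=0xFF, rounds=3):
    """Fill memory with a fixed pattern, hammer it `rounds` times, and
    report addresses whose contents changed in every round, i.e. the
    reproducible victim addresses."""
    reproducible = None
    for _ in range(rounds):
        for addr in memory:
            memory[addr] = pattern
        hammer(memory)  # alternating aggressor-row activations
        flipped = {a for a, v in memory.items() if v != pattern}
        reproducible = flipped if reproducible is None else reproducible & flipped
    return sorted(reproducible)

# Simulated DRAM: address 5 has a weak cell that flips under hammering.
mem = {a: 0 for a in range(8)}
def fake_hammer(m):
    m[5] ^= 0x01  # deterministic single-bit flip at the weak cell

assert find_victims(mem, fake_hammer) == [5]
```

Intersecting the flip sets across rounds mirrors the paper's restriction to repeatable flips, since only those are usable by a practical attacker.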
7 Performance Evaluation

Table 4: The benchmarking results for SPEC CPU2006, Phoronix, and LMBench3 indicate that CATT induces no measurable performance overhead. In some cases we observed negative overhead, i.e., performance improvements. However, we attribute such results to measurement inaccuracy.

11 The Phoronix benchmarking suite features a large number of tests …
7.2 Memory Overhead

CATT prevents the operating system from allocating certain physical memory. The memory overhead of CATT is constant and depends solely on the number of memory rows per bank. Per bank, CATT omits one row to provide isolation between the security domains. Hence, the memory overhead is 1/#rows (#rows being the number of rows per bank). While the number of rows per bank is dependent on the system architecture, it is commonly in the order of 2^15 rows per bank, i.e., the overhead is 2^-15 ≈ 0.003%.12

7.3 Robustness

Our mitigation restricts the operating system's access to the physical memory. To ensure that this has no effect on the overall stability, we performed numerous stress tests with the help of the Linux Test Project (LTP) [13]. These tests are designed to stress the operating system to identify problems. We first ran these tests on a vanilla Debian 8.2 installation to obtain a baseline for the evaluation of CATT. We summarize our results in Table 5, and report no deviations for our mitigation compared to the baseline. Further, we also did not encounter any problems during the execution of the other benchmarks. Thus, we conclude that CATT does not affect the stability of the protected system.

8 Discussion

Our prototype implementation targets Linux-based systems. Linux is open source, allowing us to implement our defense. Further, all publicly available rowhammer attacks target this operating system. CATT can be easily ported to memory allocators deployed in other operating systems. In this section, we discuss in detail the generality of our software-based defense against rowhammer. For a detailed discussion of possible extensions and additional policies we refer to our technical report [5].

12 https://lackingrhoticity.blogspot.de/2015/05/how-physical-addresses-map-to-rows-and-banks.html
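The overhead arithmetic above can be checked directly:

```python
# Reproducing the memory-overhead estimate above: one guard row out
# of 2^15 rows per bank.
rows_per_bank = 2**15
overhead = 1 / rows_per_bank          # fraction of memory sacrificed
assert round(overhead * 100, 3) == 0.003  # i.e. about 0.003 %
```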
Ren Ding* Chenxiong Qian* Chengyu Song William Harris Taesoo Kim
Georgia Tech Georgia Tech UC Riverside Georgia Tech Georgia Tech
Wenke Lee
Georgia Tech
* Equal contribution, joint first authors
Abstract

Control-Flow Integrity (CFI), as a means to prevent control-flow hijacking attacks, enforces that each instruction transfers control to an address in a set of valid targets. The security guarantee of CFI thus depends on the definition of valid targets, which conventionally are defined as the result of a static analysis. Unfortunately, previous research has demonstrated that such a definition, and thus any implementation that enforces it, still allows practical control-flow attacks.

In this work, we present a path-sensitive variation of CFI that utilizes runtime path-sensitive points-to analysis to compute the legitimate control transfer targets. We have designed and implemented a runtime environment, PITTYPAT, that enforces path-sensitive CFI efficiently by combining commodity, low-overhead hardware monitoring and a novel runtime points-to analysis. Our formal analysis and empirical evaluation demonstrate that, compared to CFI based on static analysis, PITTYPAT ensures that applications satisfy stronger security guarantees, with acceptable overhead for security-critical contexts.

1 Introduction

Attacks that compromise the control flow of a program, such as return-oriented programming [33], have critical consequences for the security of a computer system. Control-Flow Integrity (CFI) [1] has been proposed as a restriction on the control-flow transfers that a program should be allowed to take at runtime, with the goals of both ruling out control-flow hijacking attacks and being enforced efficiently.

A CFI implementation can be modeled as a program rewriter that (1) before a target program P is executed, determines feasible targets for each indirect control transfer location in P, typically done by performing an analysis that computes a sound over-approximation of the set of all memory cells that may be stored in each code pointer (i.e., a static points-to analysis [2, 34]). The rewriter then (2) rewrites P to check at runtime before performing each indirect control transfer that the target is allowed by the static analysis performed in step (1).

A significant body of work [1, 21, 41] has introduced approaches to implement step (2) for a variety of execution platforms and perform it more efficiently. Unfortunately, the end-to-end security guarantees of such approaches are founded on the assumption that if an attacker can only cause a program to execute control branches determined to be feasible by step (1), then critical application security will be preserved. However, recent work has introduced new attacks that demonstrate that such an assumption does not hold in practice [5, 12, 32]. The limitations of existing CFI solutions in blocking such attacks are inherent to any defense that uses static points-to information computed per control location in a program. Currently, if a developer wants to ensure that a program only chooses valid control targets, they must resort to ensuring that the program satisfies data integrity, a significantly stronger property whose enforcement typically incurs prohibitively large overhead and/or has deployment issues, such as requiring the protected program to be recompiled together with all dependent libraries, and cannot be applied to programs that perform particular combinations of memory operations [17, 22–24].

In this work, we propose a novel, path-sensitive variation of CFI that is stronger than conventional CFI (i.e., CFI that relies on static points-to analysis). A program satisfies path-sensitive CFI if each control transfer taken by the program is consistent with the program's entire executed control path. Path-sensitive CFI is a stronger security property than conventional CFI, both in principle and in practice. However, because it does not place any requirements on the correctness of data operations, which happen much more frequently, it can be enforced much more efficiently than data integrity. To demonstrate this, we present a runtime environment, named PITTYPAT, that enforces path-sensitive CFI efficiently using a combination of commodity, low-overhead hardware monitoring and a novel runtime points-to analysis.
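The two-step rewriter model above can be sketched as follows; the callsite and function names are invented for illustration, and the static target table stands in for the result of step (1):

```python
# Minimal sketch of the two CFI steps described above (hypothetical
# program representation, not PITTYPAT's implementation).
STATIC_TARGETS = {                 # step (1): callsite -> valid targets,
    "callsite_A": {"func_x", "func_y"},  # a sound over-approximation
}

def checked_call(callsite, target, table):
    """Step (2): validate the runtime target before transferring."""
    if target not in table.get(callsite, set()):
        raise RuntimeError("CFI violation: control-flow hijack blocked")
    return target  # in real instrumentation, the call would now proceed

# A statically feasible target passes; a hijacked target is blocked.
assert checked_call("callsite_A", "func_x", STATIC_TARGETS) == "func_x"
try:
    checked_call("callsite_A", "evil_gadget", STATIC_TARGETS)
    assert False, "violation not caught"
except RuntimeError:
    pass
```

The attacks cited above succeed precisely because the per-location set in STATIC_TARGETS can legitimately contain more than one target, all of which this check must allow.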
[Figure 3: Control-transfer targets allowed by π-CFI and PITTYPAT over 403.gcc and 444.namd. (a) Partial CDF of allowed targets on forward edges taken by 403.gcc. (b) π-CFI points-to set of backward edges taken by 403.gcc. (c) π-CFI and PITTYPAT points-to sets for forward edges taken by 444.namd. (d) π-CFI points-to sets for backward edges taken by 444.namd. Each panel plots points-to set size over indirect-branch or return steps.]
explained below. Figure 3a shows that PITTYPAT can consistently maintain significantly smaller points-to sets for forward edges than π-CFI, leading to a stronger security guarantee. Figure 3a indicates that when protecting practical programs, an approach such as π-CFI that validates per location allows a significant number of transfer targets at each indirect callsite, even using dynamic information. In comparison, PITTYPAT uses the entire history of branches taken to determine that, at the vast majority of callsites, only a single address is a valid target. The difference in the number of allowed targets can be explained by the different heuristics adopted in π-CFI, which monotonically accumulates allowed points-to targets without any disabling schemes once targets are taken, and the precise, context-sensitive points-to analysis implemented in PITTYPAT. A similar difference between π-CFI and PITTYPAT can also be found in all other C benchmarks from SPEC CPU2006.

For the remaining 4% of transfers not included in Figure 3a, both π-CFI and PITTYPAT allowed up to 218 transfer targets; for each callsite, PITTYPAT allowed no more targets than π-CFI. The targets at such callsites are loaded from vectors and arrays of function pointers, which PITTYPAT's current points-to analysis does not reason about precisely. It is possible that future work on a points-to analysis specifically designed for reasoning precisely about such data structures over a single path of execution (a context not introduced by any previous work on program analysis for security) could produce significantly smaller points-to sets.

A similar difference between π-CFI and PITTYPAT is demonstrated by the number of transfer targets allowed for other benchmarks. In particular, Figure 3c contains similar data for the 444.namd benchmark. 444.namd, a C++ program, contains many calls to functions loaded from vtables, a source of imprecision for implementations of CFI that can be exploited by attackers [32]. PITTYPAT allows a single transfer target for all forward edges as
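The contrast between π-CFI's accumulation heuristic and a path-sensitive check can be made concrete with a toy model; both classes below are illustrative models, not the evaluated implementations:

```python
# Toy contrast between the two policies discussed above.
class PiCFIStyle:
    """pi-CFI-style lazy activation: a statically valid target becomes
    allowed on first use and is never disabled afterwards."""
    def __init__(self, static_targets):
        self.static = static_targets          # callsite -> static set
        self.active = {}
    def check(self, site, target):
        if target not in self.static.get(site, set()):
            return False
        self.active.setdefault(site, set()).add(target)
        return True                           # stays allowed forever

class PathSensitive:
    """Path-sensitive check: the single valid target is recomputed
    from the entire executed branch history."""
    def __init__(self, resolver):
        self.resolver = resolver              # (path, site) -> target
    def check(self, path, site, target):
        return target == self.resolver(path, site)

pi = PiCFIStyle({"cs1": {"f", "g"}})
assert pi.check("cs1", "f") and pi.check("cs1", "g")  # both accumulate

ps = PathSensitive(lambda path, site: "g" if path[-1] == "b1" else "f")
assert ps.check(["b0", "b1"], "cs1", "g")      # valid on this path
assert not ps.check(["b0", "b2"], "cs1", "g")  # rejected on another
```

The monotonic set in PiCFIStyle mirrors why π-CFI's per-callsite set only grows over the run, while the resolver in PathSensitive mirrors how branch history can pin each callsite down to a single target.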
Table 2: "Name" contains the name of the benchmark. "KLoC" contains the number of lines of code in the benchmark. Under "Payload Features," "Exp" shows if the benchmark contains an exploit and "Tm (sec)" contains the amount of time used by the program, when given the payload. Under "π-CFI Features," "PITTYPAT Features," and "CETS+SB Features," "Alarm" contains a flag denoting if a given framework determined that the payload was an attack and aborted; "Overhd (%)" contains the time taken by the framework, expressed as the ratio over the baseline time.
temporal memory safety and SoftBound provides spatial memory safety; both enforce full data integrity for C benchmarks, which entails control security. However, both approaches induce significant overhead, and cannot be applied to programs that perform particular combinations of memory-unsafe operations [17]. Our results thus indicate a continuous tradeoff between security and performance among existing CFI solutions, PITTYPAT, and data protection. PITTYPAT offers control security that is close to ideal, i.e., what would result from data integrity, but with a small percentage of the overhead of data-integrity protection.

7 Related Work

The original work on CFI [1] defined control-flow integrity in terms of the results of a static, flow-sensitive points-to analysis. A significant body of work has adapted the original definition for complex language features and developed sophisticated implementations that enforce it. While CFI is conventionally enforced by validating the target of a control transfer before the transfer, control-flow locking [3] validates the target after the transfer to enable more efficient use of system caches. Compact Control Flow Integrity and Randomization (CCFIR) [41] optimizes the performance of validating a transfer target by randomizing the layout of allowable transfer targets at each jump. Opaque CFI (O-CFI) [21] ensures that an attacker who can inspect the rewritten code cannot learn additional information about the targets of control jumps that are admitted as valid by the rewritten code. All of the above approaches enforce security defined by the results of a flow-sensitive points-to analysis; previous work has produced attacks [5, 12, 32] that are allowed by any approach that relies on such information. PITTYPAT is distinct from all of the above approaches because it computes and uses the results of a points-to analysis computed for the exact control path executed. As a result, it successfully detects known attacks, such as COOP [32] (see §6.3.2).

Previous work has explored the tradeoffs of implementing CFI at distinct points in a program's lifecycle. CFRestrictor [30] performs CFI analysis and instrumentation completely at the source level in an instrumenting compiler, and further work developed CFI integrated into production compilers [36]. BinCFI [42] implements CFI without access to the program source, but only access to a stripped binary. Modular CFI [25] implements CFI for programs constructed from separate compilation units. Unlike each of the above approaches, PITTYPAT consists of a background process that performs an online analysis of the program path executed.

Recent work on control-flow bending has established limitations on the security of any framework that enforces only conventional CFI [5], and proposes that future work explore CFI frameworks that validate branch targets using an auxiliary structure, such as a shadow stack. The conclusions of work on control-flow bending are strongly consistent with the motivation of PITTYPAT: the key contribution of PITTYPAT is that it enforces path-sensitive CFI, provably stronger than conventional CFI, and does so not only by maintaining a shadow stack of points-to information
[3] Bletsch, T., Jiang, X., and Freeh, V. Mitigating code-reuse attacks with control-flow locking. In ACSAC (2011).

[4] Burow, N., Carr, S. A., Brunthaler, S., Payer, M., Nash, J., Larsen, P., and Franz, M. Control-flow integrity: Precision, security, and performance. arXiv preprint arXiv:1602.04056 (2016).

[5] Carlini, N., Barresi, A., Payer, M., Wagner, D., and Gross, T. R. Control-flow bending: On the effectiveness of control-flow integrity. In USENIX Security (2015).

[6] Carlini, N., and Wagner, D. ROP is still dangerous: Breaking modern defenses. In USENIX Security (2014).

[7] Cheng, Y., Zhou, Z., Yu, M., Ding, X., and Deng, R. H. ROPecker: A generic and practical approach for defending against ROP attacks. In NDSS (2014).

[8] Cousot, P., and Cousot, R. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In POPL (1977).

[9] Davi, L., Koeberl, P., and Sadeghi, A.-R. Hardware-assisted fine-grained control-flow integrity: Towards efficient protection of embedded systems against software exploitation. In DAC (2014).

[10] Emami, M., Ghiya, R., and Hendren, L. J. Context-sensitive interprocedural points-to analysis in the presence of function pointers. In PLDI (1994).

[11] Evans, I., Fingeret, S., González, J., Otgonbaatar, U., Tang, T., Shrobe, H., Sidiroglou-Douskos, S., Rinard, M., and Okhravi, H. Missing the point(er): On the effectiveness of code pointer integrity. In SP (2015).

[12] Evans, I., Long, F., Otgonbaatar, U., Shrobe, H., Rinard, M., Okhravi, H., and Sidiroglou-Douskos, S. Control jujutsu: On the weaknesses of fine-grained control flow integrity. In CCS (2015).

[13] Ge, X., Cui, W., and Jaeger, T. Griffin: Guarding control flows using Intel Processor Trace. In ASPLOS (2017).

[14] Gu, Y., Zhao, Q., Zhang, Y., and Lin, Z. PT-CFI: Transparent backward-edge control flow violation detection using Intel Processor Trace. In CODASPY (2017).

[17] Kuznetsov, V., Szekeres, L., Payer, M., Candea, G., Sekar, R., and Song, D. Code-pointer integrity. In OSDI (2014).

[18] Lattner, C., Lenharth, A., and Adve, V. Making context-sensitive points-to analysis with heap cloning practical for the real world. In PLDI (2007).

[19] Liu, Y., Shi, P., Wang, X., Chen, H., Zang, B., and Guan, H. Transparent and efficient CFI enforcement with Intel Processor Trace. In HPCA (2017).

[20] The LLVM compiler infrastructure project. http://llvm.org/, 2016. Accessed: 2016 May 12.

[21] Mohan, V., Larsen, P., Brunthaler, S., Hamlen, K. W., and Franz, M. Opaque control-flow integrity. In NDSS (2015).

[22] Nagarakatte, S., Zhao, J., Martin, M. M., and Zdancewic, S. SoftBound: Highly compatible and complete spatial memory safety for C. In PLDI (2009).

[23] Nagarakatte, S., Zhao, J., Martin, M. M., and Zdancewic, S. CETS: Compiler enforced temporal safety for C. In ISMM (2010).

[24] Necula, G. C., McPeak, S., and Weimer, W. CCured: Type-safe retrofitting of legacy code. In PLDI (2002).

[25] Niu, B., and Tan, G. Modular control-flow integrity. In PLDI (2014).

[26] Niu, B., and Tan, G. Per-input control-flow integrity. In CCS (2015).

[27] Nystrom, E. M., Kim, H.-S., and Wen-mei, W. H. Bottom-up and top-down context-sensitive summary-based pointer analysis. In International Static Analysis Symposium (2004).

[28] One, A. Smashing the stack for fun and profit. Phrack Magazine 7, 49 (1996).

[29] Pappas, V., Polychronakis, M., and Keromytis, A. D. Transparent ROP exploit mitigation using indirect branch tracing. In USENIX Security (2013).

[30] Pewny, J., and Holz, T. Control-flow restrictor: Compiler-based CFI for iOS. In ACSAC (2013).

[31] Reinders, J. Processor tracing. Blogs@Intel. https://blogs.intel.com/blog/processor-tracing/, 2013. Accessed: 2016 May 12.
[37] van der Veen, V., Andriesse, D., Göktaş, E., Gras, B., Sambuc, L., Slowinska, A., Bos, H., and Giuffrida, C. Practical context-sensitive CFI. In CCS (2015).

[38] Whaley, J., and Lam, M. S. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In PLDI (2004).

[39] Wilander, J., Nikiforakis, N., Younan, Y., Kamkar, M., and Joosen, W. RIPE: Runtime intrusion prevention evaluator. In ACSAC (2011).

[40] Xia, Y., Liu, Y., Chen, H., and Zang, B. CFIMon: Detecting violation of control flow integrity using performance counters. In DSN (2012).

[41] Zhang, C., Wei, T., Chen, Z., Duan, L., Szekeres, L., McCamant, S., Song, D., and Zou, W. Practical control flow integrity and randomization for binary executables. In SP (2013).

[42] Zhang, M., and Sekar, R. Control flow integrity for COTS binaries. In USENIX Security (2013).

[43] Zhu, J., and Calman, S. Symbolic pointer analysis revisited. In PLDI (2004).

A.1 Syntax

Figure 5 contains the syntax of a space of program instructions, Instrs. An instruction may compute the value of an operation in ops over values stored in registers and store the result in a register, may allocate a fresh memory cell (Eqn. 1), may load a value stored in the address in one operand register into a target register, may store a value in an operand register at the address stored in a target register (Eqn. 2), may test if the value in a register is non-zero and if so transfer control to an instruction at the address stored in an operand register, may perform an indirect call to a target address stored in an operand, or may return from a call (Eqn. 3). Although all operations are assumed to be binary, when convenient we will depict operations as using fewer registers (e.g., a copy instruction copy r0,r1 in §4.2). A program is a map from instruction addresses to instructions. That is, for a space of instruction addresses IAddrs containing a designated initial address ι ∈ IAddrs, the language of programs is Lang = IAddrs → Instrs.

Instrs does not contain instructions similar to those in an architecture with a complex instruction set, which may, e.g., perform operations directly on memory. The design of PITTYPAT directly generalizes to analyze programs that use such an instruction set. In particular, the actual implementation of PITTYPAT monitors programs compiled for x86.
A.2 Semantics
Each program P ∈ Lang defines a language of sequences of
program states, called runs, that are generated by executing a
sequence of instructions in P from an initial state. In particular,
each program P defines two languages of runs. The first is
the language of well-defined runs, in which each step from the
current state is defined by the semantics of the next instruction
in P. The second is the language of feasible runs, which may contain some state q from which P executes an instruction that is not defined at q (e.g., dereferencing an invalid address). When the successor state of q is not defined and the program takes a step of execution,
the program may potentially perform an operation that subverts
security.
B Formal definition of points-to analysis

A control analysis takes a program P and computes a sound over-approximation of the instruction pointers that may be stored in

Example 1 For program dispatch (§2.1) and any control domain D that maps each instruction pointer to a set of instruction addresses, the most precise description of dispatch restricted to function pointers is given in §2.2.
Definition 5 For each program P ∈ Lang and control domain D ∈ Doms, let ρ ∈ TransRels be such that for all instruction addresses a, a′ ∈ Addrs, each instruction i ∈ Instrs with τ[D](µ[D, P](a), i, a′) ≠ None[D] and all states q, q′ ∈ States with (µ[D, P](a), q) and (µ[D, P](a′), q′), it holds that ((q, i), q′) ∈ ρ. Then ρ is the flow-sensitive transition relation of D and P.

For each domain D and program P, the flow-sensitive transition relation of D and P is denoted FS[D, P].

For each control domain D and program P, the most precise flow-sensitive description of P in D (Appendix D) defines an instance of generalized control security that is equivalent to CFI [1].

Definition 6 For all programs P, P′ ∈ Lang and each control-analysis domain D ∈ Doms, if P′ satisfies generalized control security under FS[D, P] (Appendix D, Defn. 5) with respect to P, then P′ satisfies CFI modulo D with respect to P.

Defn. 6 is equivalent to "ideal" CFI as defined in previous work to establish fundamental limitations on CFI [5].

C.2 Path-sensitive CFI

The problem of enforcing CFI is typically expressed as instrumenting a given program P to form a new program P′ that allows each indirect control transfer in each of its executions only if the target of the transfer is valid according to a flow-sensitive description of the control-flow graph of P. To present our definition of path-sensitive CFI, we will introduce a general definition of control security parameterized on a given transition relation ρ. P′ satisfies generalized control security under ρ with respect to P if (1) P′ preserves each well-defined run of P and (2) each feasible run of P′ has instruction addresses identical to the instruction addresses of some run of P under ρ.

Definition 7 For each transition relation ρ ∈ TransRels, let programs P, P′ ∈ Lang be such that (1) Runs[WellDef, P] ⊆ Runs[WellDef, P′]; (2) for each run r′ ∈ Runs[Feasible, P′], there is some run r ∈ Runs[ρ, P] such that IPs(r) = IPs(r′). Then P′ satisfies generalized control security under ρ with respect to P.

We now define path-sensitive CFI, an instance of generalized control security that is strictly stronger than CFI. Each control domain D defines a transition relation over program states that are described by abstract states of D connected by the abstract transformer of D.

Definition 8 For each control domain D ∈ Doms (§3.2, Defn. 3), let ρ[D] ∈ TransRels be such that for each abstract state a ∈ A[D], each state q ∈ States such that (a, q) ∈ γ[D], and each instruction i ∈ Instrs and state q′ ∈ States such that (τ[D](a, i, ip[q′]), q′) ∈ γ[D], it holds that (q, i, q′) ∈ ρ[D]. Then ρ[D] is the transition relation modulo D.

Definition 9 For all programs P, P′ ∈ Lang and each control domain D ∈ Doms, if P′ satisfies control security under ρ[D] (Defn. 8) with respect to P, then P′ satisfies path-sensitive CFI modulo D with respect to P.

Path-sensitive CFI is conceptually similar to, but stronger than, context-sensitive CFI [37], which places a condition on only bounded suffixes of a program's control path before the program attempts to execute a critical security event, such as a system call.

Path-sensitive CFI is as strong as CFI.

Lemma 1 For each control domain D and all programs P, P′ ∈ Lang such that P′ satisfies path-sensitive CFI modulo D with respect to P, P′ satisfies CFI modulo D with respect to P.

Lemma 1 follows immediately from the fact that any control-transfer target that is along a given control path must be a valid target in a meet-over-all-paths solution.

Path-sensitive CFI is in fact strictly stronger than CFI.

Lemma 2 For some control domain D and programs P, P′ ∈ Lang, P′ satisfies CFI with respect to P modulo D but P′ does not satisfy path-sensitive CFI modulo D with respect to P.

Lemma 2 is immediately proven using any domain D that is sufficiently accurate between two control states and a program P that generates state with either control configuration at a particular program point.

D Formal definition of online analysis

The behavior of the analyzer module is determined by a fixed control-analysis domain D (§3.2, Defn. 3). We refer to PITTYPAT instantiated to use control domain D for points-to analysis as PITTYPAT[D].

As the analyzer module executes, it maintains a control-domain abstract state d ∈ A[D]. In each step of execution, the analyzer module receives from the monitor process the next control-transfer target taken by the monitored program P, and either chooses to raise an alarm that transferring control to the target would cause P to break path-sensitive CFI modulo D, or updates its state and allows P to take its next step of execution.

In each step of execution, the analyzer module receives the next control target a ∈ IAddrs taken by the monitored program, and either raises an alarm or updates its maintained control description d as a result. If a is not a feasible target from d over the next sequence of non-branch instructions, then the analyzer module throws an alarm signaling that control flow has been subverted, and aborts.

Theorem 1 For D ∈ Doms and P ∈ Lang, the program P′ simulated by running P in PITTYPAT[D] satisfies path-sensitive CFI modulo D with respect to P (Defn. 9).
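The analyzer loop described above can be sketched with a toy control domain; the abstract state and transformer below are stand-ins for A[D] and τ[D], not PITTYPAT's implementation:

```python
# Sketch of the analyzer-module loop described above, over a toy
# control domain: the abstract state is the tuple of targets taken so
# far, and the transformer yields the set of feasible next targets.
def analyzer(transfer_stream, transformer, initial_state):
    """For each control-transfer target reported by the monitor,
    either raise an alarm (path-sensitive CFI violated modulo the
    domain) or update the abstract state and continue."""
    d = initial_state
    for target in transfer_stream:
        feasible = transformer(d)   # targets feasible from state d
        if target not in feasible:
            raise RuntimeError(f"alarm: infeasible target {target}")
        d = (d, target)             # extend the abstract description
    return d

# Toy transformer: from any state, only even addresses are feasible.
trans = lambda d: {a for a in range(10) if a % 2 == 0}
assert analyzer([2, 4, 6], trans, ())[-1] == 6
```

An odd target in the stream models a subverted transfer and makes the loop abort, mirroring the alarm-and-abort behavior described above.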
Concretization relation γ ⊆ A × States relates each stack and memory of points-to information to each concrete state with a similarly structured stack and heap. For each n ∈ ℕ, let a_0, …, a_n ∈ IAddrs, R_0, …, R_n ∈ RegMaps, and R′_0, …, R′_n ∈ RegPtsMaps be such that for each i ≤ n and each register r ∈ Regs, if R_i(r) ∈ Addrs, then R_i(r) ∈ R′_i(r). Let m ∈ Mems and m′ ∈ CellPtsTo be such that for each cell c ∈ Cells, m(c) ∈ m′(c). Then:

(([(i_0, R′_0), …, (i_n, R′_n)], m′), ([(i_0, R_0), …, (i_n, R_n)], m)) ∈ γ

τ(((a, R) :: F, m), ld r0,r1, a′) = ((a′, R[r1 ↦ ⋃_{c ∈ R(r0)} m(c)]) :: F, m)

The abstract transformers for other instructions, such as data operations that perform pointer arithmetic, are defined similarly, and we do not give explicit definitions here in order to simplify the presentation.
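The ld transformer above can be transcribed directly over explicit points-to sets; the dict-based register and cell maps are an illustrative encoding, not the paper's data structures:

```python
# The ld abstract transformer above, over explicit points-to sets:
# tau(((a, R) :: F, m), ld r_dst, r_src, a') =
#     ((a', R[r_dst -> U_{c in R(r_src)} m(c)]) :: F, m)
def transfer_ld(frame_stack, mem, r_dst, r_src, next_addr):
    (addr, regs), rest = frame_stack[0], frame_stack[1:]
    pts = set()
    for cell in regs[r_src]:       # every cell r_src may point to
        pts |= mem.get(cell, set())
    new_regs = dict(regs)
    new_regs[r_dst] = pts          # r_dst now points to the union
    return [(next_addr, new_regs)] + rest, mem

regs = {"r0": {"c1", "c2"}, "r1": set()}
mem = {"c1": {"f"}, "c2": {"g"}}
stack, _ = transfer_ld([("a0", regs)], mem, "r1", "r0", "a1")
assert stack[0][0] == "a1" and stack[0][1]["r1"] == {"f", "g"}
```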
[Figure: hypervisor components (CR3 update, #VMEXIT, Set MTF/TF).]
TOCTTOU vulnerabilities. The middleware recorded the behavior characteristics into log files to help locate vulnerabilities.

Detecting UNPROBE. Taking a vulnerability in Avast 11.2.2262 as an example, the following data were obtained through the log analyzer from Digtool's log file:

NtAllocateVirtualMemory:
Eip: 89993f3d, Address: 0023f304, rw: R
Eip: 84082ed9, Address: 0023f304, PROBE! KiFastSystemCallRet

aswSP.sys, the Avast driver program, used the instruction at the address 0x89993f3d to fetch the value from the user address (i.e., 0x23f304) without checking it. The subsequent checking instruction at the address 0x84082ed9 belonged to the NtAllocateVirtualMemory function. Therefore, there was a typical UNPROBE vulnerability in aswSP.sys.

Using Digtool, 23 similar vulnerabilities were found in the five anti-virus software programs tested. The results are shown in Table 1. For security reasons, we only give the system calls for which vulnerabilities exist.

When the log analyzer points out a potential UNPROBE vulnerability, and the tested driver only uses the

The user address 0x3b963c was accessed by the kernel instructions more than once, so there may be a TOCTTOU vulnerability. dwprot.sys, the Dr. Web driver program, used the instruction at the address 0x89370d54 to fetch the value from the user address (i.e., 0x3b963c), and then it invoked the ProbeForRead function to check it. At the address 0x89370d7b, dwprot.sys fetched the value again in order to use it. Therefore, there was a typical TOCTTOU vulnerability in dwprot.sys.

With the help of Digtool, 18 kernel-level TOCTTOU vulnerabilities were found in the five anti-virus software programs tested. The results are shown in Table 2. For security reasons, we only give the system calls for which vulnerabilities exist.

Digtool may produce false positives, because it flags TOCTTOU vulnerabilities by detecting double fetches; further manual analysis is needed to confirm that a double fetch is actually an exploitable TOCTTOU vulnerability.

5.1.2 Detecting Vulnerabilities via Memory Footprints

We chose 32-bit Windows 10 as a test target. Some zero-day vulnerabilities discovered by Digtool were selected
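The double-fetch heuristic described above can be sketched as a pass over per-syscall log records: a user-address fetch with no preceding probe is an UNPROBE candidate, and a user address fetched more than once within one system call is a TOCTTOU candidate. The record layout and field names below are illustrative stand-ins, not Digtool's actual log format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { EV_FETCH, EV_PROBE } event_t;

typedef struct {
    uint32_t eip;   /* kernel instruction performing the access */
    uint32_t addr;  /* user-mode address accessed */
    event_t  ev;    /* fetch of the value, or a ProbeForRead-style check */
} log_rec;

/* Scan one system call's records: a fetch of `addr` that precedes any
   probe of `addr` is an UNPROBE candidate. */
static int has_unprobe(const log_rec *log, size_t n, uint32_t addr) {
    for (size_t i = 0; i < n; i++) {
        if (log[i].addr != addr) continue;
        if (log[i].ev == EV_PROBE) return 0; /* probed before first fetch */
        if (log[i].ev == EV_FETCH) return 1; /* fetched without a check */
    }
    return 0;
}

/* Two or more fetches of the same user address => TOCTTOU candidate. */
static int has_double_fetch(const log_rec *log, size_t n, uint32_t addr) {
    int fetches = 0;
    for (size_t i = 0; i < n; i++)
        if (log[i].addr == addr && log[i].ev == EV_FETCH)
            fetches++;
    return fetches >= 2;
}
```

As the text notes, either flag is only a candidate; manual analysis is still needed to confirm an actual vulnerability.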
[Figure: performance comparison, on a logarithmic scale, across eleven benchmarks (NtCreateMutant, NtCreateSection, NtLoadDriver, NtAlpcConnectPort, NtOpenEvent, NtCreateTimer, NtCreateKey, NtMapViewOfSection, NtWriteVirtualMemory, NtFreeVirtualMemory, WinRAR). Measured values, in the figure's column order:
Windows: 112.4, 109.5, 128.2, 125, 156, 469, 719, 115.4, 112.4, 300.2, 147
Digtool (unrecorded): 565.4, 547, 584.4, 575.2, 609.4, 1578, 1565.4, 562.6, 565.6, 750, 331.4
Digtool (recorded): 9893.2, 9226.7, 20398.3, 2052, 16771, 10101.7, 24969, 8987, 8187.2, 22997.7, 1977.8
Bochs: 34398.3, 31844.2, 48943.6, 36572.5, 53940.2, 71573, 244921.8, 33296.8, 32573, 103036.5, 142104.3]
[29] Amatul Mohosina and Mohammad Zulkernine. DESERVE: A framework for detecting program security vulnerability exploitations. In 2012 IEEE Sixth International Conference on Software Security and Reliability (SERE), pages 98-107. IEEE, 2012.

[30] George C. Necula, Scott McPeak, and Westley Weimer. CCured: Type-safe retrofitting of legacy code. In ACM SIGPLAN Notices, volume 37, pages 128-139. ACM, 2002.

[31] Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In ACM SIGPLAN Notices, volume 42, pages 89-100. ACM, 2007.

[41] Felix Wilhelm. Tracing privileged memory accesses to discover software vulnerabilities. 2015.

[42] Junyuan Zeng, Yangchun Fu, and Zhiqiang Lin. PEMU: A Pin highly compatible out-of-VM dynamic binary instrumentation framework. In Proceedings of the 11th Annual International Conference on Virtual Execution Environments, Istanbul, Turkey, March 2015.

[43] Junyuan Zeng and Zhiqiang Lin. Towards automatic inference of kernel object semantics from binary code. In International Symposium on Recent Advances in Intrusion Detection, pages 538-561. Springer, 2015.
to the logic for further processing (step 8). Afterwards, the agent can continue to run any untraced clean-up routines before issuing another HC_GET_INPUT to start the next loop iteration.

3.1 Fuzzing Logic

The fuzzing logic is the command and control component of kAFL. It manages the queue of interesting inputs, creates mutated inputs, and schedules them for evaluation. In most aspects, it is based on the algorithms used by AFL. Similarly to AFL, we use a bitmap to store basic block transitions. We gather the AFL bitmap from the VMs through an interface to QEMU-PT and decide which inputs triggered interesting behaviour. The fuzzing logic also coordinates the number of VMs spawned in parallel. One of the bigger design differences to AFL is that kAFL makes extensive use of multiprocessing and parallelism, whereas AFL simply spawns multiple independent fuzzers which synchronize their input queues sporadically1. In contrast, kAFL executes the deterministic stage in parallel, and all threads work on the most interesting input. A significant amount of time is spent in tasks that are not CPU-bound (such as guests that delay execution). Therefore, using many parallel processes (up to 5-6 per CPU core) drastically improves performance of the fuzzing process due to a higher CPU load per core. Lastly, the fuzzing logic communicates with the user interface to display current statistics at regular intervals.

1 AFL recently added experimental support for distributing the deterministic stage, see https://github.com/mirrorer/afl/blob/master/docs/parallel_fuzzing.txt#L60-L66.

3.2 User Mode Agent

We expect a user mode agent to run inside the virtualized target OS. In principle, this component only has to synchronize with and gather new inputs from the fuzzing logic via the hypercall interface and use them to interact with the guest's kernel. Example agents are programs that try to mount inputs as file system images, pass specific files such as certificates to kernel parsers, or even execute a chain of various syscalls.

In theory, we only need one such component. In practice, we use two different components: the first program is the loader component. Its job is to accept an arbitrary binary via the hypercall interface. This binary represents the user mode agent and is executed by the loader component. Additionally, the loader component will check whether the agent has crashed (which happens often in the case of syscall fuzzing) and restart it if necessary. This setup has the advantage that we can pass any binary to the VM and reuse VM snapshots for different fuzzing components.

3.3 Virtualization Infrastructure

The fuzzing logic uses QEMU-PT to interact with KVM-PT to spawn the target VMs. KVM-PT allows us to trace individual vCPUs instead of logical CPUs. This component configures and enables Intel PT on the respective logical CPU before the CPU switches to guest execution, and disables tracing during the VM-Exit transition. This way, the associated CPU will only provide trace data of the virtualized kernel itself. QEMU-PT is used to interact with the KVM-PT interface to configure and toggle Intel PT from user space and access the output buffer to decode the trace data. The
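The agent's fuzzing loop described above — block on HC_GET_INPUT, interact with the guest kernel, then hand control back — can be sketched as follows. HC_GET_INPUT is the hypercall named in the text; the HC_RELEASE command and the hypercall signature here are hypothetical placeholders, since the real kAFL hypercall ABI (issued via vmcall) is not spelled out in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Command numbers are illustrative, not kAFL's actual ABI. */
enum { HC_GET_INPUT = 1, HC_RELEASE = 2 };

typedef size_t (*hypercall_fn)(int cmd, uint8_t *buf, size_t len);

/* Test double standing in for the real vmcall-based hypercall. */
static size_t stub_hypercall(int cmd, uint8_t *buf, size_t len) {
    if (cmd == HC_GET_INPUT && len >= 4) { buf[0] = 0x7b; return 4; }
    return 0;
}

/* Run `iterations` fuzzing-loop iterations; returns total input bytes received. */
static size_t agent_loop(hypercall_fn hc, uint8_t *buf, size_t len,
                         int iterations) {
    size_t total = 0;
    for (int i = 0; i < iterations; i++) {
        size_t n = hc(HC_GET_INPUT, buf, len); /* block until input arrives */
        /* ... a real agent would now feed buf[0..n) to the kernel
               interface under test (mount it, parse it, syscall on it) ... */
        total += n;
        hc(HC_RELEASE, NULL, 0); /* end of iteration; untraced clean-up may run */
    }
    return total;
}
```

Abstracting the hypercall behind a function pointer is purely for illustration; in the guest the call compiles down to a single privileged instruction.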
such large amounts of incoming data, the decoder must be implemented with a focus on efficiency. Otherwise, the decoder may become the major bottleneck during the fuzzing process. Nevertheless, the decoder must also be precise, as inaccuracies during the decoding process would result in further errors. This is due to the nature of Intel PT decoding, since the decoding process is sequential and is affected by previously decoded packets.

To ease efforts to implement an Intel PT software decoder, Intel provides its own decoding engine called libipt [4]. libipt is a general-purpose Intel PT decoding engine. However, it does not fit our purposes very well, because libipt decodes trace data in order to provide execution data and flow information. Furthermore, libipt does not cache disassembled instructions and has performed poorly in our use cases.

Since kAFL only relies on flow information and the fuzzing process is repeatedly applied to the same code, it is possible to optimize the decoding process. Our Intel PT software decoder acts like a just-in-time decoder, which means that code sections are only considered if they are executed according to the decoded trace data. To optimize further look-ups, all disassembled code sections are cached. In addition, we simply ignore packets that are not relevant for our use case.

Since our PT decoder is part of QEMU-PT, trace data is directly processed if the ToPA base region is filled. The decoding process is applied in-place, since the buffer is directly accessible from user space via mmap(). Unlike other Intel PT drivers, we do not need to store large amounts of trace data in memory or on storage devices for post-mortem decoding. Eventually, the decoded trace data is translated to the AFL bitmap format.

4.3 AFL Fuzzing Logic

We give a brief description of the fuzzing parts of AFL, because the logic we use to perform scheduling and mutations closely follows that of AFL. The most important aspect of AFL is the bitmap used to trace which basic block transitions were encountered. Each basic block has a randomly assigned ID, and each transition from basic block A to another basic block B is assigned an offset into the bitmap according to the following formula:

(id(A)/2 ⊕ id(B)) % SIZE_OF_BITMAP

Instead of the compile-time random, kAFL uses the addresses of the basic blocks. Each time the transition is observed, the corresponding byte in the bitmap is incremented. After finishing the fuzzing iteration, each entry of the bitmap is rounded such that only the highest bit remains set. Then the bitmap is compared with the global static bitmap to see if any new bit was found. If a new bit was found, it is added to the global bitmap and the input that triggered the new bit is added to the queue. When a new interesting input is found, a deterministic stage is executed that tries to mutate each byte individually.

Once the deterministic stage is finished, the non-deterministic phase is started. During this non-deterministic phase, multiple mutations are performed at random locations. If the deterministic phase finds new inputs, the non-deterministic phase will be delayed until all deterministic phases of all interesting inputs have been performed. If an input triggers an entirely new transition (as opposed to a change in the number of times the transition was taken), it will be favored and fuzzed with a higher priority.

5 Evaluation

Based on our implementation, we now describe the different fuzzing campaigns we performed to evaluate kAFL. We evaluate kAFL's fuzzing performance across different platforms. Section 5.5 provides an overview of all reported vulnerabilities, crashes, and bugs that were found during the development process of kAFL. We also evaluate kAFL's ability to find a previously known vulnerability. Finally, in Section 5.6 the overall fuzzing performance of kAFL is compared to ProjectTriforce, the only other OS-independent feedback fuzzer available. TriforceAFL is based on the emulation backend of QEMU instead of hardware-assisted virtualization and Intel PT. The performance overhead of KVM-PT is discussed in Section 5.7. Additionally, a performance comparison of our PT decoder and an Intel implementation of a software decoder is given in Section 5.8.

If not stated otherwise, the benchmarks were performed on a desktop system with an Intel i7-6700 processor and 32GB DDR4 RAM. To avoid distortions due
• macOS: APFS Memory Corruption10

Red Hat has assigned a CVE number for the first reported security flaw, which triggers a null pointer dereference and a partial memory corruption in the kernel ASN.1 parser if an RSA certificate with a zero exponent is presented. For the second reported vulnerability, which triggers a memory corruption in the ext4 file

We performed five repeated experiments for each operating system. Additionally, we tested TriforceAFL with the Linux target driver. TriforceAFL is unable to fuzz Windows and macOS. To compare TriforceAFL with kAFL, the associated TriforceLinuxSyscallFuzzer was slightly modified to work with our vulnerable Linux kernel module. Unfortunately, it was not possible to compare kAFL against Oracle's file system fuzzer [34] due to technical issues with its setup.

During each run, we fuzzed the JSON parser11 for 30 minutes. The averaged and rounded results are displayed in Table 1. As we can see, the performance of kAFL is very similar across different systems. It should be remarked that the variance in this experiment is rather high

3 https://access.redhat.com/security/cve/cve-2016-0758
4 https://access.redhat.com/security/cve/cve-2016-8650
5 http://seclists.org/fulldisclosure/2016/Nov/76
6 https://access.redhat.com/security/cve/cve-2016-8650
7 http://seclists.org/fulldisclosure/2016/Nov/75
8 http://seclists.org/bugtraq/2016/Nov/1
9 Reported to Microsoft Security.
10 Reported to Apple Product Security.
11 http://zserge.com/jsmn.html
Table 1: kAFL and TriforceAFL fuzzing performance on the JSON sample driver.
[25] P. Goodman. Shin GRR: Make Fuzzing Fast Again. https://blog.trailofbits.com/2016/11/02/shin-grr-make-fuzzing-fast-again/. Accessed: 2017-06-29.

[26] I. Haller, A. Slowinska, M. Neugschwandtner, and H. Bos. Dowsing for overflows: A guided fuzzer to find buffer boundary violations. In USENIX Security Symposium, 2013.

[27] C. Holler, K. Herzig, and A. Zeller. Fuzzing with code fragments. In USENIX Security Symposium, 2012.

[28] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual (Order number: 325384-058US, April 2016).

[29] A. Kleen. simple-pt: Simple Intel CPU processor tracing on Linux. https://github.com/andikleen/simple-pt.

[30] A. Kleen and B. Strong. Intel Processor Trace on Linux. Tracing Summit 2015, 2015.

[31] Microsoft. FSCTL_DISMOUNT_VOLUME. https://msdn.microsoft.com/en-us/library/windows/desktop/aa364562(v=vs.85).aspx, 2017.

[32] Microsoft. VHD Reference. https://msdn.microsoft.com/en-us/library/windows/desktop/dd323700(v=vs.85).aspx, 2017.

[38] J. Viide, A. Helin, M. Laakso, P. Pietikäinen, M. Seppänen, K. Halunen, R. Puuperä, and J. Röning. Experiences with model inference assisted fuzzing. In USENIX Workshop on Offensive Technologies (WOOT), 2008.

[39] T. Wang, T. Wei, G. Gu, and W. Zou. TaintScope: A checksum-aware directed fuzzing tool for automatic software vulnerability detection. In IEEE Symposium on Security and Privacy, 2010.

[40] M. Woo, S. K. Cha, S. Gottlieb, and D. Brumley. Scheduling black-box mutational fuzzing. In ACM Conference on Computer and Communications Security (CCS), 2013.
Priyam Biswas1, Alessandro Di Federico2, Scott A. Carr1, Prabhu Rajasekaran3, Stijn Volckaert3, Yeoul Na3, Michael Franz3, and Mathias Payer1

1 Department of Computer Science, Purdue University
{biswas12, carr27}@purdue.edu, mathias.payer@nebelwelt.net
2 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
alessandro.difederico@polimi.it
3 Department of Computer Science, University of California, Irvine
int main() {
  non_variadic function_ptr = variadic;
  function_ptr(1, 2);
}

In this case, the function call in main to function_ptr appears to the compiler as a non-variadic function call, since the type of the function pointer is not variadic. Therefore, our pass will not instrument the call site, leading to potential errors.

To handle such (rare) situations appropriately, we would have to instrument all non-variadic call sites too, leading to an unjustified overhead. Moreover, the code above represents undefined behavior in C [27, 6.3.2.3p8] and C++ [26, 5.2.10p6], and might not work on certain architectures where the calling conventions for variadic and non-variadic function calls are not compatible. The GNU C compiler emits a warning when a function pointer is cast to a different type; therefore, we require the developer to correct the code before applying HexVASAN.

Central management of the global state. To allow the HexVASAN runtime to be linked into the base system libraries, such as the C standard library, we made it a static library. Turning the runtime into a shared library is possible, but would prohibit its use during early process initialization – until the dynamic linker has processed all of the necessary relocations. Our runtime therefore either needs to be added solely to the C standard library (so that it is initialized early in the startup process) or the runtime library must carefully use weak symbols to ensure that each symbol is only defined once if multiple libraries are compiled with our countermeasure.

C++ exceptions and longjmp. If an exception is raised while executing a variadic function (or one of its callees), the variadic function may not get a chance to clean up the metadata for any va_lists it has initialized, nor may the caller of this variadic function get the chance to clean up the type information it has pushed onto the VCS. Other functions manipulating the thread's stack directly, such as longjmp, present similar issues.

C++ exceptions can be handled by modifying the LLVM C++ frontend (i.e., clang) to inject an object with a lifetime spanning from immediately before a variadic function call to immediately after. Such an object would call pre_call in its constructor and post_call in

5 Implementation

We implemented HexVASAN as a sanitizer for the LLVM compiler framework [31], version 3.9.1. We have chosen LLVM for its robust features for analyzing and transforming arbitrary programs as well as for extracting reliable type information. The sanitizer can be enabled from the C/C++ frontend (clang) by providing the -fsanitize=vasan parameter at compile-time. No annotations or other source code changes are required for HexVASAN. Our sanitizer does not require visibility of the whole source code (see Section 4.3), but works on individual compilation units. Therefore link-time optimization (LTO) is not required, and HexVASAN thus fits readily into existing build systems. In addition, HexVASAN also supports signal handlers.

HexVASAN consists of two components: a static instrumentation pass and a runtime library. The static instrumentation pass works on LLVM IR, adding the necessary instrumentation code to all variadic functions and their callees. The support library is statically linked to the program and, at run-time, checks the number and type of variadic arguments as they are used by the program. In the following we describe the two components in detail.

Static instrumentation. The implementation of the static instrumentation pass follows the description in Section 4. We first iterate through all functions, looking for CallInst instructions targeting a variadic function (either directly or indirectly); then we inspect them and create for each one of them a read-only GlobalVariable of type vcsd_t. As shown in Listing 2, vcsd_t is composed of an unsigned integer representing the number of arguments of the considered call site and a pointer to an array (another GlobalVariable) with an integer element of type type_t for each argument. type_t is an integer uniquely identifying a data type, obtained using the hashType function presented in Algorithm 2. At this point a call to pre_call is injected before the call site, with the newly created VCSD as a parameter, and a call to post_call is injected after the call site.

During the second phase, we first identify all va_start, va_copy, and va_end operations in the program. In the IR code, these operations appear as calls to the LLVM in-
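The per-call-site metadata and the index check that produces the "Index greater than Argument Count" error can be sketched as follows. The names vcsd_t and type_t mirror the text; the hash function is a simplified stand-in for Algorithm 2 (here FNV-1a over a textual type name), and the runtime's real VCS stack handling is omitted.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t type_t;

typedef struct {
    uint32_t count;      /* number of variadic arguments at the call site */
    const type_t *types; /* hashed type of each argument, in order */
} vcsd_t;

/* Stand-in for hashType (Algorithm 2): FNV-1a over a type's name. */
static type_t hash_type(const char *type_name) {
    type_t h = 0xcbf29ce484222325ULL;
    for (; *type_name; type_name++) {
        h ^= (unsigned char)*type_name;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Check performed when the callee consumes variadic argument `index`:
   0 = ok, 1 = index out of range, 2 = type mismatch. */
static int vasan_check(const vcsd_t *site, uint32_t index, type_t used_type) {
    if (index >= site->count) return 1; /* "Index greater than Argument Count" */
    if (site->types[index] != used_type) return 2;
    return 0;
}
```

In the real pass the vcsd_t instances are emitted as read-only globals at compile time, so the run-time cost is a table lookup and two comparisons per consumed argument.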
Listing 2: Simplified C++ representation of the instrumented code for Listing 1.

Listing 3: Error message reported by HexVASAN.

Table 1: Detection coverage for several types of illegal calls to variadic functions. ✓ indicates detection, ✗ indicates non-detection. "A.T." stands for address taken.
GCC VTV, and Visual C++ CFG allow all of the illegal calls, even if the non-variadic type signature does not match that of the intended call target.

pi-CFI allows calls to the avg_longs function from the sum_ints indirect call site. avg_longs is address-taken and it has the same static type signature as the intended call target. pi-CFI does not allow illegal calls to non-address-taken functions or functions with different static type signatures. pi-CFI also does not allow calls to variadic functions from non-variadic call sites.

All implementations of CFI allow direct overwrites of variadic arguments, as long as the original control flow of the program is not violated.

6.2 Exploit Detection

To evaluate the effectiveness of our tool as a real-world exploit detector, we built a HexVASAN-hardened version of sudo 1.8.3. sudo allows authorized users to execute shell commands as another user, often one with a high privilege level on the system. If compromised, sudo can escalate the privileges of non-authorized users, making it a popular target for attackers. Versions 1.8.0 through 1.8.3p1 of sudo contained a format string vulnerability (CVE-2012-0809) that allowed exactly such a compromise. This vulnerability could be exploited by passing a format string as the first argument (argv[0]) of the sudo program. One such exploit was shown to bypass ASLR, DEP, and glibc's FORTIFY_SOURCE protection [20]. In addition, we were able to verify that GCC 5.4.0 and clang 3.8.0 fail to catch this exploit, even when annotating the vulnerable function with the format function attribute [5] and setting the compiler's format string checking (-Wformat) to the highest level.

Although it is sudo itself that calls the format string function (fprintf), HexVASAN can only detect the violation on the callee side. We therefore had to build hardened versions of not just the sudo binary itself, but also the C library. We chose to do this on the FreeBSD platform, as its standard C library can be easily built using LLVM, and HexVASAN therefore readily fits into the FreeBSD build process. As expected, HexVASAN does detect any exploit that triggers the vulnerability, producing the error message shown in Listing 4.

$ ln -s /usr/bin/sudo %x%x%x%x
$ ./%x%x%x%x -D9 -A
--------------------------
Error: Index greater than Argument Count
Index is 1
Backtrace:
[0] 0x4053bf <__vasan_backtrace+0x1f> at sudo
[1] 0x405094 <__vasan_check_index+0xf4> at sudo
[2] 0x8015dce24 <__vfprintf+0x2174> at libc.so
[3] 0x8015dac52 <vfprintf_l+0x212> at libc.so
[4] 0x8015daab3 <vfprintf_l+0x73> at libc.so
[5] 0x40bdaf <sudo_debug+0xdf> at sudo
[6] 0x40ada3 <main+0x6c3> at sudo
[7] 0x40494f <_start+0x17f> at sudo

Listing 4: Exploit detection in sudo.

6.3 Prevalence of variadic functions

To collect variadic function usage in real software, we extended our instrumentation mechanism to collect statistics about variadic functions and their calls. As shown in Table 2, for each program, we collect:
ETH_PAUSE, "ETH_PAUSE",
ETHCTRL_DATA, "ETHCTRL_DATA",
"ETHCTRL_REGISTER_DSAP",
ETHCTRL_DEREGISTER_DSAP,
"ETHCTRL_DEREGISTER_DSAP",
ETHCTRL_SENDPAUSE, "ETHCTRL_SENDPAUSE",
0, NULL
);

Listing 5: Variadic violation in omnetpp.

Figure 2: Run-time overhead of HexVASAN in the SPECint CPU2006 benchmarks, compared to baseline LLVM 3.9.1 performance.
Table 4: Performance overhead in micro-benchmarks.

7 Related work

HexVASAN can either be used as an always-on runtime monitor to mitigate exploits or as a sanitizer to detect bugs, sharing similarities with the sanitizers that exist primarily in the LLVM compiler. Similar to HexVASAN, these sanitizers embed run-time checks into a program by instrumenting potentially dangerous program instructions.

AddressSanitizer [54] (ASan) instruments memory accesses and allocation sites to detect spatial memory errors, such as out-of-bounds accesses, as well as temporal memory errors, such as use-after-free bugs. UndefinedBehaviorSanitizer [52] (UBSan) instruments various types of instructions to detect operations whose semantics are not strictly defined by the C and C++ standards, e.g., increments that cause signed integers to overflow, or null-pointer dereferences. ThreadSanitizer [55] (TSan) instruments memory accesses and atomic operations to detect data races, deadlocks, and various misuses of synchronization primitives. MemorySanitizer [58] (MSan) detects uses of uninitialized memory.

CaVer [32] is a sanitizer targeted at verifying correctness of downcasts in C++. Downcasting converts a base class pointer to a derived class pointer. This operation may be unsafe, as it cannot be statically determined, in general, whether the pointed-to object is of the derived class type. TypeSan [25] is a refinement of CaVer that reduces overhead and improves the sanitizer coverage.

UniSan [34] sanitizes information leaks from the kernel. It ensures that data is initialized before leaving the kernel, preventing reads of uninitialized memory.

All of these sanitizers are highly effective at finding specific types of bugs, but, unlike HexVASAN, they do not address misuses of variadic functions. The aforementioned sanitizers also differ from HexVASAN in that they typically incur significant run-time and memory overhead.

Different control-flow hijacking mitigations offer partial protection against variadic function attacks by preventing adversaries from calling variadic functions through control-flow edges that do not appear in legitimate executions of the program.

Control-flow hijacking mitigations cannot prevent attackers from overwriting variadic arguments directly. At best, they can prevent variadic functions from being called through control-flow edges that do not appear in legitimate executions of the program. We therefore argue that HexVASAN and these mitigations are orthogonal. Moreover, prior research has shown that many of the aforementioned implementations fail to fully prevent control-flow hijacking as they are too imprecise [8, 17, 19, 23], too limited in scope [53, 57], vulnerable to information leakage attacks [18], or vulnerable to spraying attacks [24, 45]. We further showed in Section 6.1 that variadic functions exacerbate CFI's imprecision problems, allowing additional leeway for adversaries to attack variadic functions.

Defenses that protect against direct overwrites or misuse of variadic arguments have thus far only focused on format string attacks, which are a subset of the possible attacks on variadic functions. LibSafe detects potentially dangerous calls to known format string functions such as printf and sprintf [60]. A call is considered dangerous if a %n specifier is used to overwrite the frame pointer or return address, or if the argument list for the printf function is not contained within a single stack frame. FormatGuard [12] instruments calls to printf and checks if the number of arguments passed to printf matches the number of format specifiers used in the format string.

Shankar et al. proposed to use static taint analysis to detect calls to format string functions where the format string originates from an untrustworthy source [56]. This approach was later refined by Chen and Wagner [10] and used to analyze thousands of packages in the Debian 3.1 Linux distribution. TaintCheck [39] also detects untrustworthy format strings, but relies on dynamic taint analysis to do so.

FORTIFY_SOURCE of glibc provides some lightweight checks to ensure all the arguments are consumed. However, it can be bypassed [2] and does not check for type mismatches. Hence, none of these aforementioned solutions provides comprehensive protection against variadic argument overwrites or misuse.
[29] Jelinek, J. FORTIFY_SOURCE. https://gcc.gnu.org/ml/gcc-patches/2004-09/msg02055.html, 2004.

[30] Kuznetsov, V., Szekeres, L., Payer, M., Candea, G., Sekar, R., and Song, D. Code-pointer integrity. In USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2014).

[31] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (2004).

[32] Lee, B., Song, C., Kim, T., and Lee, W. Type casting verification: Stopping an emerging attack vector. In USENIX Security Symposium (2015).

[33] Linux Programmer's Manual. va_start (3) - Linux Manual Page.

[34] Lu, K., Song, C., Kim, T., and Lee, W. UniSan: Proactive kernel memory initialization to eliminate data leakages. In ACM Conference on Computer and Communications Security (CCS) (2016).

[35] Mashtizadeh, A. J., Bittau, A., Boneh, D., and Mazières, D. CCFI: Cryptographically enforced control flow integrity. In ACM Conference on Computer and Communications Security (CCS) (2015).

[36] Matz, M., Hubicka, J., Jaeger, A., and Mitchell, M. System V application binary interface. AMD64 Architecture Processor Supplement, Draft v0.99 (2013).

[37] Microsoft Corporation. Control Flow Guard (Windows). https://msdn.microsoft.com/en-us/library/windows/desktop/mt637065(v=vs.85).aspx, 2016.

[38] Mohan, V., Larsen, P., Brunthaler, S., Hamlen, K., and Franz, M. Opaque control-flow integrity. In Symposium on Network and Distributed System Security (NDSS) (2015).

[39] Newsome, J., and Song, D. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Symposium on Network and Distributed System Security (NDSS) (2005).

[40] Nissil, R. http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1886.

[49] Payer, M., Barresi, A., and Gross, T. R. Fine-grained control-flow integrity through binary hardening. In Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA) (2015).

[50] Pewny, J., and Holz, T. Control-flow restrictor: Compiler-based CFI for iOS. In Annual Computer Security Applications Conference (ACSAC) (2013).

[51] Prakash, A., Hu, X., and Yin, H. vfGuard: Strict protection for virtual function calls in COTS C++ binaries. In Symposium on Network and Distributed System Security (NDSS) (2015).

[52] Project, G. C. Undefined behavior sanitizer. https://www.chromium.org/developers/testing/undefinedbehaviorsanitizer.

[53] Schuster, F., Tendyck, T., Liebchen, C., Davi, L., Sadeghi, A.-R., and Holz, T. Counterfeit object-oriented programming: On the difficulty of preventing code reuse attacks in C++ applications. In IEEE Symposium on Security and Privacy (S&P) (2015).

[54] Serebryany, K., Bruening, D., Potapenko, A., and Vyukov, D. AddressSanitizer: A fast address sanity checker. In USENIX Annual Technical Conference (2012).

[55] Serebryany, K., and Iskhodzhanov, T. ThreadSanitizer: Data race detection in practice. In Workshop on Binary Instrumentation and Applications (2009).

[56] Shankar, U., Talwar, K., Foster, J. S., and Wagner, D. Detecting format string vulnerabilities with type qualifiers. In USENIX Security Symposium (2001).

[57] Snow, K. Z., Monrose, F., Davi, L., Dmitrienko, A., Liebchen, C., and Sadeghi, A. Just-in-time code reuse: On the effectiveness of fine-grained address space layout randomization. In IEEE Symposium on Security and Privacy (S&P) (2013).

[58] Stepanov, E., and Serebryany, K. MemorySanitizer: Fast detector of uninitialized memory use in C++. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (2015).

[59] Tice, C., Roeder, T., Collingbourne, P., Checkoway, S., Erlingsson, Ú., Lozano, L., and Pike, G. Enforcing forward-edge control-flow integrity in GCC & LLVM. In USENIX Security Symposium (2014).
We first investigate the instruction-dependent form of the leakage in a simple setting, where differential effects from other factors do not yet play a role. For this purpose, we perform the same fixed sequence mov-instr-mov 5,000 times for each selected instruction, as the two 32-bit operands vary. We measure the power consumption (or EM, in the case of the M4) associated with each sequence, and identify as a suitable point the maximum peak4 in the clock cycle during which the instruction leaks. We fit the model (1) to the (drift-adjusted) vector of measurements at this point.

Table 3 in the Appendix confirms the overall significance of the model for each M0 instruction. This supports our point selection and the intuition that the leakage

• Both operands to the ALU instructions contribute, except in the case of those involving immediate values (which, again, essentially have no second operand).
• Only the second operand to the load and store instructions contributes significantly.
• Bus transitions contribute to the instructions on immediate values, and also to cmp.

Whilst we again advise against over-interpreting the R-squareds (see Section 2.1), a comparison between the first rows of Tables 3 and 4 indicates that, in the case of the ALU and shift instructions, model (1) accounts for substantially less of the variation in the M4 EM traces than it does of the variation in the M0 power traces. Although this could be taken as evidence that the instructions in question leak more in the case of the M0 and less in

2 The significance level should be understood as the probability of rejecting the null hypothesis when it is in fact true.
3 The number large enough to imply inconsistency with the distributional assumption, fixing the probability of error at α.
4 This choice is specific to our measurements and is by no means the only option.
5 'Significant at the 5% level' is a shorthand way of saying that the null hypothesis of 'no effect' is rejected by the F-test when the probability of a false rejection (Type I error) is fixed at 5%.
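For a balanced random design, the per-bit coefficients of a model like (1) can be estimated by comparing mean leakage when a given operand bit is set against when it is clear. This mean-difference estimator is a simplified stand-in for the full regression and F-tests used in the text, not the authors' actual fitting procedure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Estimate the leakage contribution of operand bit `bit` as
   mean(trace | bit set) - mean(trace | bit clear). */
static double bit_coefficient(const uint32_t *operands, const double *traces,
                              size_t n, unsigned bit) {
    double sum1 = 0.0, sum0 = 0.0;
    size_t n1 = 0, n0 = 0;
    for (size_t i = 0; i < n; i++) {
        if ((operands[i] >> bit) & 1u) { sum1 += traces[i]; n1++; }
        else                           { sum0 += traces[i]; n0++; }
    }
    if (n1 == 0 || n0 == 0) return 0.0; /* bit never varied in the sample */
    return sum1 / (double)n1 - sum0 / (double)n0;
}
```

With uniformly random operands the bits are (nearly) orthogonal regressors, so this difference of means coincides with the least-squares coefficient up to sampling noise.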
[Figure residue: silhouette plots (Cluster vs. Silhouette Value) and fitted model coefficients (β×10³ vs. Bit) for the load and multiply groups, operands O1/O2.]

We eventually want to allow for possible differences in the leakage behaviours of instructions depending on adjacent activity in sequences of code (as per [27]). We therefore cluster the instructions according to the fitted coefficients of model (1) for both the M0 and the M4. We use the average Euclidean distance between instruction models to form a hierarchy of clusters (represented by the dendrograms in Fig. 6). Adjusting the inconsistency threshold6 between 0.7 and 1.2 produces the groupings reported in Tables 5 and 6. In the case of the M0, these align nicely …
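The clustering step can be sketched as follows, assuming each instruction is summarised by its vector of fitted model coefficients. This is a minimal average-linkage agglomeration with a plain distance cut-off rather than the inconsistency threshold used for the dendrograms, and the coefficient vectors are invented for illustration:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_distance(c1, c2):
    # Average Euclidean distance between all cross-cluster pairs of models.
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def cluster(models, threshold):
    # Repeatedly merge the pair of clusters with the smallest average
    # distance until that distance exceeds the cut-off.
    clusters = [([name], [vec]) for name, vec in models.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c[0]) for c in clusters]
```

With coefficient vectors that place the ALU-style instructions close together and the load apart, the cut-off recovers the expected grouping.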
In summary, our exploratory analysis of the data-dependent form of the instruction leakages confirms many of our a priori intuitions about the architecture and supports our model building approach as sensible and meaningful. It also indicates that we can lessen the burden of the task by reducing the number of distinct instructions to be modelled to a meaningfully representative subset of the initial 21. Reducing unnecessary complexity in the instruction set increases the scope for adding meaningfully explanatory complexity to the models themselves, which we proceed to do in the next section for the power consumption of the M0.

5 Building Complex Models for the M0

From this point forward we concentrate on the M0 and seek to build more complex, sequence-dependent models for five instructions chosen to represent the groups identified by the clustering analysis of Sect. 4.2: eors, lsls, str, ldr and muls. The model coefficients for each of these are shown in Fig. 7 (see Appendix). As we would hope, they can be observed (by comparing with Fig. 3) to match well the mean coefficients for the groups that they represent, with the possible exception of str, which has smaller coefficients on the first byte than the average within its group.

We are confident that these five are adequate for understanding the leakage behaviour of all 21. Restricting the analysis in this way enables exhaustive exploration of the effects of preceding and subsequent operations when instructions are performed in sequence.7

7 Such an approach implicitly makes the further assumption that instructions within each identified cluster are affected similarly by the sequence of which they are a part.

5.2 Exploring Register Effects

The ARM Cortex-M0 architecture distinguishes between low (r0–r7) and high (r8–r15) registers. The latter, which can only be accessed by the mov instruction, are used for fast temporary storage. These were observed by inspection to have different leakage characteristics to the low registers. However, due to their singular usage we consider them outside of the scope of this particular analysis and focus only on the low registers. For the purposes of future extensions to our methodology, we propose modelling high register movs as an additional distinct instruction.

We test for variation between the eight low registers by collecting 5,000 traces for each source register (rn) and destination register (rd) (evenly distributed over the possible source/destination pairs, making 625 per pair) as movs are performed on random inputs. We then fit model (1) with the addition of dummy variables for source register and for destination register, and compare this against a model with the further addition of register/data interaction dummies, in order to test the significance of the latter.

We find that the registers do have a jointly significant effect on the leakage data-dependency (see LHS of Tab. 7 in Appendix A). Considered separately, only the source register effect remains significant; at the 5% level we do not reject the null hypothesis that the destination register has no effect. Moreover, the effect can be isolated (by testing one ‘source register interaction’ at a time relative to the model with no source register interactions) to just half the source registers (r0, r1, r4 and r7).

This analysis suggests that the inclusion of (some) source register effects would increase the ability of the model to accurately approximate the data-dependent leakage. However, such an extension would add considerably to the complexity of the models …
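The register dummies described above can be sketched with a toy one-hot encoding (r0 serves as the baseline category in each group to avoid collinearity with the intercept; in the full design these columns would simply be appended to the regressors of model (1)):

```python
def register_dummies(rn, rd, n_regs=8):
    # One-hot dummy variables for source (rn) and destination (rd)
    # registers, dropping r0 as the baseline in each group.
    src = [1.0 if rn == r else 0.0 for r in range(1, n_regs)]
    dst = [1.0 if rd == r else 0.0 for r in range(1, n_regs)]
    return src + dst
```

A nested-model F-test on the fits with and without these columns (and, further, with register/data interaction columns) then gives the joint significance results reported in Tab. 7.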
A simple way to check how well a model corresponds to real leakage behaviour is to compute the (Pearson) correlation between the model predictions for a particular instruction (operating on a set of known inputs) and the measured traces corresponding to a code sequence containing that same instruction (operating on the same inputs). This procedure can be used to demonstrate the improvement of our derived models over weaker, assumed models, such as the Hamming weight.

Figure 4 juxtaposes the correlation traces produced by the Hamming weight prediction of the leakage associated with the first round output as the M0 performs AES (top), and by the ELMO prediction corresponding to the same intermediate being loaded into the register (middle). It can be clearly seen that the ELMO model generates larger peaks, and more of them. The bottom of Figure 4 shows, for comparison, the peaks which are exhibited when the model predictions are correlated with an equivalent set of ELMO-emulated traces. These indicate the same leakage points as displayed in the measured traces, with the advantage of enhanced definition thanks to the lack of (data-independent) noise in the simulations. It thus emerges that Hamming weight-based simulations do not give a full picture of the true leakage of an implementation on an M0, and should not be relied upon for pre-empting data sensitivities. The same picture emerges for the other instructions but we do not include an exhaustive analysis for the sake of brevity. In conclusion, our models represent a marked improvement over simply using the Hamming weight.

Figure 4: Correlation traces for ELMO-predicted intermediate values (top) and Hamming weight model predictions (middle) in 500 real M0 traces; correlation trace for ELMO-predicted intermediate values in the equivalent set of ELMO-emulated traces (bottom).
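The correlation check described above can be sketched with synthetic data. The per-bit weights and noise level here are invented for illustration; they stand in for the fitted ELMO coefficients and for measurement noise:

```python
import math
import random

def hamming_weight(x):
    return bin(x).count("1")

def pearson(xs, ys):
    # Sample Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(1)
weights = [0.2 + 0.3 * b for b in range(8)]   # unequal per-bit leakage (assumed)
data = [rng.randrange(256) for _ in range(2000)]
trace = [sum(w for b, w in enumerate(weights) if v >> b & 1)
         + rng.gauss(0, 0.5) for v in data]    # noisy 'measured' leakage

hw_pred = [hamming_weight(v) for v in data]    # weak, assumed model
bit_pred = [sum(w for b, w in enumerate(weights) if v >> b & 1)
            for v in data]                     # richer fitted-style model
```

Because the per-bit weights are unequal, the bit-weighted predictions correlate visibly better with the noisy traces than the plain Hamming weight does, which is the effect Figure 4 illustrates on real measurements.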
6.3 Evaluating Models via Leakage Detection

Further to the capability of our models to improve correlation analysis, we now show that they can also be applied to the task of (automated) leakage detection on assembly implementations. They can thereby be used to spot ‘subtle’ leaks – that is, leaks that would be difficult for non-specialist software engineers to understand and pinpoint.

To aid readability we briefly overview the leakage detection procedures proposed by Goodwill et al. [10]. These are based on classical statistical hypothesis tests, and can be categorised as specific or non-specific. Specific tests divide the traces into two subsets based on some known intermediate value such as an output bit of an S-box or the equality (or otherwise) of a round output byte to a particular value. The non-specific ‘fixed-versus-random’ test acquires traces associated with a particular fixed data input and compares them against traces associated with random inputs. In all cases the Welch’s two-sample t-test for equality of means is then performed; results that are larger than a defined threshold, which we indicate via a dotted line in our figures, are taken as evidence for a leak.

6.3.1 Detecting ‘Subtle’ Leaks

We now choose a code sequence relating to a supposedly protected AES operation. The code sequence implements a standard countermeasure called masking [1]. Masking essentially distributes all intermediate variables into shares which are statistically independent, but whose composition (typically by way of exclusive-or) results in the (unmasked) variables. Consequently, standard DPA attacks [18] no longer succeed. The ease of implementation in software and the ability to provide some sort of proof of leakage resilience have made masking a popular side channel attack countermeasure, on the receiving end of considerable attention from academia and industry alike. However, it is also well-known that implementations of masking schemes can produce subtle unanticipated leakages [17].

We faithfully implemented a masking scheme for AES (as described in [17]) in Thumb assembly to avoid the potential introduction of masking flaws by the compiler (from C to assembly). The code sequence, which we will analyse and discuss, relates to an operation called ShiftRows.
[Figure residue: fixed-versus-random t-statistic traces for simulated (top) and measured (bottom) executions.]

… function. In a masked implementation, this results in a masked state byte (in a register) being rotated and then written back into memory. Table 1 shows the assembly code for ShiftRows. An experienced and side-channel aware implementer who has detailed leakage information about the M0 would now be able to spot a problem with this code: because the ror instruction also leaks a function of the Hamming weight …
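The fixed-versus-random procedure of Section 6.3 can be sketched as follows. The detection threshold of 4.5 is the value commonly used in the leakage-detection literature; the text above speaks only of ‘a defined threshold’, so treat it as an assumption:

```python
import math

def welch_t(xs, ys):
    # Welch's two-sample t-statistic (unequal variances, unequal sizes).
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

THRESHOLD = 4.5  # conventional leakage-detection threshold (assumption)

def leaks(fixed_traces, random_traces):
    # Evidence for a leak if |t| exceeds the threshold at this trace point.
    return abs(welch_t(fixed_traces, random_traces)) > THRESHOLD
```

In practice the test is run pointwise along the traces, giving t-statistic curves like the ones the figure residue above stood for.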
[26] C. Thuillet, P. Andouard, and O. Ly. A smart card power analysis simulator. In Proceedings of the 12th IEEE International Conference on Computational Science and Engineering, CSE 2009, pages 847–852. IEEE Computer Society, 2009.

[28] N. Veshchikov. SILK: high level of abstraction leakage simulator for side channel analysis. In …

Figure 6: Dendrograms representing the hierarchical clustering of the M0 (left) and M4 (right) instructions according to the fitted leakage models.

[Figure residue: fitted model coefficients (β×10³ vs. Bit) for eor, lsl, str, ldr and mul, operands O1/O2.]
              adds   adds#imm  ands   cmp    cmp#imm  eors   ldr    ldrb   ldrh   lsls   lsrs
R-squared     0.276  0.289     0.253  0.227  0.260    0.202  0.147  0.107  0.187  0.296  0.292
F-statistic
 Operand 1    19.36  6.09      5.20   15.60  7.32     4.29   0.93   0.82   1.53   32.88  32.70
 Operand 2    10.23  -0.00     7.25   4.70   0.00     3.92   22.55  14.59  30.78  7.22   5.18
 Transition 1 23.35  29.52     30.40  20.25  24.12    19.35  0.93   1.18   1.40   20.18  18.88
 Transition 2 5.20   24.11     9.11   3.76   20.95    10.66  1.22   1.43   0.67   1.92   2.96
 Combined     14.51  15.44     12.89  11.19  13.40    9.63   6.55   4.54   8.78   15.98  15.69

              movs   movs#imm  muls   orrs   rors   str    strb   strh   subs   subs#imm
R-squared     0.255  0.455     0.278  0.214  0.315  0.061  0.075  0.067  0.237  0.271
F-statistic
 Operand 1    3.18   8.80      32.68  3.17   24.02  1.24   0.72   1.12   13.78  5.38
 Operand 2    3.93   -0.00     20.93  3.52   20.28  4.99   7.36   5.18   3.45   0.00
 Transition 1 22.83  53.03     2.25   15.44  23.60  1.06   1.29   1.21   24.07  27.96
 Transition 2 20.71  63.83     1.05   17.25  1.90   2.46   2.66   3.55   4.68   23.53
 Combined     13.04  31.71     14.68  10.34  17.50  2.46   3.10   2.73   11.82  14.16

Table 3: F-tests for significant joint data effects in the power consumption of the M0; tests which fail to reject at the 5% level are shaded grey. Critical values shown in brackets in the row headings. Degrees of freedom for the F-tests are (128,4871) for the combined test, (32,4871) for the rest.
              adds   adds#imm  ands   cmp    cmp#imm  eors   ldr    ldrb   ldrh   lsls   lsrs
R-squared     0.048  0.052     0.049  0.050  0.086    0.051  0.148  0.135  0.124  0.047  0.055
F-statistic
 Operand 1    1.58   3.82      1.72   2.16   10.96    3.80   1.13   1.04   1.11   0.76   1.09
 Operand 2    3.98   -0.00     3.92   3.49   0.00     2.63   22.97  20.18  18.00  4.59   6.09
 Transition 1 1.33   0.63      1.25   0.73   0.81     1.02   0.92   0.90   0.97   0.70   0.51
 Transition 2 0.67   4.07      0.73   1.47   2.25     0.87   0.91   0.67   0.87   1.11   1.07
 Combined     1.94   2.11      1.97   1.98   3.60     2.06   6.60   5.92   5.41   1.88   2.20

              movs   movs#imm  muls   orrs   rors   str     strb    strh    subs   subs#imm
R-squared     0.029  0.063     0.038  0.038  0.117  0.546   0.814   0.691   0.046  0.068
F-statistic
 Operand 1    1.09   0.76      1.30   2.59   15.85  1.13    1.09    1.05    2.62   7.06
 Operand 2    1.01   0.00      1.87   1.88   2.00   177.01  649.37  329.94  2.47   0.00
 Transition 1 1.13   0.88      1.58   0.77   1.13   0.75    1.01    0.98    0.68   1.57
 Transition 2 1.35   8.29      1.16   0.68   1.06   0.87    1.14    0.72    1.32   2.52
 Combined     1.15   2.54      1.48   1.51   5.05   45.73   166.32  85.10   1.84   2.80

Table 4: F-tests for significant joint data effects in the EM radiation of the M4; tests which fail to reject at the 5% level are shaded grey. Critical values shown in brackets in the row headings. Degrees of freedom for the F-tests are (128,4871) for the combined test, (32,4871) for the rest.
Strong and Efficient Cache Side-Channel Protection using Hardware Transactional Memory

Daniel Gruss∗, Julian Lettner†, Felix Schuster, Olga Ohrimenko, Istvan Haller, Manuel Costa
Microsoft Research

∗ Work done during internship at Microsoft Research; affiliated with Graz University of Technology.
† Work done during internship at Microsoft Research; affiliated with University of California, Irvine.

Abstract

Cache-based side-channel attacks are a serious problem in multi-tenant environments, for example, modern cloud data centers. We address this problem with Cloak, a new technique that uses hardware transactional memory to prevent adversarial observation of cache misses on sensitive code and data. We show that Cloak provides strong protection against all known cache-based side-channel attacks with low performance overhead. We demonstrate the efficacy of our approach by retrofitting vulnerable code with Cloak and experimentally confirming immunity against state-of-the-art attacks. We also show that by applying Cloak to code running inside Intel SGX enclaves we can effectively block information leakage through cache side channels from enclaves, thus addressing one of the main weaknesses of SGX.

1 Introduction

Hardware-enforced isolation of virtual machines and containers is a pillar of modern cloud computing. While the hardware provides isolation at a logical level, physical resources such as caches are still shared amongst isolated domains, to support efficient multiplexing of workloads. This enables different forms of side-channel attacks across isolation boundaries. Particularly worrisome are cache-based attacks, which have been shown to be potent enough to allow for the extraction of sensitive information in realistic scenarios, e.g., between co-located cloud tenants [56].

In the past 20 years cache attacks have evolved from theoretical attacks [38] on implementations of cryptographic algorithms [4] to highly practical generic attack primitives [43, 62]. Today, attacks can be performed in an automated fashion on a wide range of algorithms [24].

Many countermeasures have been proposed to mitigate cache side-channel attacks. Most of these countermeasures either try to eliminate resource sharing [12, 18, 42, 52, 58, 68, 69], or they try to mitigate attacks after detecting them [9, 53, 65]. However, it is difficult to identify all possible leakage through shared resources [34, 55] and eliminating sharing always comes at the cost of efficiency. Similarly, the detection of cache side-channel attacks is not always sufficient, as recently demonstrated attacks may, for example, recover the entire secret after a single run of a vulnerable cryptographic algorithm [17, 43, 62]. Furthermore, attacks on singular sensitive events are in general difficult to detect, as these can operate at low attack frequencies [23].

In this paper, we present Cloak, a new efficient defensive approach against cache side-channel attacks that allows resource sharing. At its core, our approach prevents cache misses on sensitive code and data. This effectively conceals cache access-patterns from attackers and keeps the performance impact low. We ensure permanent cache residency of sensitive code and data using widely available hardware transactional memory (HTM), which was originally designed for high-performance concurrency. HTM allows potentially conflicting threads to execute transactions optimistically in parallel: for the duration of a transaction, a thread works on a private memory snapshot. In the event of conflicting concurrent memory accesses, the transaction aborts and all corresponding changes are rolled back. Otherwise, changes become visible atomically when the transaction completes. Typically, HTM implementations use the CPU caches to keep track of transactional changes. Thus, current implementations like Intel TSX require that all accessed memory remains in the CPU caches for the duration of a transaction. Hence, transactions abort not only on real conflicts but also whenever transactional memory is evicted prematurely to DRAM. This behavior makes HTM a powerful tool to mitigate cache-based side channels.

The core idea of Cloak is to execute leaky algorithms in HTM-backed transactions while ensuring that all sensitive data and code reside in transactional memory for the duration of the execution. If a transaction succeeds, secret-dependent control flows and data accesses are guaranteed to stay within the CPU caches. Otherwise, the corresponding transaction would abort. As we show and discuss, this simple property can greatly raise the bar for contemporary cache side-channel attacks or even prevent them completely. The Cloak approach can be implemented on top of any HTM that provides the aforementioned basic properties. Hence, compared to other approaches [11, 42, 69] that aim to provide isolation …
… a shared line from the cache and, after some waiting, checks if it was brought back through the victim’s execution. Flush+Reload attacks have been studied extensively in different variations [2, 41, 66]. Apart from the CPU caches, the shared nature of other system resources has also been exploited in side-channel attacks. This includes different parts of the CPU’s branch-prediction facility [1, 15, 40], the DRAM row buffer [5, 55], the page-translation caches [21, 28, 36] and other micro-architectural elements [14].

This paper focuses on mitigating Prime+Probe and Flush+Reload. However, Cloak conceptually also thwarts other memory-based side-channel attacks such as those that exploit the shared nature of the DRAM.

Figure 1: HTM ensures that no concurrent modifications influence the transaction, either by preserving the old value or by aborting and reverting the transaction.

2.3 Hardware Transactional Memory

HTM allows for the efficient implementation of parallel algorithms [27]. It is commonly used to elide expensive software synchronization mechanisms [16, 63]. Informally, for a CPU thread executing a hardware transaction, all other threads appear to be halted; whereas, from the outside, a transaction appears as an atomic operation. A transaction fails if the CPU cannot provide this atomicity due to resource limitations or conflicting concurrent memory accesses. In this case, all transactional changes need to be rolled back. To be able to detect conflicts and revert transactions, the CPU needs to keep track of transactional memory accesses. Therefore, transactional memory is typically divided into a read set and a write set. A transaction’s read set contains all read memory locations. Concurrent read accesses by other threads to the read set are generally allowed; however, concurrent writes are problematic and—depending on the actual HTM implementation and circumstances—likely lead to transactional aborts. Further, any concurrent accesses to the write set necessarily lead to a transactional abort. Figure 1 visualizes this exemplarily for a simple transaction with one conflicting concurrent thread.

Commercial Implementations. Implementations of HTM can be found in different commercial CPUs, among others, in many recent professional and consumer Intel CPUs. Nakaike et al. [48] investigated four commercial HTM implementations from Intel and other vendors. They found that all processors provide comparable functionality to begin, end, and abort transactions and that all implement HTM within the existing CPU cache hierarchy. The reason for this is that only caches can be held in a consistent state by the CPU itself. If data is evicted to DRAM, transactions necessarily abort in these implementations. Nakaike et al. [48] found that all four implementations detected access conflicts at cache-line granularity and that failed transactions were reverted by invalidating the cache lines of the corresponding write sets. Depending on the implementation, read and write sets can have different sizes, and set sizes range from multiple KB to multiple MB of HTM space.

Due to HTM usually being implemented within the CPU cache hierarchy, HTM has been proposed as a means for optimizing cache maintenance and for performing security-critical on-chip computations: Zacharopoulos [64] uses HTM combined with prefetching to reduce the system energy consumption. Guan et al. [25] designed a system that uses HTM to keep RSA private keys encrypted in memory and only decrypt them temporarily inside transactions. Jang et al. [36] used hardware transaction aborts upon page faults to defeat kernel address-space layout randomization.

3 Attacker Model

We consider multi-tenant environments where tenants do not trust each other, including local and cloud environments, where malicious tenants can use shared resources to extract information about other tenants. For example, they can influence and measure the state of caches via the attacks described in Section 2.2. In particular, an attacker can obtain a high-resolution trace of its own memory access timings, which are influenced by operations of the victim process. More abstractly, the attacker can obtain a trace where at each time frame the attacker learns whether the victim has accessed a particular memory location. We consider the above attacker in three realistic environments which give her different capabilities:

Cloud We assume that the processor, the OS and the hypervisor are trusted in this scenario while other cloud tenants are not. This enables the attacker to launch cross-VM Prime+Probe attacks.
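The attacker’s capability can be illustrated with a toy shared-cache model. This is purely conceptual: real Flush+Reload uses clflush and access-time measurements rather than an explicit set of cached addresses, and all names here are invented:

```python
class ToyCache:
    # Minimal shared-cache model: a set of cached line addresses.
    def __init__(self):
        self.lines = set()

    def access(self, line):    # victim or attacker touches a line
        self.lines.add(line)

    def flush(self, line):     # clflush analogue
        self.lines.discard(line)

    def probe(self, line):     # reload: True means a cache hit was observed
        hit = line in self.lines
        self.lines.add(line)
        return hit

def flush_reload_round(cache, victim, monitored):
    # One Flush+Reload round: flush the monitored lines, let the victim
    # run, then probe; the hits reveal which lines the victim touched.
    for line in monitored:
        cache.flush(line)
    victim(cache)
    return {line for line in monitored if cache.probe(line)}
```

Each round yields exactly the per-time-frame observation described above: whether the victim accessed a particular memory location.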
All other side-channels, including power analysis, and channels based on shared microarchitectural elements other than caches are outside our scope.

4 Hardware Transactional Memory as a Side-Channel Countermeasure

The foundation of all cache side-channel attacks are the timing differences between cache hits and misses, which an attacker tries to measure. The central idea behind Cloak is to instrument HTM to prevent any cache misses on the victim’s sensitive code and data. In Cloak, all sensitive computation is performed in HTM. Crucially, in Cloak, all security-critical code and data is deterministically preloaded into the caches at the beginning of a transaction. This way, security-critical memory locations become part of the read or write set and all subsequent, possibly secret-dependent, accesses are guaranteed to be served from the CPU caches. Otherwise, in case any preloaded code or data is evicted from the cache, the transaction necessarily aborts and is reverted. (See Listing 1 for an example that uses the TSX instructions xbegin and xend to start and end a transaction.) Given an ideal HTM implementation, Cloak thus prevents that an attacker can obtain a trace that shows whether the victim has accessed a particular memory location. More precisely, in the sense of Cloak, ideal HTM has the following properties:

R1 Both data and code can be added to a transaction as transactional memory and thus are included in the HTM atomicity guarantees.

R4 Prefetching decisions outside of transactions are not influenced by transactional memory accesses.

R1 ensures that all sensitive code and data can be added to the transactional memory in a deterministic and leakage-free manner. R2 ensures that any cache line evictions are detected implicitly by the HTM and the transaction aborts before any non-cached access is performed. R3 and R4 ensure that there is no leakage after a transaction has succeeded or aborted.

Unfortunately, commercially available HTM implementations and specifically Intel TSX do not precisely provide R1–R4. In the following Section 5 we discuss how Cloak can be instantiated on commercially available (and not ideal) HTM, what leakage remains in practice, and how this can be minimized.

5 Cloak based on Intel TSX

Cloak can be built using an HTM that satisfies R1–R4 established in the previous section. We propose Intel TSX as an instantiation of HTM for Cloak to mitigate the cache side channel. In this section, we evaluate how far Intel TSX meets R1–R4 and devise strategies to address those cases where it falls short. All experiments we report on in this section were executed on Intel Core i7 CPUs of the Skylake generation (i7-6600U, i7-6700, i7-6700K) with 4 MB or 8 MB of LLC. The source code of these experiments will be made available at http://aka.ms/msr-cloak.
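The R1/R2 semantics that Cloak relies on can be sketched with a toy HTM model. This is a conceptual stand-in for Intel TSX’s xbegin/xend, not real transactional code; the class and method names are invented:

```python
class AbortError(Exception):
    """Raised when the toy transaction aborts (analogue of a TSX abort)."""

class ToyHTM:
    def __init__(self):
        self.cached = set()   # lines currently held in 'the cache'
        self.tx_set = set()   # transactional (read/write set) lines
        self.active = False

    def xbegin(self, sensitive_lines):
        # Deterministic preload of all sensitive lines at transaction
        # start (R1): they become part of the transactional sets.
        self.active = True
        self.tx_set = set(sensitive_lines)
        self.cached |= self.tx_set

    def evict(self, line):
        # E.g. a concurrent attacker evicting a cache line. Evicting a
        # transactional line aborts before any non-cached access (R2).
        self.cached.discard(line)
        if self.active and line in self.tx_set:
            self.active = False
            raise AbortError(line)

    def access(self, line):
        # Secret-dependent accesses inside the transaction are always
        # served from the cache, so they never leak through misses.
        assert self.active and line in self.cached

    def xend(self):
        self.active = False

def aborts(action):
    try:
        action()
    except AbortError:
        return True
    return False
```

A successful xbegin/access/xend sequence thus never incurs an observable miss on sensitive lines, while an attacker-induced eviction only ever produces an abort.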
We summarize our findings and then describe the methodology:

R1 and R2 hold for data. TSX supports read-only data that does not exceed the size of the LLC (several MB) and write data that does not exceed the size of the L1 cache (several KB);

R1 and R2 hold for code that does not exceed the size of the LLC;

R3 and R4 hold in the cloud and SGX attacker scenarios from Section 3, but not in general for local attacker scenarios.

Figure 2: A TSX transaction over a loop reading an array of increasing size. The failure rate reveals how much data can be read in a transaction. Measured on an i7-6600U with 4 MB LLC.
5.1.1 Requirements 1&2 for Data

Our experiments and previous work [19] find the read set size to be ultimately constrained by the size of the LLC: Figure 2 shows the failure rate of a simple TSX transaction depending on the size of the read set. The abort rate reaches 100% as the read set size approaches the limits of the LLC (4 MB in this case). In a similar experiment, we observed 100% aborts when the size of data written in a transaction exceeded the capacity of the L1 data cache (32 KB per core). This result is also confirmed in Intel’s Optimization Reference Manual [30] and in previous work [19, 44, 64].

Conflicts and Aborts. We always observed aborts when read or write set cache lines were actively evicted from the caches by concurrent threads. That is, evictions of write set cache lines from the L1 cache and read set cache lines from the LLC are sufficient to cause aborts. We also confirmed that transactions abort shortly after cache line evictions: using concurrent clflush instructions on the read set, we measured abort latencies in the order of a few hundred cycles (typically with an upper bound of around 500 cycles). In case varying abort times should prove to be an issue, the attacker’s ability to measure them, e.g., via Prime+Probe on the abort handler, could be thwarted by randomly choosing one out of many possible abort handlers and rewriting the xbegin instruction accordingly,1 before starting a transaction.

Tracking of the Read Set. We note that the data structure that is used to track the read set in the LLC is unknown. The Intel manual states that “an implementation-specific second level structure” may be available, which probabilistically keeps track of the addresses of read-set cache lines that were evicted from the L1 cache. This structure is possibly an on-chip bloom filter, which tracks the read-set membership of cache lines in a probabilistic manner that may give false positives but no false negatives.2 There may exist so far unknown leaks through this data structure. If this is a concern, all sensitive data (including read-only data) can be kept in the write set in L1. However, this limits the working set to the L1 and also requires all data to be stored in writable memory.

L1 Cache vs. LLC. By adding all data to the write set, we can make R1 and R2 hold for data with respect to the L1 cache. This is important in cases where victim and attacker potentially share the L1 cache through hyper-threading.3 Shared L1 caches are not a concern in the cloud setting, where it is usually ensured by the hypervisor that corresponding hyper-threads are not scheduled across different tenants. The same can be ensured by the OS in the local setting. However, in the SGX setting a malicious OS may misuse hyper-threading for an L1-based attack. To be not constrained to the small L1 in SGX nonetheless, we propose solutions to detect and prevent such attacks later on in Section 7.2.

We conclude that Intel TSX sufficiently fulfills R1 and R2 for data if the read and write sets are used appropriately.

5.1.2 Requirements 1&2 for Code

We observed that the amount of code that can be executed in a transaction seems not to be constrained by the sizes of the caches. Within a transaction with strictly no reads and writes we were reliably able to execute more

1 The 16-bit relative offset to a transaction’s abort handler is part of the xbegin instruction. Hence, for each xbegin instruction, there is a region of 1 024 cache lines that can contain the abort handler code.
2 In Intel’s Software Development Emulator [29] the read set is tracked probabilistically using bloom filters.
3 Context switches may also allow the attacker to examine the victim’s L1 cache state “postmortem”. While such attacks may be possible, they are outside our scope. TSX transactions abort on context switches.
than 20 MB of nop instructions or more than 13 MB of arithmetic instructions (average success rate ~10%) on a CPU with 8 MB LLC. This result strongly suggests that executed code does not become part of the read set and is in general not explicitly tracked by the CPU.

To still achieve R1 and R2 for code, we attempted to make code part of the read or write set by accessing it through load/store operations. This led to mixed results: even with considerable effort, it does not seem possible to reliably execute cache lines in the write set without aborting the transaction.4 In contrast, it is generally possible to make code part of the read set through explicit loads. This gives the same benefits and limitations as using the read set for data.

Code in the L1 Cache. Still, as discussed in the previous Section 5.1.1, it can be desirable to achieve R1 and R2 for the L1 cache depending on the attack scenario. Fortunately, we discovered undocumented microarchitectural effects that reliably cause transactional aborts in case a recently executed cache line is evicted from the cache hierarchy. Figure 3 shows how the transactional abort rate relates to the amount of code that is executed inside a transaction. This experiment suggests that a concurrent (hyper-) thread can cause a transactional abort by evicting a transactional code cache line currently in the L1 instruction cache. We verified that this effect exists for direct evictions through the clflush instruction as well as indirect evictions through cache set conflicts. However, self-evictions of L1 code cache lines (that is, when a transactional code cache line is replaced …

4 In line with our observation, Intel’s documentation [31] states that “executing self-modifying code transactionally may also cause transactional aborts”.

Figure 3: A TSX transaction over a nop-sled with increasing length. A second thread waits and then flushes the first cache line once before the transaction ends. The failure rate starts at 100% for small transaction sizes. If the transaction self-evicts the L1 instruction cache line, e.g., when executing more than 32 KB of instructions, the transaction succeeds despite of the flush. Measured on an i7-6600U with 32 KB L1 cache.

5.1.3 Requirements 3&4

As modern processors are highly parallelized, it is difficult to guarantee that memory fetches outside a transaction are not influenced by memory fetches inside a transaction. For precisely timed evictions, the CPU may still enqueue a fetch in the memory controller, i.e., a race condition. Furthermore, the hardware prefetcher is triggered if multiple cache misses occur on the same physical page within a relatively short time. This is known to introduce noise in cache attacks [24, 62], but also to introduce side-channel leakage [6].

In an experiment with shared memory and a cycle-accurate alignment between attacker and victim, we investigated the potential leakage of Cloak instantiated with Intel TSX. To make the observable leakage as strong as possible, we opted to use Flush+Reload for the attack primitive. We investigated how a delay between the transaction start and the flush operation and a delay between the flush and the reload operations influence the probability that an attacker can observe a cache hit against code or data placed into transactional memory. The victim in this experiment starts the transaction by placing data and code into transactional memory in a uniform manner (using either reads, writes or execution). The victim then simulates meaningful program flow, followed by an access to one of the sensitive cache lines and terminating the transaction. The attacker “guesses” which cache line the victim accessed and probes it. Ideally, the attacker should not be able to distinguish between correct and wrong guesses.

Figure 4 shows two regions where an attacker could observe cache hits on a correct guess. The left region corresponds to the preloading of sensitive code/data at the beginning of the transaction. As expected, cache hits in this region were observed to be identical to runs where the attacker had the wrong guess. On the other hand, the right region is unique to instances where the attacker made a correct guess. This region thus corresponds to a window of around 250 cycles, where an attacker could potentially obtain side-channel information. We explain the existence of this window by the fact that Intel did not design TSX to be a side-channel free primitive, thus R3 and R4 are not guaranteed to hold and a limited amount of leakage remains. We observed identical high-level results …
[Figure 5 residue: L1 data cache ways 0-7; cache lines CL 2 ... CL 63; 2 x 8 x 64 bytes = 1024 bytes; read set in rest of memory, constrained by the LLC]

though not directly applicable to Cloak, Shih et al. [59]
Figure 5: Allocation of read and write sets in memory to avoid conflicts in the L1 data cache.

[Figure 6 residue: disassembled cache lines; the byte sequences c3 41 57 41 56 45 31 c9 41 55 41 54 and 0f 1f c3 ba 07 00 00 00 55 53 31 ff decode as retq and a multi-byte nop followed by push/xor/mov instructions at offsets 0x0-0x3e]

Figure 6: Cache lines are augmented with a multi-byte nop instruction. The nop contains a byte c3 which is the opcode of retq. By jumping directly to the retq byte, we preload each cache line into the L1 instruction cache.

as this would have unwanted side effects. Instead, we insert a multi-byte nop instruction into every cache line, as shown in Figure 6. This nop instruction does not change the behavior of the code during actual function execution and only has a negligible effect on execution time. However, the multi-byte nop instruction allows us to incorporate a byte c3, which is the opcode of retq. Cloak jumps to this return instruction, loading the cache line into the L1 instruction cache but not executing the actual function. In the preloading phase, we perform a call to each such retq instruction in order to load the corresponding cache lines into the L1 instruction cache. The retq instruction immediately returns to the preloading function. Instead of retq instructions, equivalent jmp reg instructions can be inserted to avoid touching the stack.

5.2.3 Splitting Transactions

In case a sensitive function has greater capacity requirements than those provided by TSX, the function needs to be split into a series of smaller transactional units. To prevent leakage, the control flow between these units and their respective working sets needs to be input-independent. For example, consider a function f() that iterates over a fixed-size array, e.g., in order to update certain elements. By reducing the number of loop iterations in f() and invoking it separately on fixed parts of

5.3 Toolset

We implemented the read-set preloading strategy from Section 5.2.1 in a small C++ container template library. The library provides generic read-only and writable arrays, which are allocated in "read" or "write" cache lines respectively. The programmer is responsible for arranging data in the specialized containers before invoking a Cloak-protected function. Further, the programmer decides which containers to preload. Most local variables and input and output data should reside in the containers. Further, all sub-function calls should be inlined, because each call instruction performs an implicit write of a return address. Avoiding this is important for large read sets, as even a single unexpected cache line in the write set can greatly increase the chances for aborts.

We also extended the Microsoft C++ compiler version 19.00. For programmer-annotated functions on Windows, the compiler adds code for starting and ending transactions, ensures that all code cache lines are preloaded (via read or execution according to Section 5.2.2) and, to not pollute the write set, refrains from unnecessarily spilling registers onto the stack after preloading. Both library and compiler are used in the SGX experiments in Section 7.1.

6 Retrofitting Leaky Algorithms

To evaluate Cloak, we apply it to existing weak implementations of different algorithms. We demonstrate that in all cases, in the local setting (Flush+Reload) as well as the cloud setting (Prime+Probe), Cloak is a practical countermeasure to prevent state-of-the-art attacks. All experiments in this section were performed on a mostly idle system equipped with an Intel i7-6700K CPU with 16 GB DDR4 RAM, running a default-configured Ubuntu Linux 16.10. The applications were run as regular user programs, not pinned to CPU cores, but sharing CPU cores with other threads in the system.

6.1 OpenSSL AES T-Tables

As a first application of Cloak, we use the AES T-table implementation of OpenSSL which is known to be sus-
tion failing more often if under attack.

While not under attack, the implementation protected with Cloak started 0.8% more encryptions than the baseline implementation (i.e., with preloading) and less than 0.1% of the transactions failed. This is not surprising, as the execution time of the protected algorithm is typically below 500 cycles. Hence, preemption or interruption of the transaction is very unlikely. Furthermore, cache evictions are unlikely because of the small read set size and optimized preloading (cf. Section 5.2.1). Taking the execution time into account, the implementation protected with Cloak was 0.8% faster than the baseline implementation. This is not unexpected, as Zacharopoulos [64] already found that TSX can improve performance.

Figure 7: Color matrix showing cache hits on an AES T-table. Darker means more cache hits. Measurement performed over roughly 2 billion encryptions. Prime+Probe depicted on the left, Flush+Reload on the right. [Axes: plaintext byte (00-f8) vs. cache line (0-15); color scale: cache hits.]

[Residue of a cache-trace plot: y-axis 0-0.6; x-axis "Time (in cycles) ·10^4", 0-8]
Figure 9: Color matrix showing cache hits on function code. Darker means more cache hits. Measurement performed over roughly 100 million function executions for Prime+Probe (left) and 10 million function executions for Flush+Reload (right). [Axes: function vs. cache line 0-15.]

Figure 11: Cache traces for the multiply routine (as used in RSA) over 10 000 exponentiations. The secret exponent is depicted as a bit sequence. The variant protected with Cloak does not have visual patterns that correspond to the secret exponent. [Address range 0x80700-0x83200.]

Cloak, we reproduced the attack by Gruss et al. [24] on a recent version of the GDK library (3.18.9) which comes with Ubuntu 16.10. We attack the binary search in gdk_keyval_from_name, which is executed upon every keystroke in a GTK window. As shown in Figure 12, the cache template matrix of the unprotected binary search reveals the search pattern, narrowing down on the darker area where the letter keys are and thus the search ends. In case of the implementation protected by Cloak, the search pattern is disguised. With the keystroke information protected by Cloak, we could neither measure a difference in the perceived latency when typing through a keyboard, nor measure an overall increase of the system

input record. As a result, observations of unprotected
[Table residue: average cycles per query: baseline]
Shuai Wang, Pei Wang, Xiao Liu, Danfeng Zhang, and Dinghao Wu
The Pennsylvania State University
{szw175, pxw172, xvl5190}@ist.psu.edu, zhang@cse.psu.edu, dwu@ist.psu.edu
Figure 3: The architecture of CacheD.

in our evaluation. As previously discussed, existing research [19, 20] aims at reasoning about the "upper bound" of information leakage through abstract interpretation, but may not be precise enough due to over-approximation. Moreover, CacheD distinguishes itself by being able to point out where/what the vulnerability is, and provides examples that are likely to trigger the issue. Considering CacheD as a "debugging" or vulnerability detection tool, it is equally important to adopt its precise and practical techniques for side-channel detection.

4 Design

We present CacheD, a tool that delivers scalable detection of cache-based timing channels in real-world cryptosystems. Fig. 3 shows the architecture of CacheD. In general, given a binary executable with secrets as inputs, we first get a concrete execution trace by monitoring its execution (§4.1). The trace is then fed into CacheD to perform taint analysis; we mark the secret as the taint seed (§4.2) and propagate the taint information along the trace to identify instructions related to the usage of the secret.

CacheD then symbolizes the secret into one or several symbols (each symbol represents one word), and performs symbolic execution along the tainted instructions on the trace (§4.3). During the symbolic interpretation, CacheD builds symbolic formulas for each memory access along the trace. Symbolic memory access formulas are further analyzed using a constraint solver to check whether cache behavior variations exist. As aforementioned, we check the satisfiability of (F(~k) ≫ L ≠ F(~k′) ≫ L) ∧ C; if satisfiable, the solutions for ~k and ~k′ represent different secret values that can lead to different cache behavior at this program point. The only architecture-specific parameter to CacheD is the cache line size. As discussed in §2.1, we set L to 6 throughout this paper since most CPUs on the market have a cache-line size of 64 bytes. Next, we elaborate on the challenges and design of each step in the following sections.

4.1 Execution Trace Generation

CacheD takes a concrete program execution trace as its input. In general, the execution trace can be generated by employing dynamic instrumentation tools to monitor the execution of the target program and dump the execution

Locating Secrets in the Trace. Besides the dumped execution trace and the context information, another input to CacheD is the locations (e.g., a memory location or a register) of the secrets in a program. This information serves as the seed for the taint analysis and symbolic execution in later stages.

While the secrets (e.g., the private key or a random number) are usually obvious in the source code, it may not be straightforward to identify the location of the secret in an execution trace, since variable names are absent in the assembly code. Treating this as a typical (manual) reverse engineering task, our approach to searching for the secrets in the assembly is to "correlate" memory reads with the usage of the key in the source code. To do so, we identify the critical function in the source code where the key is initialized and then search for that function in the assembly code. The search space can be further reduced by cutting the assembly code into small regions according to the conditional jumps in the context. With further reverse engineering effort in small regions, we can eventually recognize the location of the secret in the assembly code, as a register or a sequence of memory cells.

Although currently this step is largely manual, it is likely that it can be automated by a secret-aware compiler, which tracks the location of secrets throughout the compilation; however, we leave this as future work.

4.2 Taint Analysis

CacheD leverages symbolic execution to interpret each instruction along a trace to reason about memory accesses that are dependent on secrets. Our tentative tests show that the symbolic-level interpretation is one performance bottleneck of CacheD. However, we notice that only a subset of instructions in a trace is dependent on the secrets. Thus, a natural optimization in our context is to leverage taint analysis to rule out instructions that are irrelevant to the secret; the remaining instructions are the focus of the more heavy-weight symbolic execution.

After reading the execution trace, CacheD first parses the instructions into its internal representations. It then starts the taint analysis from the first usage of the secret. Following existing taint analyses (e.g., [48, 52]), we propagate the taint information along the trace following pre-defined tainting rules that we discuss shortly. After the taint analysis, we keep the instructions whose operands are tainted.

Taint propagation rules define how tainted information flows through instructions, memories and CPU flags, as
While taint analysis is efficient, symbolic execution and constraint solving are time consuming in general. Here we discuss several optimizations in CacheD.

Identify Independent Vulnerabilities. To capture information flow through memory operations in symbolic execution, we create a fresh key symbol for a memory load of unknown position whenever the base or memory offset is computed from key symbols. In this section, we propose a finer-grained policy, which reveals "independent" vulnerable program points. The key motivation is that, by studying the underlying memory layout, attackers would be able to learn relations between the newly-created key symbol and the memory addressing formula (which contains one or several key symbols). Hence, we assume further vulnerabilities revealed through the usage of this new key symbol would mostly leak the same piece (or a subset) of secret information (we elaborate on this design choice shortly).

In general, we consider independent vulnerabilities highly informative to attackers; independent vulnerabilities likely indicate the most probable attack surface of the victim, because stealing a secret through "dependent" vulnerabilities requires additional effort to learn the program memory layout. On the other hand, memory layouts are feasible and likely to be learned, as precomputed data structures are widely deployed in real-world cryptosystems to speed up the computation. Overall, "dependent" vulnerabilities reveal an additional attack surface which is commonly ignored by previous research.

Early Stop Criterion of Symbolic Execution. One vulnerable program point (e.g., a table query) can be executed one or more times during the runtime (thus appearing more than once on the execution trace). On the other hand, a program point can be considered "vulnerable" as long as one of its usages is
Experiment setup. All the cryptosystems are C/C++ libraries. We write simple programs to invoke the test libraries for key generation, encryption as well as decryption. We generate keys of 128 bits for the AES experiments, and keys of 2048 bits for the other experiments. After generating the keys, for all tested cryptographic algorithms, we first use their encryption routines to encrypt a plain text "hello world". The encrypted message is then fed into the decryption procedures. As previously introduced, the execution traces of those decryption procedures are logged for analysis. The programs are compiled into binary code on 32-bit Ubuntu 12.04, with the gcc/g++ compiler (version 4.6.3).

7.1 Evaluation Result Overview

Vulnerability Identification. We present the breakdown of the positives reported by CacheD in Table 2. As shown in the table, most of the evaluated implementations are reported to contain vulnerabilities that can lead to cache-based side-channel attacks. Overall, CacheD reveals 156 (84 known and 72 unknown) vulnerable program points, among which 88 (84 known and 4 unknown) program points are independent. Considering the large number of issues discovered by CacheD, we interpret the evaluation result as promising.

In general, existing research has pointed out potential issues that can lead to cache-based side-channel attacks on implementations of sliding-window based modular exponentiation [20], and such an implementation is leveraged by both the RSA and ElGamal decryption procedures. In this research CacheD has successfully confirmed such already-reported issues. We present a detailed study of two independent vulnerable program points found in the RSA implementation of Libgcrypt 1.6.1 in §7.3, and also compare our findings on RSA and ElGamal with the existing literature in §7.4.1. Besides, considering its multiple rounds of table lookups, AES has also been pointed out as vulnerable to cache-based side channel attacks by previous work [15]. CacheD reports consistent findings in §7.4.2.

Moreover, CacheD has also successfully identified a number of vulnerable program points in two widely-used cryptosystems (Botan and OpenSSL). Those vulnerabilities, to the best of our knowledge, are unknown to existing research (the "unknown" issues). We elaborate on these identified issues in §7.5.

We also evaluate CacheD on the RSA and ElGamal implementations of Libgcrypt 1.7.3, which are considered safe from information leakage since there is no secret-dependent memory access. CacheD reports no vulnerable program point in either the ElGamal or the RSA implementation. Indeed, the taint analysis of CacheD identifies zero secret-dependent memory accesses in both implementations. Although trace-based analysis is in general not sufficient to "prove" a cryptosystem free of information leakage, considering that related research is less scalable ([19]), CacheD presents a scalable and practical way to study such industrial-strength cryptosystems.

Processing Time. We also report the processing time of CacheD in Table 2. The experiments use a machine with a 2.90GHz Intel Xeon(R) E5-2690 CPU and 128GB memory. Table 2 presents the number of processed instructions and the processing time for each experiment. In general, we evaluate CacheD on five industrial-strength cryptosystems, with over 120 million instructions in total. Table 2 shows that all the experiments can be finished within 5 CPU hours. We interpret the processing time of CacheD as very promising, and this evaluation demonstrates the high scalability of CacheD on real-world cryptosystems.

Our evaluation also shows that the proposed optimizations (§5) are effective and improve the overall scalability of CacheD. Indeed, a tentative implementation of CacheD (without optimizations) times out after 20 hours when processing the ElGamal Libgcrypt 1.6.1 test case. On the other hand, we observed that CacheD becomes slower largely due to the nature of symbolic execution; with more symbolic variables and formulas carried along, each further reasoning step can take more time. In addition, since symbolic formulas can grow larger during interpretation (e.g., variables are manipulated across iterations of a loop), the solver can encounter more challenging problems during further constraint solving. Also, branch con-
Figure 4: Case study of two independent RSA vulnerable program points in Libgcrypt 1.6.1. The vulnerable program points and the corresponding memory access instructions on the trace are bold. The tainted variables in source code and trace are red, and e is the secret (line 5 in Fig. 4a). Note that base_u_size is not tainted, owing to the optimization for the RSA precomputed size table (§5).
Fig. 4b). To confirm the findings, we compile four program binaries with modified e according to the solutions.

Fig. 4c shows the simulation outputs using gem5. Due to the limited space, we only provide the first eight records for the first vulnerable program point (there are 604 records in total). The first column of each output represents the program counters; the second column shows the accessed memory addresses; and the last column is the cache status of the accessed cache lines. Note that program counters 0x44156a and 0x441572 represent the first and second vulnerable program points, respectively. Comparing these two results, we can observe that different cache lines are accessed (the corresponding memory addresses are marked red at lines 1-2), which further leads to a timing difference of three cache hits vs. misses (the corresponding cache statuses are marked red at lines 5-7). Simulation results for the second memory access are omitted due to the limited space; we observed similar results there.

Consistent with the existing findings, these two table queries are also reported as vulnerable by previous work [20]. According to our taint policy, the table query output (base_u) would be tainted since e′ is tainted. Our study also shows that memory accesses through base_u reveal another twenty vulnerable program points (row 2 in Table 2). Moreover, this modular exponentiation function is used by both the RSA and ElGamal decryption procedures; two independent vulnerable program points found in the ElGamal implementation (row 4 in Table 2) are also due to these table queries.

7.4 Known Vulnerabilities

We also evaluate CacheD by confirming known side-channel vulnerabilities.

7.4.1 RSA and ElGamal in Libgcrypt

Function gcry_mpi_powm (Fig. 4a), found in the Libgcrypt (1.6.1) sliding-window implementation of the modular exponentiation algorithm, is vulnerable. Note that this implementation is used by both the RSA and ElGamal decryption procedures. We have already presented results and a case study in §7.3. Besides those two independent program points, CacheD finds 19 vulnerable program points in the ElGamal implementation and 20 points in the RSA implementation (Table 2).

Consistent with previous work [20], CacheD confirms two vulnerable program points in Libgcrypt (1.6.1) that can lead to cache-based timing attacks. On the other hand, while previous work [20] only reports potential timing channels through these two direct usages of secrets, CacheD can actually detect further unknown (to the best of our knowledge) pitfalls (around 20 unknown points for each). The results show that CacheD can provide developers with more comprehensive information regarding side-channel issues.

7.4.2 AES in OpenSSL

We also analyzed the positive results identified in the AES implementations of OpenSSL (versions 0.9.7c and 1.0.2f). In general, standard AES decryption undertakes
[Figure 5 residue: two-level precomputed table; Level 1 holds B1, B3, B5, B7, ..., BN, indexed by k0 (a window of the key); the Level 2 tables each have length N; below them sits the precomputed size table]

Figure 5: Two-level precomputed table and precomputed size table used in RSA and ElGamal. Our observation shows that the lengths of all the second-level precomputed tables are equal in non-trivial decryption processes of RSA and ElGamal. In other words, attackers can hardly infer k0 by observing query outputs of the precomputed size table.

    ...
    for(size_t i = exp_nibbles; i > 0; --i) {
        ...
        const u32bit nibble = exp.get_substring(
            window_bits*(i-1), window_bits);

        // note that the following code is not a mem access
        const BigInt& y = g[nibble];

        bigint_monty_mul(&z[0], z.size(),
            x.data(), x.size(), x.sig_words(),
            y.data(), y.size(), y.sig_words(),
            ...
    }
    }

    size_t sig_words() const {
        const word* x = &reg[0];
        size_t sig = reg.size();
        ...
    }

Figure 7: Unknown RSA vulnerabilities found in the Montgomery exponentiator of Botan (version 1.10.13). Tainted variables are marked red and the vulnerable program points are bold. sig is not tainted, according to the optimization for the RSA precomputed size table (§5).
[Figure 1(c)/(d): example program with its Windows native API (system call) sequence and dependencies]

1: int x, y; // x is an input
2: HANDLE out = CreateFile ( "a.txt", … );
3: y = x << 1;
4: WriteFile ( out, &y, sizeof y, … );
5: CloseHandle ( out );

[System call sequence and dependency: NtCreateFile → NtWriteFile (input int: y, HANDLE: out) → NtClose (HANDLE: out)]
pose them from equivalence checking (step 3). Then we utilize a constraint solver to verify whether two WP formulas are equivalent (step 4). Following a similar style, we compare the remaining system call pairs. At last, we perform an approximate matching on identified cryptographic functions (step 5) and calculate the final similarity score.

Now we use the examples shown in Section 2.1 to describe how BinSim improves existing semantics-based binary diffing approaches. Assume we have got the aligned system call sequences shown in Figure 1(d). Starting from the address of the argument y in NtWriteFile, we do backward slicing and compute the WP with respect to y. The results for the three programs are shown as follows.

ψ1a : x + x
ψ1b : 2 × ((x ⊕ (x >> 31)) − (x >> 31))
ψ1c : x << 1

To verify whether ψ1a = ψ1b, we check the equivalence of the following formula:

The constraint solver will prove that Formula 2 is always true but Formula 1 is not. Apparently, we can find a counterexample (e.g., x = −1) to falsify Formula 1. Therefore, we have ground truth that the NtWriteFile in Figure 1(a) and Figure 1(c) are truly matched, while NtWriteFile in Figure 1(a) and Figure 1(b) are conditionally equivalent (when the input satisfies x ≥ 0).

[Figure 4 residue: BB2: c = a - b, out(c) = (i+1) - (j << 2); BB2′: c′ = b′ - a′, out(c′) = (j′ << 2) - (i′ + 1); panels (a), (b)]

Figure 4: Semantic difference spreads across basic blocks.

For the three different implementations shown in Figure 3, BinSim works on the execution traces under the same input (n). In this way, the loops in Figure 3 have been unrolled. Starting from the output argument, the resulting WP captures the semantics of "bits counting" across basic blocks. Therefore, we are able to verify that the three algorithms are equivalent when taking the same unsigned 32-bit integer as input. Similarly, we can verify that the two code snippets in Figure 4 are not semantically equivalent.

2.3 Architecture

Figure 6 illustrates the architecture of BinSim, which comprises two stages: online trace logging and offline comparison. The online stage, as shown in the left side of Figure 6, involves two plug-ins built on Temu [66], a whole-system emulator: generic unpacking and on-demand trace logging. Temu is also used as a malware execution sandbox in our evaluation. The recorded traces are passed to the offline stage of BinSim for comparison (right part of Figure 6). The offline stage consists of three components: preprocessing, slicing and WP calculation, and segment equivalence checker. Next, we will present each step of BinSim in the following two sections.
Figure 6: Schematic overview of BinSim. The outputs of each processing step: (1) unpacked code, instruction log, memory log, and system call sequences; (2) IL traces and initial matched system call pairs; (3) weakest preconditions of system-call sliced segments; (4) identified cryptographic functions.
morphism [14] and tree automata inference [3] are orthogonal to our approach.

4.2 Dynamic Slicing Binary Code

After system call alignment, we further examine the aligned system calls to determine whether they are truly equivalent. To this end, commencing at a tainted system call's argument, we perform dynamic slicing to backtrack a chain of instructions with data and control dependencies. The slice criterion is ⟨eip, argument⟩, where eip indicates the value of the instruction pointer and argument denotes the argument taken as the beginning of backwards slicing. We terminate our backward slicing when the source of the slice criterion is one of the following: the output of the previous system call, a constant value, or a value read from the initialized data section. Standard dynamic slicing algorithms [1, 80] rely on the program dependence graph (PDG), which explicitly represents both data and control dependencies. However, compared to source code slicing, dynamic slicing on obfuscated binaries is never a textbook problem. The indirect memory accesses of binary code pollute conventional data flow tracking. Tracking control dependencies in obfuscated binary code by following explicit conditional jump instructions is far from enough. Furthermore, the decode-dispatch loop of virtualization obfuscation will also introduce many fake control dependencies. As a result, conventional dynamic slicing algorithms [1, 80] will cause undesired slice explosion, which will further complicate weakest precondition calculation. Our solution is to split data and control dependency tracking into three steps: 1) index and value based slicing that only considers data flow; 2) tracking control dependencies; 3) removing the fake control dependencies caused by the virtualization obfuscation code dispatcher.

BinSim shares a similar idea with Coogan et al. [19] in that we both decouple tracing control flow from data flow when handling virtualization obfuscation. Coogan et al.'s approach is implemented through an equational reasoning system, while BinSim's dynamic slicing is built on an intermediate language (Vine IL). However, BinSim is different from Coogan et al.'s work in a number of ways, which we will discuss in Section 7.

4.2.1 Index and Value Based Slicing

We first trace the instructions with data dependencies by following the "use-def" chain (ud-chain). However, the conventional ud-chain calculation may result in precision loss when dealing with indirect memory accesses, in which general registers are used to compute the memory access index. There are two ways to track the ud-chain of an indirect memory access, namely index based and value based. Index based slicing, like the conventional approach, follows the ud-chain related to the memory index. For the example of mov edx, [4*eax+4], the instructions affecting the index eax will be added. Value based slicing, instead, considers the instructions related to the value stored in the memory slot. Therefore, the last instruction that writes to the memory location [4*eax+4] will be included. In most cases, value based slicing is much more accurate. Figure 7 shows a comparison between index based slicing and value based slicing on the same trace. Figure 7(a) presents the C code of the trace. In Figure 7(b), index based slicing selects the instructions related to the computation of the memory index j = 2*i + 1. In contrast, value based slicing in Figure 7(c) contains the instructions that are relevant to the computation of the memory value A[j] = a + b, which is exactly the expected slicing result. However, there is an exception where we have to adopt index based slicing: when an indirect memory access is a valid index into a jump table. Jump tables are typically located in read-only data sections or code sections, and the jump table contents should not be modified by other instructions. Therefore, we switch to tracking the index ud-chain, i.e., eax rather than the memory content.

4.2.2 Tracking Control Dependency

Next, we include the instructions that have control dependencies with the instructions in the last step. In addition to explicit conditional jump instructions (e.g., je
WP = F1 ∧ F2 ∧ ... ∧ Fk.

Opaque predicates [17], a popular control flow obfuscation scheme, can lead to a very complicated WP formula by adding infeasible branches. We apply a recent opaque predicate detection method [45] to identify so-called invariant, contextual, and dynamic opaque predicates. We remove the identified opaque predicates to reduce the size of the WP formula.

3. 0.5: the aligned system call pairs do not satisfy the above conditions. The score indicates their arguments are either conditionally equivalent or semantically different.

Assume the system call sequences collected from programs a and b are Ta and Tb, and the number of aligned system calls is n. We define the similarity calculation as follows.
Table 2 (recovered rows 2-7; columns: type number, obfuscation type, benchmark):

2 | Control flow | Loop unrolling, opaque predicates, control flow flattening, function inlining
3 | ROP | Synthetic benchmarks collected from the reference [79]
4 | Different implementations | BitCount (Figure 3), isPowerOfTwo (Appendix Figure 12), flp2 (Appendix Figure 13)
5 | Covert computation [59] | Synthetic benchmarks
6 | Single-level virtualization | VMProtect [69]
7 | Multi-level virtualization | Synthetic benchmarks collected from the reference [79]
Figure 8: Similarity scores change from right pairs to wrong pairs.

5 Experimental Evaluation
We conduct our experiments with several objectives. First and foremost, we want to evaluate whether BinSim outperforms existing binary diffing tools in terms of better obfuscation resilience and accuracy. To accurately assess comparison results, we design a controlled dataset so that we have ground truth. We also study the effectiveness of BinSim in analyzing a large set of malware variants with intra-family comparisons. Finally, performance data are reported.

5.1 Experiment Setup

Our first testbed consists of an Intel Core i7-3770 processor (quad core, 3.40GHz) and 8GB memory, running Ubuntu 14.04. We integrate FakeNet [63] into Temu to simulate real network connections, including DNS, HTTP, SSL, Email, FTP, etc. We carry out the large-scale comparisons for malware variants in the second testbed, which is a private cloud containing six instances running simultaneously. Each instance is equipped with a duo core, 4GB memory, and 20GB disk space. The OS and network configurations are similar to the first testbed. Before running a malware sample, we reset Temu to a clean snapshot to eliminate the legacy effects caused by previous executions (e.g., modified registry configurations). To limit possible time-related execution deviations, we utilize Windows Task Scheduler to run each test case at the same time.

ing to such examples. We collect eight malware source code samples with different functionalities from VX Heavens2. We investigate the source code to make sure they are different, and that each sample can fully exhibit its malicious behavior at runtime. Besides, we also collect synthetic benchmarks from previous work. The purpose is to evaluate some obfuscation effects that are hard to automate. Our controlled dataset statistics are shown in Table 2. The second column of Table 2 lists the different obfuscation schemes and combinations we applied.

In addition to BinSim, we also test six other representative binary diffing tools. BinDiff [23] and DarunGrim [50] are two popular binary diffing products in industry. They rely on control flow graphs and heuristics to measure similarities. CoP [41] and iBinHunt [43] represent "block-centric" approaches. Based on semantically equivalent basic blocks, iBinHunt compares two execution traces while CoP identifies the longest common subsequence with static analysis. System call alignment and feature set are examples of dynamic-only approaches. "Feature set" indicates the method proposed by Bayer et al. [6] in their malware clustering work. They abstract a system call sequence into a set of features (e.g., OS objects, OS operations, and dependencies) and measure the similarity of two feature sets by the Jaccard index. For comparison, we have implemented the approaches of CoP [41], iBinHunt [43], and feature set [6]. The system call alignment is the same as the method adopted by BinSim.

3 We have normalized all the similarity scores to the range 0.0 ∼ 1.0.
Table 3: Absolute difference values of similarity scores under different obfuscation schemes and combinations.
Sample Obfuscation type BinDiff DarunGrim iBinHunt CoP Syscall alignment Feature set BinSim
BullMoose 6 0.58 0.56 0.39 0.61 0.08 0.10 0.08
Clibo 1+6 0.57 0.64 0.41 0.62 0.10 0.12 0.10
Branko 1+2+6 0.63 0.62 0.35 0.68 0.10 0.15 0.12
Hunatcha 2 0.40 0.42 0.19 0.30 0.12 0.17 0.12
WormLabs 1 0.10 0.12 0.03 0.03 0.08 0.12 0.05
KeyLogger 2 0.38 0.39 0.12 0.26 0.09 0.15 0.09
Sasser 1+2+6 0.62 0.62 0.42 0.58 0.12 0.18 0.10
Mydoom 1+2 0.42 0.38 0.10 0.38 0.10 0.15 0.05
ROP 3 0.63 0.54 0.49 0.52 0.10 0.10 0.10
Different implementations 4 0.48 0.39 0.48 0.52 0.05 0.10 0.05
Covert computation 5 0.45 0.36 0.44 0.45 0.05 0.10 0.05
Multi-level virtualization 7 0.68 0.71 0.59 0.69 0.15 0.18 0.16
Average ("Right pairs" vs. "Obfuscation pairs"): 0.50 0.48 0.34 0.46 0.10 0.15 0.09
Average ("Right pairs" vs. "Wrong pairs"): 0.65 0.65 0.66 0.55 0.71 0.63 0.76
old. What matters is that a tool can differentiate right-pair scores from wrong-pair scores. We first test how their similarity scores change from right pairs to wrong pairs. For the right-pair testing, we compare each sample in Table 2 with itself (no obfuscation). The average values are shown in the "Right pairs" bar in Figure 8. Then we compare each sample with the other samples (no obfuscation) and calculate the average values, which are shown in the "Wrong pairs" bar in Figure 8 (see footnote 4). The comparison results reveal a similar pattern for all these seven binary diffing tools: a large absolute difference value between the right-pair score and the wrong-pair score.

Next, we figure out how the similarity score varies under different obfuscation schemes and combinations. We first calculate the similarity scores for "Right pairs" (self comparison) and "Obfuscation pairs" (the clean version vs. its obfuscated version). Table 3 shows the absolute difference values between "Right pairs" and "Obfuscation pairs". Since code obfuscation has to preserve semantics [17], small and consistent difference values indicate that a binary diffing tool is resilient to different obfuscation schemes and combinations. BinDiff, DarunGrim, iBinHunt, and CoP do not achieve a consistent (good) result for all test cases, because their difference values fluctuate. The heuristics-based comparisons adopted by BinDiff and DarunGrim can only handle mild instruction obfuscation within a basic block. Since multiple obfuscation methods obfuscate the structure of the control flow graph (e.g., ROP and control flow obfuscation), the effects of BinDiff and DarunGrim are limited. CoP and iBinHunt use symbolic execution and theorem proving techniques to match basic blocks, and are therefore resilient to intra-basic-block obfuscation (Type 1). However, they are insufficient to defeat obfuscation that may break the boundaries of basic blocks (e.g., Type 2 ∼ Type 7 in Table 1). The last rows of Table 3 show the average difference values for "Right pairs" vs. "Obfuscation pairs" and "Right pairs" vs. "Wrong pairs". The closer these two scores are, the harder it is for a tool to set a threshold or cutoff line that gives meaningful information on the similarity score.

4 This does not mean that higher is better for the right-pair similarity scores and lower is better for the wrong pairs. What is important is how the similarity values change from right pairs to wrong pairs.
Score range BinDiff DarunGrim iBinHunt CoP Syscall alignment Feature set BinSim
0.00–0.10 1 1 1 1 1 1 1
0.10–0.20 3 2 1 3 1 1 1
0.20–0.30 3 4 2 4 1 2 1
0.30–0.40 13 14 3 10 1 3 1
0.40–0.50 18 17 5 18 2 3 2
0.50–0.60 14 16 18 13 3 4 4
0.60–0.70 17 13 17 14 6 16 11
0.70–0.80 13 15 20 16 26 27 24
0.80–0.90 9 10 15 11 21 16 19
0.90–1.00 9 8 18 10 38 27 36
[18] Comparetti, P. M., Salvaneschi, G., Kirda, E., Kolbitsch, C., Kruegel, C., and Zanero, S. Identifying dormant functionality in malware programs. In Proceedings of the 2010 IEEE Symposium on Security and Privacy (S&P'10) (2010).
[19] Coogan, K., Lu, G., and Debray, S. Deobfuscation of virtualization-obfuscated software. In Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS'11) (2011).
[20] Dijkstra, E. W. A Discipline of Programming, 1st ed. Prentice Hall PTR, 1997.
[21] Egele, M., Woo, M., Chapman, P., and Brumley, D. Blanket Execution: Dynamic similarity testing for program binaries and components. In 23rd USENIX Security Symposium (USENIX Security'14) (2014).
[22] Eschweiler, S., Yakdan, K., and Gerhards-Padilla, E. discovRE: Efficient cross-architecture identification of bugs in binary code. In Proceedings of the 23rd Annual Network and Distributed System Security Symposium (NDSS'16) (2016).
[23] Flake, H. Structural comparison of executable objects. In Proceedings of the 2004 GI International Conference on Detection of Intrusions & Malware, and Vulnerability Assessment (DIMVA'04) (2004).
[24] Ganesh, V., and Dill, D. L. A decision procedure for bit-vectors and arrays. In Proceedings of the 2007 International Conference on Computer Aided Verification (CAV'07) (2007).
[25] Gao, D., Reiter, M. K., and Song, D. BinHunt: Automatically finding semantic differences in binary programs. In Proceedings of the 10th International Conference on Information and Communications Security (ICICS'08) (2008).
[26] Godefroid, P., Levin, M. Y., and Molnar, D. Automated whitebox fuzz testing. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS'08) (2008).
[27] Gröbert, F., Willems, C., and Holz, T. Automated identification of cryptographic primitives in binary programs. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection (RAID'11) (2011).
[28] Jang, J., Brumley, D., and Venkataraman, S. BitShred: Feature hashing malware for scalable triage and semantic analysis. In Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS'11) (2011).
[29] Johnson, N. M., Caballero, J., Chen, K. Z., McCamant, S., Poosankam, P., Reynaud, D., and Song, D. Differential Slicing: Identifying causal execution differences for security applications. In Proceedings of the 2011 IEEE Symposium on Security and Privacy (S&P'11) (2011).
[30] Kang, M. G., Poosankam, P., and Yin, H. Renovo: A hidden code extractor for packed executables. In Proceedings of the 2007 ACM Workshop on Recurring Malcode (WORM'07) (2007).
[31] Kawaguchi, M., Lahiri, S. K., and Rebelo, H. Conditional Equivalence. Tech. Rep. MSR-TR-2010-119, Microsoft Research, 2010.
[32] Kharraz, A., Arshad, S., Mulliner, C., Robertson, W. K., and Kirda, E. UNVEIL: A large-scale, automated approach to detecting ransomware. In Proceedings of the 25th USENIX Conference on Security Symposium (2016).
[34] Kirat, D., and Vigna, G. MalGene: Automatic extraction of malware analysis evasion signature. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS'15) (2015).
[35] Kirat, D., Vigna, G., and Kruegel, C. BareCloud: Bare-metal analysis-based evasive malware detection. In Proceedings of the 23rd USENIX Conference on Security Symposium (2014).
[36] Kolbitsch, C., Comparetti, P. M., Kruegel, C., Kirda, E., Zhou, X., and Wang, X. Effective and efficient malware detection at the end host. In Proceedings of the 18th USENIX Security Symposium (2009).
[37] Lakhotia, A., Preda, M. D., and Giacobazzi, R. Fast location of similar code fragments using semantic 'juice'. In Proceedings of the 2nd ACM SIGPLAN Program Protection and Reverse Engineering Workshop (PPREW'13) (2013).
[38] Lanzi, A., Sharif, M., and Lee, W. K-Tracer: A system for extracting kernel malware behavior. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS'09) (2009).
[39] Lindorfer, M., Di Federico, A., Maggi, F., Comparetti, P. M., and Zanero, S. Lines of malicious code: Insights into the malicious software industry. In Proceedings of the 28th Annual Computer Security Applications Conference (ACSAC'12) (2012).
[40] Lu, K., Zou, D., Wen, W., and Gao, D. deRop: Removing return-oriented programming from malware. In Proceedings of the 27th Annual Computer Security Applications Conference (ACSAC'11) (2011).
[41] Luo, L., Ming, J., Wu, D., Liu, P., and Zhu, S. Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE'14) (2014).
[42] Meng, X., and Miller, B. P. Binary code is not easy. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA'16) (2016).
[43] Ming, J., Pan, M., and Gao, D. iBinHunt: Binary hunting with inter-procedural control flow. In Proceedings of the 15th Annual International Conference on Information Security and Cryptology (ICISC'12) (2012).
[44] Ming, J., Xin, Z., Lan, P., Wu, D., Liu, P., and Mao, B. Replacement Attacks: Automatically impeding behavior-based malware specifications. In Proceedings of the 13th International Conference on Applied Cryptography and Network Security (ACNS'15) (2015).
[45] Ming, J., Xu, D., Wang, L., and Wu, D. LOOP: Logic-oriented opaque predicate detection in obfuscated binary code. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS'15) (2015).
[46] Ming, J., Xu, D., and Wu, D. Memoized semantics-based binary diffing with application to malware lineage inference. In Proceedings of the 30th International Conference on ICT Systems Security and Privacy Protection (IFIP SEC'15) (2015).
[47] Moser, A., Kruegel, C., and Kirda, E. Exploring multiple execution paths for malware analysis. In Proceedings of the 2007 IEEE Symposium on Security and Privacy (S&P'07) (2007).
Instructions | Meaning
CMOVcc | Conditional move
SETcc | Set operand to 1 on condition, or 0 otherwise
CMPXCHG | Compare and then swap
REP-prefixed | Repeated operations; the upper limit is stored in ecx
JECXZ | Jump if the ecx register is 0
LOOP | Performs a loop operation using ecx as a counter

Figure 12: Two different isPowerOfTwo algorithms check if an unsigned integer is a power of 2. [The figure's assembly listings, panels (a), (b), and (c), are not recoverable from this extraction.]

unsigned flp2_1(unsigned x) {
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return x - (x >> 1);
}

unsigned flp2_2(unsigned x) {
    unsigned y = 0x80000000;
    while (y > x) {
        y = y >> 1;
    }
    return y;
}

unsigned flp2_3(unsigned x) {
    unsigned y;
    do {
        y = x;
        x = x & (x - 1);
    } while (x != 0);
    return y;
}

Figure 13: Three different flp2 algorithms find the largest power of 2 that is no greater than a given integer x.
Table 1: A taxonomy of malicious PDF document detection techniques. This taxonomy is partially based on a systematic survey paper [40], with the addition of works after 2013, as well as summaries of parser, machine learning, and pattern dependencies and evasion techniques.
2.1 Static Techniques

One line of static analysis work focuses on JavaScript content for its importance in exploitation, e.g., a majority (over 90% according to [58]) of maldocs use JavaScript to complete an attack. PJScan [27] relies on lexical coding styles like the number of variable names, parentheses, and operators to differentiate benign and malicious JavaScript code. Vatamanu et al. [59] tokenize JavaScript code into variable types, function names, operators, etc., and construct clusters of tokens as signatures for benign and malicious documents. Similarly, Lux0r [14] constructs two sets of API reference patterns found in benign and malicious documents, respectively, and uses this to classify maldocs. MPScan [31] differs from other JavaScript static analyzers in that it hooks AAR and dynamically extracts the JavaScript code. However, given that the code analysis is still statically performed, we consider it a static analysis technique.

A common drawback of these approaches is that they can be evaded with heavy obfuscation and dynamic code loading (except for [31], as it hooks into AAR at runtime). Static parsers extract JavaScript based on pre-defined rules on where JavaScript code can be placed/hidden. However, given the flexibility of PDF specifications, it is up to an attacker's creativity to hide the code.

The other line of work focuses on examining PDF file metadata rather than its actual content. This is partially inspired by the fact that obfuscation techniques tend to abuse the flexibility in PDF specifications and hide malicious code by altering the normal PDF structure. PDF Malware Slayer [36] uses the linearized path to specific PDF elements (e.g., /JS, /Page, etc.) to build maldoc classifiers. Srndic et al. [52] and Maiorca et al. [33] go one step further and also use the hierarchical structure for classification. PDFrate [46] includes another set of features such as the number of fonts, the average length of streams, etc., to improve detection. Maiorca et al. [34] focus on both JavaScript and metadata and fuse many of the above-mentioned heuristics into one procedure to improve evasion resiliency.

All methods are based on the assumption that the normal PDF element hierarchy is distorted during obfuscation and new paths are created that could not normally exist in benign documents. However, this assumption is challenged by two attacks. Mimicus [53] implements mimicry attacks and modifies existing maldocs to appear more like benign ones by adding empty structural and metadata items to the documents with no actual impact on rendering. The reverse mimicry [35] attack, on the contrary, attempts to embed malicious content into a benign PDF by taking care to modify it as little as possible.

2.2 Dynamic Techniques

All surveyed dynamic analysis techniques focus on embedded JavaScript code only instead of the entire document. MDScan [58] executes the extracted JavaScript code on a customized SpiderMonkey interpreter, and the interpreter's memory space is constantly scanned for known forms of shellcode or malicious opcode sequences. PDF Scrutinizer [45] takes a similar approach by hooking the Rhino interpreter and scanning for known malicious code patterns such as heap spray, shellcode, and vulnerable method calls. ShellOS [48] is a lightweight OS designed to run JavaScript code and record its memory access patterns. During execution, if the memory access sequences match a known malicious pattern (e.g., ROP, critical syscalls or function calls, etc.), the script is considered malicious.

Although these techniques are accurate in detecting malicious payload, they suffer from a common problem:
Figure 1: Using platform diversity to detect maldocs throughout the attack cycle. Italic text near × refers to the factors identified in §3.2 that can be used to detect such attacks. A dashed line means that certain attacks might survive after the detection.
1  var t = {};
2  t.__defineSetter__('doc', app.beginPriv);
3  t.__defineSetter__('user', app.trustedFunction);
4  t.__defineSetter__('settings', function() { throw 1; });
5  t.__proto__ = app;
6  try {
7    DynamicAnnotStore.call(t, null, f);
8  } catch(e) {}
9
10 f();
11 function f() {
12   app.beginPriv();
13   var file = '/c/notes/passwords.txt';
14   var secret = util.stringFromStream(
15     util.readFileIntoStream(file, 0)
16   );
17   app.alert(secret);
18   app.endPriv();
19 }

Figure 2: CVE-2014-0521 proof-of-concept exploitation

Besides being simple to construct, these attacks are generally available on both Windows and Mac platforms because of the cross-platform support of JavaScript. Therefore, the key to detecting these attacks via platform diversity is to leverage differences in system components such as filesystem semantics, expected installed programs, etc., and search for execution divergences when they are performing malicious activities. For example, line 15 will fail on Mac platforms in the example of Figure 2, as such a file path does not exist on Mac.

3.5 Platform-aware Exploitation

Given the difficulties of launching maldoc attacks on different platforms with the same payload, what an attacker can do is to first detect which platform the maldoc is running on through explicit or implicit channels and then launch attacks with a platform-specific payload.

In particular, the Adobe JavaScript API contains publicly accessible functions and object fields that could return different values when executed on different platforms. For example, app.platform returns WIN and MAC on the respective platforms. Doc.path returns the file path to the document opened, which can be used to check whether the document is opened on Windows or Mac by testing whether the returned path is prefixed with /c/.

Another way to launch platform-aware attacks is to embed exploits for two platform-specific vulnerabilities, each targeting one platform. In this way, regardless of which platform the maldoc is opened on, one exploit will be triggered and malicious activities can occur.

In fact, although platform-aware maldocs are rare in our sample collection, PLATPAL must be aware of these attack methods and exercise precautions to detect them. In particular, the possibility that an attacker can probe the platform first before launching the exploit implies that merely comparing external behaviors (e.g., filesystem operations or network activities) might not be sufficient, as the same external behaviors might be the result of different attacks. Without tracing the internal PDF processing, maldocs can easily evade PLATPAL's detection using platform-specific exploits, for example, by carrying multiple ROP payloads and dynamically deciding which payload to use based on the return value of app.platform, or even generating the ROP payload dynamically using techniques like JIT-ROP [49].

However, we do acknowledge that, given the complexity of the PDF specification, PLATPAL does not enumerate all possible platform-probing techniques. Therefore, PLATPAL could potentially be evaded through implicit channels we have not discovered (e.g., timing side-channels).

3.6 Platform-agnostic Exploitation

We also identified several techniques that can help "neutralize" the uncertainties caused by platform diversity, including but not limited to heap feng-shui, heap spray, and polyglot shellcode.

Heap feng-shui. By making consecutive heap allocations and de-allocations of carefully selected sizes, an attacker can systematically manipulate the layout of the heap and predict the address of the next allocation or
Figure 3: PLATPAL architecture. The suspicious file is submitted to two VMs with different platforms. During execution, both internal and external behaviors (syscalls, filesystem operations, network activities, and program executions on the Windows and Macintosh hosts) are traced and compared. Divergence in any behavior is considered a malicious signal.
The drawing stage starts by performing OpenActions specified by the document, if any. Almost all maldocs will register anything that could trigger their malicious payload in OpenActions for immediate exploitation upon document open. Subsequent drawing activities depend on the user's inputs, such as scrolling down to the next page, which triggers the rendering of that page. Therefore, in this stage, PLATPAL not only hooks the functions but also actively drives the document rendering component by component. Note that displaying content to the screen is a platform-dependent procedure and hence will not be hooked by PLATPAL, but the callbacks (e.g., an object is rendered) are platform-independent and will be traced.

In addition, for AAR, when the rendering engine performs a JavaScript action or draws a JavaScript-embedded form, the whole block of JavaScript code is executed. However, this also enables the platform detection attempts described in §3.5 and an easy escape of PLATPAL's detection. To avoid this, PLATPAL is designed to suppress the automatic block execution of JavaScript code. Instead, the code is tokenized to a series of statements that are executed one by one, and the results from each execution are recorded and subsequently compared. If the statement calls a user-defined function, that function is also executed step-wise.

Following is a summary of recorded traces at each step:
COS object parsing: PLATPAL outputs the parsing results of COS objects (both type and content).
PD tree construction: PLATPAL outputs every PD component with its type and hierarchical position in the PD tree.
Script execution: PLATPAL outputs every executed statement and the corresponding result.
Other actions: PLATPAL outputs every callback triggered during the execution of this action, such as a change of page views or visited URLs.
Element rendering: PLATPAL outputs every callback triggered during the rendering of the PDF element.

4.3 External System Impact

As syscalls are the main mechanism for a program to interact with the host platform, PLATPAL hooks syscalls and records both arguments and return values in order to capture the impact of executing a maldoc on the host system. However, for PLATPAL, a unique problem arises when comparing syscalls across platforms, as the syscall semantics on Windows and Mac are drastically different.

To ease the comparison of external behaviors across platforms, PLATPAL abstracts high-level activities from the raw syscall dumps. In particular, PLATPAL is interested in three categories of activities:
Filesystem operations: including files opened/created during the execution of the document, as well as file deletions, renames, linkings, etc.
Network activities: including the domain, IP address, and port of the remote socket.
External executable launches: including the execution of any programs after opening the document.

Besides behaviors constructed from the syscall trace, PLATPAL additionally monitors whether AAR exits gracefully or crashes during the opening of the document. We (empirically) believe that many typical malware activities such as stealing information, C&C, dropping backdoors, etc., can be captured in these high-level behavior abstractions. This practice also aligns with many automated malware analysis tools like Cuckoo [44] and CWSandbox [63], which also automatically generate a summary that sorts and organizes the behaviors of malware into a few categories. However, unlike these dynamic malware analysis tools that infer the maliciousness of the sample based on the sequence or hierarchy of these activities, the only indication of maliciousness for PLATPAL is that the set of captured activities differs across platforms. Another difference is that the summary generated by Cuckoo and CWSandbox usually requires manual interpretation to
Table 3: PLATPAL maldoc detection results grouped by the factor causing divergences. Note that for each sample, only one internal and one external factor is counted as the cause of divergence. E.g., if a sample crashes on Mac and does not crash on Windows, even if their filesystem activities are different, it is counted in the crash/no-crash category. The same rule applies to internal behaviors.
Lei Xue† , Yajin Zhou , Ting Chen†‡ , Xiapu Luo†∗, Guofei Gu§
† Department of Computing, The Hong Kong Polytechnic University
Unaffiliated
‡ Cybersecurity Research Center, University of Electronic Science and Technology of China
§ Department of Computer Science & Engineering, Texas A&M University
acquired (Line 39) for checking whether the SMS is sent from the controller (Tel: 6223**60). Only the SMS from the controller will be processed by the method procCMD() (Line 42). There are 5 types of commands, each of which leads to a special malicious behavior (i.e., Lines 13, 15, 17, 19 and 21). Reading contacts and parsing SMS are implemented in the JNI methods readContact() (Line 1) and parseMSG() (Line 2), respectively.

Existing malware analysis tools could not construct a complete view of the malicious behaviors. For example, when cmd equals 3 (Line 16), the IMSI is obtained by invoking the framework API getSubscriberId() (Line 7), and then leaked through the SMTP protocol (Line 9). Although existing tools (e.g., CopperDroid [73]) can find that the

Figure 1: The scenario of Malton. [Recovered labels: Malton; cross-layer data flows; comprehensive implementation features; behaviors.]

3.1 Overview

Malton helps security analysts obtain a complete view of malware samples under examination. To achieve this goal, Malton supports three major functionalities. First, due to the multi-layer nature of the Android system, Malton can capture malware behaviors at different layers. For instance, malware may conduct malicious activities by

haviors involve method invocations and data transmission at multiple layers. The challenging issue is how to effectively bridge the semantic gap when monitoring the ARM instructions.

Second, malware could leak private information by passing the data across multiple layers back and forth. Note that many framework APIs are JNI methods (e.g., String.concat(), String.toCharArray(), etc.), whose real implementations are in native code. Malton can detect such privacy leakage because it supports cross-layer information flow tracking (Section 3.5).

Third, since malware may conduct diverse malicious

[Architecture figure residue, recovered labels: Android Framework Layer (Oat file parser, Java method tracker, Java object parser); Android Runtime (ART) Layer (Java reflection, JNI reflection, JNI invocation, dynamic native code loading, dynamic Java code loading); System Layer (network, file, memory, and process operation monitors); Instruction Layer (in-memory concolic execution, taint propagation, direct execution).]
dynamically modifies its own codes through memory For Malton, there are 9 types IR statements related to
operations, like sys mmap(), sys protect(), etc., Malton the taint propagation, which are listed in Table 2. For
monitors such memory operations. the Ist WrTmp statement, since the source value may be
• Process operations. As malware often needs to fork the result of an IR expression, we also need to parse the
new process, or exits when the emulator or the debug logic of the IR expression for taint propagation. The IR
environment is detected, Malton captures such behav- expressions that can affect the taint propagation are sum-
iors by monitoring system calls relevant to the process marized in Table 3. During the execution of the target ap-
operations, including sys execve(), sys exit(), etc. p, Malton parses the IR statements and expressions in the
helper functions, and propagates the taint tags according
Moreover, Malton may need to modify the arguments
to the logic of the IR statements and expressions.
and/or the return values of system calls to explore code
Malton supports taint sources/sinks in different layers
paths. For example, the C&C server may have been shut
(i.e., the framework layer and the system layer). For ex-
down when malware samples are being analyzed. In
ample, Malton can take the arguments and results of both
this case, Malton replaces the results of the system call
Java methods and C/C++ methods as the taint sources,
sys connect() to success, or replaces the address of C&C
and check the taint tags of the arguments and the result-
with a bogus one controlled by the analyst to trigger ma-
s of sink methods. By default, at the framework layer,
licious payloads. We will discuss the techniques used to
11 types of information are specified as taint sources,
explore code paths in Section 3.6.
including device information (i.e., IMSI, IMEI, SN and
phone number), location information (i.e., GPS location,
3.5 Instruction Layer: Taint Propagation network location and last seen location) and personal
information (i.e., SMS, MMS, contacts and call logs).
At the instruction layer, Malton performs two major Malton also checks the taint tags of the arguments and
tasks, namely, taint propagation and path exploration. results when each framework method is invoked. In the
Note that accomplishing these tasks needs the semantic system layer, Malton takes system calls sys write() and
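The per-statement propagation rule sketched above can be modeled with a few lines of code. This is a minimal illustration, not Malton's implementation: it covers only a WrTmp-style and a Store-style statement plus a few expression kinds, and every name here (TaintState, eval_expr, the tuple-based IR encoding) is hypothetical.

```python
# Minimal sketch of IR-level taint propagation (not Malton's actual code).
# Taint tags are per-location sets; a statement's destination receives the
# union of the tags of every value its source expression reads.

class TaintState:
    def __init__(self):
        self.tmps = {}    # IR temporary -> set of taint tags
        self.mem = {}     # address -> set of taint tags

    def eval_expr(self, expr):
        """Return the taint tags carried by an IR expression."""
        kind = expr[0]
        if kind == "RdTmp":                     # read a temporary
            return self.tmps.get(expr[1], set())
        if kind == "Const":                     # constants carry no taint
            return set()
        if kind == "Binop":                     # e.g. Add32(e1, e2)
            _, _, e1, e2 = expr
            return self.eval_expr(e1) | self.eval_expr(e2)
        if kind == "Load":                      # read memory
            return self.mem.get(expr[1], set())
        return set()

    def exec_stmt(self, stmt):
        kind = stmt[0]
        if kind == "Ist_WrTmp":                 # tmp := expr
            _, tmp, expr = stmt
            self.tmps[tmp] = self.eval_expr(expr)
        elif kind == "Ist_Store":               # mem[addr] := expr
            _, addr, expr = stmt
            self.mem[addr] = self.eval_expr(expr)

state = TaintState()
state.mem[0x1000] = {"IMSI"}                    # taint source: IMSI bytes
state.exec_stmt(("Ist_WrTmp", "t1", ("Load", 0x1000)))
state.exec_stmt(("Ist_WrTmp", "t2", ("Binop", "Add32",
                                     ("RdTmp", "t1"), ("Const", 4))))
state.exec_stmt(("Ist_Store", 0x2000, ("RdTmp", "t2")))
print(state.mem[0x2000])                        # the tag follows the data
```

The key design point the sketch captures is that propagation is driven entirely by the structure of the IR: handling a new statement or expression kind only requires one more case in the interpreter.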
[Figure 4 graphic omitted: a step-numbered flow. CursorWrapper.getColumnIndex() and CursorWrapper.getString() retrieve the contact's display name and phone number (e.g., "+86 1471501**95"); String.replace() strips blanks and the "+86" prefix; SmsManager.sendTextMessage() sends the phishing message containing http://cdn.yyupload.com/down/4279193/XXshenqi.apk; and LoggingPrintStream.println() prints "send Message to Jeremy 1".]

Figure 4: Malton can help the analyst construct the complete flow of information leakage in the XXshenqi malware. The ellipses refer to function invocations, where the grey ellipses represent taint sources and the ellipses with bold lines denote taint sinks. The rectangles indicate data and red italic strings highlight the tainted information.
[…] were still active when Malton inspects the same samples. Hence, in the worst case, Malton's results may be penalized since the malware cannot receive commands.

Summary Compared with existing tools running in the emulator and monitoring malware behaviors in a single layer, Malton can capture more sensitive behaviors thanks to its on-device and cross-layer inspection.

4.2 Malware Analysis

To answer Q2, we evaluate Malton with sophisticated malware samples by constructing the complete flow of information leakage across different layers, detecting stealthy behaviors with Java/JNI reflection, dissecting the behaviors of packed Android malware, and identifying the malicious behaviors of hidden code.

4.2.1 Identify Cross-Layer Information Leakage

This experiment uses the sample in the XXShenqi [3] malware family, which is an SMS phishing malware with the package name com.example.xxshenqi. When the malware is launched, it reads the contact information and creates a phishing SMS message that will be sent to all the contacts collected. In this inspection, we focused on the behavior of creating and sending the phishing SMS message to the retrieved contacts by letting the contacts be the taint source and the methods for sending SMS messages be the taint sink. The detailed flow is illustrated in Figure 4.

To retrieve the information of each contact, the malware first obtains the column index and the value of the field id in step 1 and step 2 in Figure 4⁶, respectively. Then, a new instance of the class CursorWrapper is created based on id and uri (com.android.contact), and this contact's phone number is acquired through this instance. After that, blank characters and the national number ("+86") are removed from the retrieved phone number in steps 8 and 9. In the method String.replace()⁷, StringFactory.newStringFromString() and String.setCharAt() are invoked to create a new string according to the current string and to set the specified character(s) of the new string, respectively. These two methods are JNI functions and are implemented in the system layer. For String.setCharAt(), Malton can further determine the tainted portion of the string at the byte granularity. By contrast, TaintDroid does not support this functionality because, for JNI methods, it lets the taint tag of the whole return value be the union of the function arguments' taint tags. After that, a phishing SMS message is constructed from the display name of a retrieved contact and the phishing URL through steps 10-13. Finally, the phishing SMS is sent to the contact in step 14, and a message "send Message to Jeremy 1" is printed in step 15.

Summary By conducting the cross-layer taint propagation, Malton can help the analyst construct the complete flow of information leakage.

4.2.2 Detect Stealthy Malicious Behaviors

Some malware adopts Java/JNI reflection to hide its malicious behaviors. We use the sample in the photo3⁸ malware family to evaluate Malton's capability of detecting such stealthy behaviors. Figure 5 demonstrates the identified stealthy behaviors, which are completed in two different threads. The number in each ellipse and rectangle is the step index, and we use different colours (i.e.,

⁶ The number in each ellipse denotes the step index.
⁷ In /libcore/libart/src/main/java/java/lang/String.java.
⁸ md5: 8bd9f5970afec4b8e8a978f70d5e87ab
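The byte-granularity tracking described in Section 4.2.1, as opposed to TaintDroid's whole-value union for JNI methods, can be illustrated with a small model. This is a sketch under assumed semantics, not Malton's code; replace_tracked and the per-byte tag list are hypothetical.

```python
# Sketch: byte-granularity taint through a string edit.
# Hypothetical model, not Malton's implementation: taint is a list of
# per-byte tag sets edited in lockstep with the character buffer.

def replace_tracked(chars, taint, old, new):
    """Replace every occurrence of `old` with `new`; bytes copied from the
    input keep their tags, bytes introduced by `new` start untainted."""
    out_chars, out_taint = [], []
    i = 0
    while i < len(chars):
        if chars[i:i + len(old)] == list(old):
            out_chars += list(new)
            out_taint += [set() for _ in new]   # replacement text: no taint
            i += len(old)
        else:
            out_chars.append(chars[i])
            out_taint.append(taint[i])          # copied byte keeps its tags
            i += 1
    return out_chars, out_taint

number = list("+861471501**95")
tags = [{"CONTACTS"}] * len(number)             # whole number is tainted
chars, tags = replace_tracked(number, tags, "+86", "")
print("".join(chars))                           # "1471501**95"
print(all(t == {"CONTACTS"} for t in tags))     # True
```

A whole-value scheme would instead assign the union of the argument tags to the entire result, so it could not distinguish which bytes of the new string actually carry contact data.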
[Figure 5 graphic omitted: a step-numbered flow. TelephonyManager.getDeviceId() returns the device ID ("353490069945232"); the class android/telephony/SmsManager is defined via DefineClass() and initialized, and SmsManager.getDefault() returns the SmsManager object; Class.forName("android.telephony.SmsManager"), Class.getMethod("sendTextMessage"), and Method.invoke() lead to InvokeMethod() and SmsManager.sendTextMessage(), with Arg1 "15767549497" and a message body containing the device ID.]

Figure 5: Malton can detect stealthy behaviors through the Java/JNI reflection of the photo3 malware. The ellipses refer to function invocations in the framework layer, where the grey ellipses represent taint sources and the ellipses with bold lines denote taint sinks. The round corner rectangles stand for function invocations at the runtime layer. Other rectangles indicate data and red italic strings highlight the tainted information.
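Why correlating the reflection APIs in Figure 5 reveals the hidden call target can be sketched as follows. The event-log format and the resolve_reflective_calls helper are hypothetical, not Malton's output format; the point is only that linking forName, getMethod, and invoke records recovers the concrete method behind a reflective call.

```python
# Sketch: resolving the real target of a reflective call from hooked events.
# Hypothetical event log: each entry is (api, args, return_value).

events = [
    ("Class.forName",   ("android.telephony.SmsManager",), "Class@130f5c08"),
    ("Class.getMethod", ("Class@130f5c08", "sendTextMessage"), "Method@12e482c8"),
    ("Method.invoke",   ("Method@12e482c8", "Object@12eac048"), None),
]

def resolve_reflective_calls(events):
    classes, methods, calls = {}, {}, []
    for api, args, ret in events:
        if api == "Class.forName":
            classes[ret] = args[0]                  # class handle -> name
        elif api == "Class.getMethod":
            cls_handle, name = args
            methods[ret] = (classes.get(cls_handle, "?"), name)
        elif api == "Method.invoke":
            calls.append(methods.get(args[0], ("?", "?")))
    return calls

print(resolve_reflective_calls(events))
# [('android.telephony.SmsManager', 'sendTextMessage')]
```

Without this correlation, a monitor that only sees Method.invoke() observes an opaque method handle and misses that an SMS-sending API is actually being called.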
blue and red) for the numbers to distinguish the two threads. The execution paths are denoted by both the solid lines and dashed lines, and the solid lines further indicate how the information is leaked. We describe the identified malicious behaviors as follows.

• The device ID is returned by the method TelephonyManager.getDeviceId() in step 1 and step 2.

• A new thread is created to send the collected information to the malware author. In step 3, a memory area is allocated by the system call sys_mmap(), and the thread method run() is invoked by the runtime through the JNI reflection function InvokeVirtualOrInterfaceWithJValues() in step 4. Next, the class android/telephony/SmsManager is defined and initialized in step 5 and step 6. In step 7, the SmsManager object is obtained through the static method SmsManager.getDefault().

• The malware sends SMS messages through Java reflection. Specifically, in step 8, the malware obtains the object of the android.telephony.SmsManager class through the Java reflection method Class.forName(). Then, it retrieves the method object of sendTextMessage() using the Java reflection method Class.getMethod() in step 9. Finally, it calls the Java method sendTextMessage() in step 10. This invocation goes to the method InvokeMethod() in the ART runtime layer in step 11.

Summary Malton can identify malware's stealthy behaviors through Java/JNI reflection in different layers.

4.2.3 Dissect Packed Android Malware's Behaviors

Since Malton stores the collected information into log files, we can dissect the behaviors of packed Android malware by analyzing the log files. As an example, Figure 6 shows a partial log file of analyzing the packed malware sample⁹, and Figure 7 illustrates the identified malicious behaviors of this sample. Such behaviors can be divided into two parts. One is related to the original packed malware (Lines 1-21), and the other one is relevant to the hidden payloads of the malware (Lines 22-30).

Once the malware is started, the class com.netease.nis.wrapper.MyApplication is loaded for preparing the real payload (Line 2). Then, the Android framework API Application.attach() is invoked (Line 4) to set the property of the app context. After that, the malware calls the Java method System.loadLibrary() to load its native component libnesec.so at Line 7. Malton empowers us to observe that the ART runtime invokes the function FindClass() (Line 8) and the function LoadNativeLibrary() (Line 9) to locate the class com.netease.nis.wrapper.MyJni and load the library libnesec.so, respectively.

After initialization, the malware calls the JNI method MyJni.load() to release and load the hidden Dalvik bytecode into memory. More precisely, the package name is first obtained through JNI reflection (Lines 11 and 12). Then, the hidden bytecode is written into the file ".cache/classes.dex" under the app's directory (Lines 13 and 14). After that, a new DexFile object is initialized based on the newly created Dex file through the runtime function DexFile::OpenMemory() (Line 16).

We also find that the packed malware registers an Intent receiver to handle the Intent com.zjdroid.invoke at Lines 19 and 21. Note that ZjDroid [9] is a dynamic unpacking tool based on the Xposed framework and is started by the Intent com.zjdroid.invoke. By registering the Intent receiver, the malware can detect the existence of ZjDroid.

Finally, the app loads and initializes the class v.v.v.MainActivity in Lines 23 to 26, and the hidden malicious payloads are executed at Line 29. To hide itself, the

⁹ md5: 03b2deeb3a30285b1cf5253d883e5967
Figure 6: The major information collected by Malton at the function level. The names of Android runtime functions and system calls are in black italics. We omit the information of method arguments due to the space limitation.
malware also calls the framework method Activity.finish() to destroy its activity (Line 30).

Summary Malton can analyze sophisticated packed malware samples, and help the analyst identify the behaviors of both the malware and its hidden code.

[Figure 7 graphic omitted: the reconstructed behavior graph. The class com.netease.nis.wrapper.MyApplication gets the IMEI of the device, creates a new thread that leaks the IMEI through the network (malicious behavior 1), loads the native component libnesec.so, calls the JNI method MyJni.load(), gets the package name through a JNI reflection invocation, creates the new dex file classes.dex, loads it and initializes a new DexFile object, and registers a receiver for the Intent "com.zjdroid.invoke" (anti dynamic analysis). The class v.v.v.MainActivity gets the IMEI of the device and writes it into disk through the interfaces of the class android.app.SharedPreferences (malicious behavior 2).]

Figure 7: Malton can reconstruct the behaviors of the packed malware and its hidden code.

4.3 Path Exploration

To answer Q3, we first employ Malton to analyze the SMS handler of the packed malware com.nai.ke. From the logs, we find that its SMS handler handleReceiver() processes each incoming SMS by obtaining its address and content through the methods getOriginatingAddress() and getMessageBody(), respectively. If the SMS is not from the controller (i.e., Tel: 1851130**14), it calls the method abortBroadcast() to abort the current broadcast.

Effectiveness To explore all the malicious payloads controlled by the received SMS message, we specify the code region between the return of the function getMessageBody() and the return of handleReceiver() to perform in-memory concolic execution. We set the result of getMessageBody() (i.e., the SMS content) as the input of the concolic execution. To circumvent the checking of the phone number of the received SMS message, we trigger the malware to execute the satisfied code path by changing the result of getOriginatingAddress() to the number of the controller.

However, we find that the constraint resolver cannot always find the satisfying input due to the comparison of two strings' hash values. Therefore, we use the direct execution engine to force the malware to execute the selected code path. Eventually, we identify 14 different code paths (or behaviors) that depend on the content of the received SMS. The generated inputs and their corresponding behaviors are listed in Table 5. This result demonstrates the effectiveness of Malton in exploring different code paths.

Efficiency Thanks to the in-memory optimization, when exploring code paths in the interested code region, Malton just needs one SMS and then iteratively executes the specified code region 14 times without the need to restart the app 14 times. To evaluate the efficiency of the in-memory optimization, we record the number of IR blocks to be executed for exploring each code path with/without the in-memory optimization, and list them in Table 5 (the last column). The result shows that the in-memory optimization can avoid executing a large number of IR blocks.

We also use five other malware samples, which have the SMS handler, to further evaluate the efficiency of the path exploration module. The average numbers of IR blocks to be executed with and without the in-memory optimization are listed in Table 6. The in-memory optimization can obviously reduce the number of IR blocks to be executed.

Table 6: The average number of IR blocks executed with and without the in-memory optimization.
Malware (md5)                      With Optimization   Without Optimization
0710ef0ee60e1acfd2817988672bf01b   203k                26237k
0ced776e0f18ddf02785704a72f97aac   203k                26010k
0e69af88dcbb469e30f16609b10c926c   4k                  16826k
336602990b176cf381d288b79680e4f6   13k                 1908k
8e1c7909aed92eea89f6a14e0f41503d   7k                  69968k

Summary The path exploration module of Malton can explore code paths of malicious payloads effectively and efficiently. The concolic execution engine generates the satisfying inputs to execute certain code paths, and the direct execution engine forcibly executes selected code paths when the constraint resolver fails.

4.4 Performance Overhead

To understand the overhead introduced by Malton, we run the benchmark tool CF-Bench [8] 30 times on a Nexus 5 smartphone running Android 6.0 under four different environments, including Android without Valgrind, Android with Valgrind, and Malton with and without the taint propagation. To compare with the dynamic analysis tools based on the Qemu emulator, we also execute CF-Bench in Qemu, which runs on a Ubuntu 14.04 desktop equipped with a Core(TM) i7 CPU and 32G memory.

[Figure 8 graphic omitted: CF-Bench scores, on a log scale, for Android without Valgrind, Android with Valgrind, Malton without taint propagation, Malton with taint propagation, and Android on the emulator (Qemu).]

The results are shown in Figure 8. There are three types of scores. The Java score denotes the performance of Java operations, and the native score indicates the performance of native code operations. The overall scores are calculated based on the Java and the native scores. A higher score means better performance.

Figure 8 illustrates that Malton introduces around 16x and 36x slowdown to the Java operations without and with taint propagation, respectively. However, when the app runs with only Valgrind, there is also an 11x slowdown. It means that Malton brings 1.5x-3.2x additional slowdown over Valgrind. Similarly, for the native operations, Malton introduces 1.7x-2.3x additional slowdown when running with Valgrind alone is taken as the baseline. Overall, Malton introduces around 25x slowdown (with taint propagation). Since the Qemu [57] emulator incurs around 15x overall slowdown, and the Qemu-based tools (e.g., the taint tracker of DroidScope [83]) may incur 11x-34x additional slowdown, Malton could be more efficient than the existing tools based on Qemu.

Summary As a dynamic analysis tool, Malton has reasonable performance and could be more efficient than the existing tools based on the Qemu emulator.
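The additional-slowdown range quoted for the Java operations follows from the raw factors by simple division; a quick check (the upper bound rounds to 3.3x here, close to the quoted 3.2x):

```python
# Reproduce the additional-slowdown ratios from the raw slowdown factors
# (Java operations, relative to unmodified Android), as quoted in the text.
valgrind_only = 11.0          # Android with Valgrind alone
malton_no_taint = 16.0        # Malton without taint propagation
malton_taint = 36.0           # Malton with taint propagation

low = malton_no_taint / valgrind_only
high = malton_taint / valgrind_only
print(f"{low:.1f}x - {high:.1f}x additional slowdown over Valgrind")
```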
OS-level and Java-level semantics, and exports APIs for building specific analysis tools, such as dynamic information tracers. Hence, there is a semantic gap between the VMI observations and the reconstructed Android-specific behaviors. Since it monitors the Java-level behaviors by tracing the execution of Dalvik instructions, it cannot monitor the Java methods that are compiled into native code and run on ART (i.e., partial support of the framework layer). Moreover, DroidScope does not monitor JNI and therefore cannot capture the complete behaviors at the runtime layer. CopperDroid [73] is also built on top of Qemu and records system call invocations by instrumenting Qemu. Since it performs binder analysis to reconstruct the high-level Android-specific behaviors, only a limited number of behaviors can be monitored. Moreover, it cannot identify the invocations of framework methods. ANDRUBIS [50] and MARVIN [49] (which is built on top of ANDRUBIS) monitor the behaviors at the framework layer by instrumenting the DVM, and log system calls through VMI.

Monitoring system calls [17, 35, 46, 48, 54, 68, 75, 76, 84, 91] is widely used in Android malware analysis because considerable numbers of APIs in upper layers eventually invoke system calls. For instance, Dagger [84] collects system calls through strace [6], records binder transactions via sysfs [7], and accesses process details from the /proc file system. One common drawback of system-call-based techniques is the semantic gap between system calls and the behaviors of upper layers, even though several studies [54, 84, 91] try to reconstruct high-level semantics from system calls. Besides tracing system calls, MADAM [35] and ProfileDroid [76] monitor the interactions between the user and the smartphone. However, they cannot capture the behaviors in the runtime layer.

Both TaintART [71] and ARTist [21] are new frameworks to propagate the taint information in ART. They modify the tool dex2oat, which is provided along with the ART runtime to turn Dalvik bytecode into native code during an app's installation. The taint propagation instructions will be inserted into the compiled code by the modified dex2oat. However, they only propagate taint at the runtime layer, and do not support taint propagation through JNI or in native code. Moreover, they cannot handle packed malware, because such malware usually loads the Dalvik bytecode into the runtime dynamically and directly, without triggering the invocation of dex2oat. CRePE [29] and DroidTrack [64] track apps' behaviors at the framework layer by modifying the Android framework.

Boxify [20] and NJAS [24] are app sandboxes that encapsulate untrusted apps in a restricted execution environment within the context of another trusted sandbox app. Since they behave as a proxy for all system calls and binder channels of the isolated apps, they support the analysis of native code and could reconstruct partial framework-layer behaviors.

ARTDroid [30] traces framework methods by hooking the virtual framework methods and supports ART. Since the boot image boot.art contains both the vtable and the virtual-methods arrays that store the pointers to virtual methods, ARTDroid hijacks the vtable and virtual methods to monitor the APIs invoked by malware.

HARVESTER and GroddDroid [10, 61] support multi-path analysis. The former [61] covers interested code forcibly by replacing conditionals with simple Boolean variables, while the latter [10] uses a similar method to jump to interested code by replacing conditional jumps with unconditional jumps. Different from Malton, they need to modify the bytecode of malware.
[18] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C. Siemens. Drebin: Effective and explainable detection of android malware in your pocket. In Proc. NDSS, 2014.

[19] E. Athanasopoulos, V. Kemerlis, G. Portokalidis, and A. Keromytis. Nacldroid: Native code isolation for android applications. In Proc. ESORICS, 2016.

[20] M. Backes, S. Bugiel, C. Hammer, O. Schranz, and P. von Styp-Rekowsky. Boxify: Full-fledged app sandboxing for stock android. In Proc. USENIX Security, 2015.

[21] M. Backes, S. Bugiel, O. Schranz, P. von Styp-Rekowsky, and S. Weisgerber. Artist: The android runtime instrumentation and security toolkit. arXiv preprint arXiv:1607.06619, 2016.

[22] M. Backes, S. Gerling, C. Hammer, M. Maffei, and P. von Styp-Rekowsky. Appguard: Enforcing user requirements on android apps. In Proc. TACAS, 2013.

[23] P. Berthome, T. Fecherolle, N. Guilloteau, and J. Lalande. Repackaging android applications for auditing access to private data. In Proc. ARES, 2012.

[24] A. Bianchi, Y. Fratantonio, C. Kruegel, and G. Vigna. Njas: Sandboxing unmodified applications in non-rooted devices running stock android. In Proc. SPSM, 2015.

[25] C. Cadar, D. Dunbar, and D. Engler. Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proc. OSDI, 2008.

[26] S. Cha, T. Avgerinos, A. Rebert, and D. Brumley. Unleashing mayhem on binary code. In Proc. IEEE SP, 2012.

[27] T. Chen, X. Zhang, S. Guo, H. Li, and Y. Wu. State of the art: Dynamic symbolic execution for automated test generation. Future Generation Computer Systems, 29(7), 2013.

[28] T. Chen, X. Zhang, X. Ji, C. Zhu, Y. Bai, and Y. Wu. Test generation for embedded executables via concolic execution in a real environment. IEEE Transactions on Reliability, 64(1), 2015.

[30] V. Costamagna and C. Zheng. Artdroid: A virtual-method hooking framework on android art runtime. In Proc. ESSoS Workshop IMPS, 2016.

[31] S. Dai, T. Wei, and W. Zou. Droidlogger: Reveal suspicious behavior of android applications via instrumentation. In Proc. ICCCT, 2012.

[32] B. Davis, B. Sanders, A. Khodaverdian, and H. Chen. I-arm-droid: A rewriting framework for in-app reference monitors for android applications. Mobile Security Technologies, 2012.

[33] L. De Moura and N. Bjørner. Z3: An efficient smt solver. In Proc. TACAS, 2008.

[34] A. Desnos and P. Lantz. Droidbox: An android application sandbox for dynamic analysis. https://goo.gl/iWYL9B, 2014.

[35] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra. Madam: A multi-level anomaly detector for android malware. In Proc. MMM-ACNS, 2012.

[36] T. Dong and M. Zhang. Five ways android malware is becoming more resilient. https://goo.gl/7ZPWnJ, 2016.

[37] D. Earl and B. VonHoldt. Structure harvester: A website and program for visualizing structure output and implementing the evanno method. Conservation Genetics Resources, 4(2), 2012.

[38] W. Enck, P. Gilbert, B. Chun, L. Cox, J. Jung, P. McDaniel, and A. Sheth. Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones. In Proc. USENIX OSDI, 2010.

[39] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B. Chun, L. Cox, J. Jung, P. McDaniel, and A. Sheth. Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems, 32(2), 2014.

[40] Y. Fratantonio, A. Bianchi, W. Robertson, E. Kirda, C. Kruegel, and G. Vigna. Triggerscope: Towards detecting logic bombs in android applications. In Proc. IEEE SP, 2016.

[41] S. Hao, B. Liu, S. Nath, W. Halfond, and R. Govindan. Puma: Programmable ui-automation for large-scale dynamic analysis of mobile apps. In Proc. MobiSys, 2014.
[43] C. Jeon, W. Kim, B. Kim, and Y. Cho. Enhancing security enforcement on unmodified android. In Proc. SAC, 2013.

[44] Y. Jing, Z. Zhao, G. Ahn, and H. Hu. Morpheus: Automatically generating heuristics to detect android emulators. In Proc. ACSAC, 2014.

[45] M. Kang, S. McCamant, P. Poosankam, and D. Song. Dta++: Dynamic taint analysis with targeted control-flow propagation. In Proc. NDSS, 2011.

[46] M. Karami, M. Elsabagh, P. Najafiborazjani, and A. Stavrou. Behavioral analysis of android applications using automated instrumentation. In Proc. SERE-C, 2013.

[47] D. Kirat and G. Vigna. Malgene: Automatic extraction of malware analysis evasion signature. In Proc. ACM CCS, 2015.

[48] Y. Lin, Y. Lai, C. Chen, and H. Tsai. Identifying android malicious repackaged applications by thread-grained system call sequences. Computers & Security, 39, 2013.

[49] M. Lindorfer, M. Neugschwandtner, and C. Platzer. Marvin: Efficient and comprehensive mobile app classification through static and dynamic analysis. In Proc. COMPSAC, 2015.

[50] M. Lindorfer, M. Neugschwandtner, L. Weichselbaum, Y. Fratantonio, V. Van Der Veen, and C. Platzer. Andrubis–1,000,000 apps later: A view on current android malware behaviors. In Proc. Workshop BADGERS, 2014.

[51] K. Lu, Z. Li, V. P. Kemerlis, Z. Wu, L. Lu, C. Zheng, Z. Qian, W. Lee, and G. Jiang. Checking more and alerting less: Detecting privacy leakages via enhanced data-flow analysis and peer voting. In Proc. NDSS.

[52] S. Malek, N. Esfahani, T. Kacem, R. Mahmood, N. Mirzaei, and A. Stavrou. A framework for automated security testing of android applications on the cloud. In Proc. SERE-C, 2012.

[53] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proc. ACM PLDI, 2007.

[54] X. Pan, Y. Zhongyang, Z. Xin, B. Mao, and H. Huang. Defensor: Lightweight and efficient

[55] F. Peng, Z. Deng, X. Zhang, D. Xu, Z. Lin, and Z. Su. X-force: Force-executing binary programs for security applications. In Proc. USENIX Security, 2014.

[56] T. Petsas, G. Voyatzis, E. Athanasopoulos, M. Polychronakis, and S. Ioannidis. Rage against the virtual machine: Hindering dynamic analysis of android malware. In Proc. ACM EuroSys, 2014.

[57] QEMU. https://goo.gl/EUgpkB.

[58] C. Qian, X. Luo, Y. Le, and G. Gu. Vulhunter: Toward discovering vulnerabilities in android applications. IEEE Micro, 35(1), 2015.

[59] C. Qian, X. Luo, Y. Shao, and A. Chan. On tracking information flows through JNI in android applications. In Proc. IEEE/IFIP DSN, 2014.

[60] S. Rasthofer, S. Arzt, E. Lovat, and E. Bodden. Droidforce: Enforcing complex, data-centric, system-wide policies in android. In Proc. ARES, 2014.

[61] S. Rasthofer, S. Arzt, M. Miltenberger, and E. Bodden. Harvesting runtime values in android applications that feature anti-analysis techniques. In Proc. NDSS, 2016.

[62] V. Rastogi, Y. Chen, and W. Enck. Appsplayground: Automatic security analysis of smartphone applications. In Proc. CODASPY, 2013.

[63] A. Sadeghi, H. Bagheri, J. Garcia, and S. Malek. A taxonomy and qualitative comparison of program analysis techniques for security assessment of android software. IEEE Transactions on Software Engineering, 43(6), 2016.

[64] S. Sakamoto, K. Okuda, R. Nakatsuka, and T. Yamauchi. Droidtrack: Tracking and visualizing information diffusion for preventing information leakage on android. Journal of Internet Services and Information Security, 4(2), 2014.

[65] D. Schreckling, J. Köstler, and M. Schaff. Kynoid: Real-time enforcement of fine-grained, user-defined, and data-centric security policies for android. Information Security Technical Report, 17(3), 2013.

[66] J. Schütte, R. Fedler, and D. Titze. Condroid: Targeted dynamic analysis of android applications. In Proc. AINA, 2015.
and independent verifiability metrics. We say that a response is correct if it satisfies any consistency or independent verifiability metric; otherwise, we classify the response as manipulated. In this section, we outline each class of metrics as well as the specific features we develop to classify answers. The rest of this section defines these metrics; §5.1 explores the efficacy of each of them.

3.5.1 Consistency

Access to a domain should have some form of consistency, even when accessed from various global vantage points. This consistency may take the form of network properties, infrastructure attributes, or even content. We leverage these attributes, both in relation to control data as well as across the dataset itself, to classify DNS responses.

Consistency Baseline: Control Domains and Resolvers. Central to our notion of consistency is having a set of geographically diverse resolvers we control that are (presumably) not subject to manipulation. These controls give us a set of high-confidence correct answers we can use to identify consistency across a range of IP address properties. Geographic diversity helps ensure that area-specific deployments do not cause false positives. For example, several domains in our dataset use different content distribution network (CDN) hosting infrastructure outside North America. As part of our measurements we insert domain names we control, with known correct answers. We use these domains to ensure a resolver reliably returns unmanipulated results for non-sensitive content (e.g., not a captive portal).

For each domain name, we create a set of consistency metrics by taking the union of each metric across all of our control resolvers. For example, if Control A returns the answers 192.168.0.10 and 192.168.0.11 and Control B returns 192.168.0.12, we create a set of consistent IP addresses (192.168.0.10, 192.168.0.11, 192.168.0.12). We say the answer is "correct" (i.e., not manipulated) if, for each metric, the answer is a non-empty subset of the controls. Returning to our IP example, if a global resolver returns the answer (192.168.0.10, 192.168.0.12), it is identified as correct. When a request returns multiple records, we check all records and consider the reply good if any response passes the appropriate tests.

Additionally, unmanipulated passive DNS [6] data collected simultaneously with our experiments across a geographically diverse set of countries could enhance (or replace) our consistency metrics. Unfortunately, we are not aware of such a dataset being available publicly.

IP Address. The simplest consistency metric is the IP address or IP addresses that a DNS response contains.

Autonomous System / Organization. In the case of geographically distributed sites and services, such as those hosted on CDNs, a single domain name may return different IP addresses as part of normal operation. To attempt to account for these discrepancies, we also check whether different IP addresses for a domain map to the same AS we see when issuing queries for the domain name through our control resolvers. Because a single AS may have multiple AS numbers (ASNs), we consider two IP addresses with either the same ASN or AS organization name as being from the same AS. Although many responses will exhibit AS consistency even if individual IP addresses differ, even domains whose queries are not manipulated will sometimes return inconsistent AS-level and organizational information as well. This inconsistency is especially common for large service providers whose infrastructure spans multiple regions and continents and is often the result of acquisitions. To account for these inconsistencies, we need additional consistency metrics at higher layers of the protocol stack (specifically HTTP and HTTPS), described next.
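The union-of-controls subset test described above can be sketched as follows. This is an illustration of the stated rule, not the measurement system's code; the function and data names are hypothetical, and the metric extractors are reduced to plain sets of values.

```python
# Sketch of the consistency check: a response is "correct" if, for each
# metric, its values form a non-empty subset of the union of the values
# observed from the control resolvers for that metric.

def consistent(response_metrics, control_metrics_list):
    """response_metrics: dict metric -> set of values for one response.
    control_metrics_list: one such dict per control resolver."""
    for metric, values in response_metrics.items():
        allowed = set()
        for control in control_metrics_list:       # union across controls
            allowed |= control.get(metric, set())
        if not values or not values <= allowed:    # non-empty subset test
            return False
    return True

controls = [
    {"ip": {"192.168.0.10", "192.168.0.11"}, "asn": {15169}},
    {"ip": {"192.168.0.12"}, "asn": {15169}},
]

# Matches the example above: answer (192.168.0.10, 192.168.0.12) is correct.
print(consistent({"ip": {"192.168.0.10", "192.168.0.12"}}, controls))  # True
print(consistent({"ip": {"203.0.113.7"}}, controls))                   # False
```

Taking the union across controls before testing is what tolerates CDN-style variation: each control may legitimately see a different answer, and a response only needs to fall within the combined set.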
Table 6: Top 10 countries by median percent of manipulated responses per resolver. We additionally provide the mean, maximum, and minimum percent for resolvers in each country. The number of resolvers per country is listed with the country name.

Country (resolvers)   Median   Mean    Max      Min
Iran (122)            6.02%    5.99%   22.41%   0.00%
China (62)            5.22%    4.59%   8.40%    0.00%

Figure 4: The fraction of responses manipulated, per resolver. For 89% of resolvers, we observed no manipulation.
Table 8: Percent of public IP addresses in manipulated responses, by country. Countries are sorted by overall frequency of manipulation.

Indonesia (80)     2.74%   95.08%
Iraq (7)           1.68%   1.49%
New Zealand (16)   1.59%   100.00%
Turkey (192)       0.84%   99.81%
Romania (45)       0.77%   100.00%
Kuwait (10)        0.61%   0.00%
Greece (26)        0.41%   100.00%
Cyprus (5)         0.40%   100.00%

Figure 7: The fraction of resolvers within a country that manipulate each domain.
[Figure: proportion of answers vs. unique answer, sorted by prevalence (log scaled).]

[Figure 8: CDF, proportion of domain names vs. number of countries.]
Heterogeneity across a country may suggest a situation where different ISPs implement filtering with different block lists; it might also indicate variability across geographic region within a country. The fact that manipulation rates vary even among resolvers in a certain group within a country may indicate either probabilistic manipulation, or the injection of manipulated responses (a phenomenon that has been documented before [5]). Other more benign explanations exist, such as corporate firewalls (which are common in the United States), or localized manipulation by resolver operators.

Ceilings on the percent of resolvers within a country performing manipulation, such as no domain in China experiencing manipulation across more than approximately 85% of resolvers, suggest IP geolocation errors are common.

5.4 Commonly Manipulated Domains

Commonly manipulated domains across countries. Many domains experienced manipulation across a range of countries. Figure 8 shows a CDF of the number of countries (or dependent territories) for which at least one resolver manipulated each domain. 30% of domains were manipulated in only a single country, while 70% were manipulated in 5 or fewer countries. No domain was manipulated in more than 19 countries.

Table 9 highlights domains that experience manipulation in many countries (or dependent territories). The 2 most manipulated domains are both gambling websites, each experiencing censorship across 19 different countries. DNS resolutions for pornographic websites are similarly manipulated, accounting for the next 3 most commonly affected domains. Peer-to-peer file sharing
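A per-domain count like the one underlying Figure 8 can be computed directly from per-resolver results. The sketch below assumes a hypothetical list of (domain, country) pairs recording observed manipulations; it is an illustration, not the authors' code.

```python
from collections import defaultdict

def countries_per_domain(observations):
    """Map each domain to the number of countries in which at least one
    resolver manipulated a response for it.
    `observations` is an iterable of (domain, country) pairs."""
    seen = defaultdict(set)
    for domain, country in observations:
        seen[domain].add(country)
    return {d: len(cs) for d, cs in seen.items()}

def cdf(counts):
    """Fraction of domains manipulated in <= n countries, for each n."""
    total = len(counts)
    values = sorted(counts.values())
    return [(n, sum(1 for v in values if v <= n) / total)
            for n in range(1, max(values) + 1)]
```

For example, two domains manipulated in {IR, CN} and {IR} respectively yield the CDF points (1, 0.5) and (2, 1.0).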
jahjah exit was started in 2015 but the operator was only able to provide complaints received from 2016 onwards.

4.2 Analysis

We extract the nature of abuse, the time of complaint, and the associated exit IP addresses. 99.7% of the complaints received by Torservers.net related to Digital Millennium Copyright Act (DMCA) violations, with over 1 million of the complaints sent by one IP address. These emails use a template, enabling parsing of email text with regular expressions. The majority of the non-DMCA emails also follow a template, but the structure varied across a large number of senders. To extract the relevant abuse information from non-DMCA complaint emails, we first applied KMeans clustering to identify similar emails. We manually crafted regular expressions for each cluster. We used these regular expressions to assign high-confidence labels to emails. Not all emails matched such a template regular expression—e.g., one-off complaints sent by individuals. We classified these emails by looking for keywords related to types of abuse. We iteratively refined this process until manually labeled random samples showed the approach to be quite accurate, with only 2% cases of misidentification.

99.99% of all DMCA violation complaints were against the Torservers' exits. The other exits collectively received only 12 such complaints. Over 99% of DMCA complaints mentioned the usage of BitTorrent for infringement; the rest highlighted the use of eDonkey.

We categorized the non-DMCA complaints into five broad categories enumerated in Table 3. Network abuse is the most frequent category of non-DMCA complaints. ≈15% of the complaints related to network abuse came from Icecat [24], a publisher of e-commerce statistics and product content. Icecat's emails complain about excessive connection attempts from the jahjah exit and 13 exit IP addresses hosted by Torservers. These emails were received from November 2011 until December 2012. Other exit operators also received similar complaints during the same time frame [25]. We checked a recent Tor consensus in November 2016 and found eight exits that avoided exiting to the Icecat IP address.

The second most common non-DMCA complaints are about automated scans and bruteforce login attacks on Wordpress. Automated scanning, specifically port and vulnerability scanning, accounts for 13.6% of the non-DMCA complaints across the entire time range of our dataset. Instances of Wordpress bruteforce login attacks lasted for a comparatively shorter period, September 2015 until May 2016, but constitute 12.1% of non-DMCA complaints. All of the exits in our dataset received complaints about bruteforce login attempts from Wordpress except our own exits, probably because we started our exits after the attack stopped.

Email, comment, and forum spam constitutes 9.01% of non-DMCA complaints. Note that all of the exits in question have the SMTP port 25 blocked. Our data shows a spike in the number of abuse complaints regarding referrer spam from Semalt's bots [26] towards the end of 2016. 1.05% of the non-DMCA emails complain about harassment. Over 11% of the non-DMCA emails do not fall under the mentioned categories. These emails include encrypted emails and emails unrelated to abuse.

4.3 Consequences of Undesired Traffic

Along with the complaints, some emails mention the steps the sender will take to minimize abuse from Tor. 34.3% of the emails mentioned temporary blocking (19.8%), permanent blocking (0.2%), blacklisting (9.8%) or other types of blocking (4.6%). The rest of the emails notify the exit operators about the abuse. For the most frequent form of blocking, temporary blocking, the emails threaten durations ranging from 10 minutes to a week. Some companies (e.g., Webiron) maintain different blacklists depending on the severity of the abuse. The majority of the blacklists mentioned in the emails are either temporary or unspecified. Only 18 emails mentioned permanently blocking the offending IP address. A small fraction (less than 1%) of the emails ask exit operators to change the exit policy to disallow exiting to the corresponding website.

We did not find any complaint emails from known Tor discriminators, such as Cloudflare and Akamai. Among the websites we crawled to quantify discrimination against Tor, we found complaints from Expedia and Zillow. Expedia complained about an unauthorized and excessive search of Expedia websites and asked exit operators to disallow exiting to the Expedia website. Zillow's complaint was less specific, about experiencing traffic in violation of their terms and conditions.

Table 4: Avg. simultaneous users and complaints per day

Exit Family   Avg. users   Avg. complaints
apx           23,082.08    0.53
TorLand1      59,284.54    0.17
jahjah        1,206.30     0.19
Our exits     3,050.81     5.55

5 Effect of Exit Policy and Bandwidth

We estimate the average number of simultaneous Tor users per day through the exits getting complaints and

[Footnote fragment] an exit, the other factors such as the exit policy might affect which exits will be selected. For our estimation, we do not consider the effect of exit policies.
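The two-stage labeling described in Section 4.2 (template regexes for clustered emails, keyword matching as a fallback) can be sketched as below. The patterns and keywords are illustrative stand-ins, not the authors' actual configuration; the offline KMeans clustering step that guided regex crafting is omitted.

```python
import re

# Illustrative template patterns (stand-ins). In the paper, clusters of
# similar emails were identified first, and a regex was crafted per cluster.
TEMPLATES = {
    "dmca": re.compile(r"Digital Millennium Copyright Act|DMCA", re.I),
    "scan": re.compile(r"(port|vulnerability) scan", re.I),
    "bruteforce": re.compile(r"brute-?force|failed login", re.I),
}

# Keyword fallback for one-off complaints that match no template.
KEYWORDS = {"spam": "spam", "harassment": "harassment"}

def label_email(text):
    """Return (label, method) for a complaint email body."""
    for label, pattern in TEMPLATES.items():
        if pattern.search(text):
            return label, "template"   # high-confidence label
    low = text.lower()
    for word, label in KEYWORDS.items():
        if word in low:
            return label, "keyword"    # lower-confidence fallback label
    return "other", "unmatched"
```

Emails labeled "unmatched" correspond to the residual category that the authors report does not fall under the five broad categories.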
[Figure 2 panels: (a) Paid Aggregator (proactive); (b) Contributed Blacklist 12 (reactive); (c) fraction of IPs blacklisted, for University IPs, VPNGate, HMA VPN, and Tor Exits.]

Figure 2: (a) and (b) provide the time (in hours) between first seeing a relay IP address in the consensus until the given blacklist enlists the IP address. Negative values indicate cases where the IP address was blacklisted before appearing in the consensus. Figure (c) compares the fraction of public IP addresses of different types of networks that are currently in any tracked blacklist.
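The enlistment delay plotted in panels (a) and (b) reduces to a simple timestamp difference; a minimal sketch, assuming per-IP first-seen timestamps are available:

```python
from datetime import datetime

def enlistment_delay_hours(first_consensus, first_blacklisted):
    """Hours from an IP's first appearance in the Tor consensus to its
    first appearance on a blacklist. Negative values mean the IP was
    blacklisted before it ever appeared in the consensus (proactive)."""
    delta = first_blacklisted - first_consensus
    return delta.total_seconds() / 3600.0
```

A reactive feed yields mostly positive delays, while a proactive feed that crawls the consensus yields delays near zero or negative.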
work. Some relays have historically, at different points in time, been both exit and non-exit relays in the Tor con-
Figure 3: Fraction of Tor relay and HMA VPN IP addresses listed in IP blacklists (including proactive and reactive).
Some feed names are derived based on the broad categories of undesired traffic they blacklist: e.g., ssh (badips-ssh,
dragon-ssh), content management systems/Wordpress (badips-http, badips-cms).
5.5 Exit policies and bandwidth

We looked for but did not find any associations between various factors and blacklisting. In an attempt to counter IP blacklisting and abusive traffic, the Tor community has suggested that exit operators adopt more conservative exit policies [18]. Intuitively, a more open exit policy allows a larger variety of traffic (e.g., BitTorrent, ssh, telnet) that can lead to a larger variety of undesired traffic seen to originate from an exit. We analyze the exit relays that first appeared in the consensus after the IP reputation system started to gather data, using the hourly consensuses of 2015 and 2016. Since exit relays have a variety of exit policies, we find which well-known exit policy (Default, Reduced, Reduced-Reduced, Lightweight, Web) most closely matches the relay's exit policy. To compute this closeness between exit policies, we calculate the Jaccard similarity between the set of open ports on a relay and each well-known exit policy (see Appendix B). In this way, we associated approximate exit policies to 21,768 exit relays. We found that in the last 18 months, only 1.2% of exit relays have exhibited different well-known exit policies, and excluded these from our analysis. In the resulting set of exits, we assigned 81% to Default, 17% to Reduced, 0.6% to Reduced-Reduced, 0.5% to Lightweight and 0.4% to Web policy.

We also compute the uptime (in hours) for each of the exit relays as the number of consensuses in which the relay was listed. In addition, we maintain the series of consensus weights that each relay exhibits in its lifetime. Higher consensus weights imply more traffic travelling through the relay, proportionally increasing the chance of undesired traffic from a relay. A high uptime increases the chance of use of a relay for undesired activities.

We trained a linear regression model on the policies, scaled uptimes, and consensus weights of exit relays, where the observed variable was the ratio of hours the IP address was blacklisted (reactive blacklisting only) to its overall uptime. Based on the coefficients learned by the regression model, we conclude that policy, consensus weight, and relay uptime have very little observed association with IP blacklisting of Tor relays. We provide more details about the regression model in Appendix C.

5.6 Our Newly Deployed Exit Relays

As described in Section 3, we operated exit relays of various bandwidth capacities and exit policies to actively monitor the response of the IP reputation system. In this subsection, we analyze the sequence of blacklisting events for each exit relay that we ran. Figure 5 shows the timeline of blacklisting events for each of the exit relays we operated. Each coloured dot represents an event. An event is either the appearance of a relay on a blacklist or its appearance in the consensus (an up event).

Prior to launching the exits, none of our prospective relays' IP addresses were on any blacklist. We see that within less than 3 hours of launching, feeds like Snort IP listed all our relays, supporting our classification of Snort IP as a proactive blacklist. Additionally, both Snort IP and Paid Blacklist (also classified as proactive) block our relay IP addresses for long periods of time. Snort IP enlisted all of our relays, and did not remove them for the entire duration of their lifetime. Paid Blacklist enlists IP addresses for durations of over a week. Blacklists such as badips-ssh (for protecting SSH) and badips-cms (for protecting content management systems such as Wordpress and Joomla) have short bans spanning a few days. Contributed Blacklist 12 has the shortest bans, lasting only a few hours. We consider Contributed Blacklist 12's blacklisting strategy in response to undesired traffic to be in the interest of both legitimate Tor users and content providers that do not intend to lose benign Tor traffic.

On November 29, 2016, we turned off all of our relays
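The Jaccard matching of a relay's open ports against the well-known exit policies (Section 5.5) can be sketched as follows. The port sets given for each policy are illustrative placeholders, not the actual policy definitions (those are in the paper's Appendix B).

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Illustrative stand-ins for the well-known exit policies' open ports.
WELL_KNOWN_POLICIES = {
    "Web": {80, 443},
    "Lightweight": {53, 80, 443},
    "Reduced": {20, 21, 22, 53, 80, 110, 143, 443, 993, 995},
}

def closest_policy(relay_open_ports):
    """Return the well-known policy with the highest Jaccard similarity
    to the relay's set of open ports."""
    return max(WELL_KNOWN_POLICIES,
               key=lambda name: jaccard(relay_open_ports,
                                        WELL_KNOWN_POLICIES[name]))
```

For instance, a relay allowing only ports 80 and 443 maps to the "Web" policy, since its Jaccard similarity with that port set is 1.0.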
[Figure 5 plot: blacklisting events for relays Small-RR-1/2, Small-Default-1/2, Medium-RR-1/2, Medium-Default-1/2, and Large-Default-1/2, across the feeds badips-http, Contributed BL 12, paid-ip-blacklist, badips-ssh, Contributed BL 9, and snort-ip, from 10/10/16 through 01/30/17.]

Figure 5: Blacklisting of our exit relays over time. Each coloured dot shows the instant when a relay was on a blacklist. Snort IP and Paid Blacklist have long-term bans while other blacklists enlist IPs for short periods of time ranging from hours to a few days.
to observe how long a proactive blacklist like Snort IP would take to de-enlist our relays. We observe that such blacklists drop our relays just as fast as they enlist them, suggesting a policy of crawling the Tor consensus.

Note that a synchronised absence of data from any blacklist, while the relays are up, represents an outage of the IP reputation system.

6 Crawling via Tor

To quantify the number of websites discriminating against Tor, we performed crawls looking both at front-page loads, as in prior work [2], and at search and login functionality. We crawled the Alexa Top 500 web pages from a control host and a subset of Tor exit relays. These crawls identify two types of discrimination against Tor users: (1) the Tor user is blocked from accessing content or a service accessible to non-Tor users, or (2) the Tor user can access the content or service, but only after additional actions not required of non-Tor users—e.g., solving a CAPTCHA or performing two-factor authentication.

6.1 Crawler Design

We developed and used a Selenium-based interactive crawler to test the functionality of websites. We performed three types of crawls: (1) Front-page crawls attempt to load the front page of each website. We repeated the crawl four times over the course of six weeks. (2) Search functionality crawls perform front-page loads and then use one of five heuristics (Table 5) to scan for the presence of a "search box". Upon finding the search box, the crawler enters and submits a pre-configured search query. Our crawler found and tested the search functionality of 243 websites from the Alexa Top 500. We performed the search functionality crawl once. (3) Login functionality crawls load front pages and scan them for the presence of a "login" feature. Upon finding the feature, and if it has credentials for the webpage available, the crawler authenticates itself to the site (using Facebook/Google OAuth when site-specific credentials were unavailable). We created accounts on OAuth-compatible websites prior to the crawl. Since the created accounts had no prior history associated with them, we speculate that they were unlikely to be blocked as a result of unusual behavior. For example, we found that LinkedIn blocks logins from Tor for accounts with prior login history, but not for new accounts. Our crawler found and tested the login functionality of 62 websites from the Alexa Top 500. We performed the login functionality crawl once.

The crawler records screenshots, HTML sources, and HARs (HTTP Archives) after each interaction. Our interactive crawler improves upon previous work in several ways. First, it uses a full browser (Firefox) and incorporates bot-detection avoidance strategies (i.e., rate-limited clicking, interacting only with visible elements, and action chains which automate cursor movements and clicks). These features allow it to avoid the bot-based blocking observed while performing page-loads via utilities such as curl and other non-webdriver libraries (urllib). Second, its ability to interact with websites and exercise their functionality allows us to
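Search-box detection of the kind described above can be approximated even without a browser. The sketch below applies a few hypothetical HTML patterns, stand-ins for the five heuristics of Table 5; the actual crawler instead drives Firefox via Selenium and inspects live DOM elements.

```python
import re

# Hypothetical detection patterns (stand-ins for the paper's heuristics).
SEARCH_BOX_PATTERNS = [
    re.compile(r'<input[^>]*type=["\']search["\']', re.I),
    re.compile(r'<input[^>]*name=["\'][^"\']*search[^"\']*["\']', re.I),
    re.compile(r'<input[^>]*placeholder=["\'][^"\']*search[^"\']*["\']', re.I),
]

def has_search_box(html):
    """Heuristically decide whether a page's HTML exposes a search box."""
    return any(p.search(html) for p in SEARCH_BOX_PATTERNS)
```

A page passing this check would then be exercised by submitting the pre-configured query through the detected input element.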
When exercising the search functionality of the 243 search-compatible websites, we see a 3.89% increase in discrimination compared to the front-page load discrimination observed for the same set of sites. Similarly, when exercising the login functionality of the 62 login-compatible websites, we observe a 7.48% increase in discrimination compared to the front-page discrimination observed for the same set of sites.

Websites   Interaction           Discrimination observed
A-500      Front page            20.03%
S-243      Front page            17.44%
           Front page + Search   21.33% (+3.89%)
L-62       Front page            17.08%
           Front page + Login    24.56% (+7.48%)

[Figure 7: (a) Distribution of discrimination faced by relays (CDF of relays vs. fraction of sites discriminating against relay); (b) CDF of sites vs. fraction of relays discriminated against by site.]

Alexa Top 500. We find that no relay experiences discrimination by more than 32.6% of the 500 websites, but 50% of the exit relays are discriminated against by more than 27.4% of the 500 websites. Figure 7b shows the distribution of discrimination performed by websites against Tor exit relays. Here, we see that 51% of the websites perform discrimination against fewer than 5% of our studied exits, while 11% of websites perform discrimination against over 70% of our studied exits.

Figure 8: Distribution of discrimination performed by websites hosted on four of the six most popular hosting platforms.
We now examine various factors associated with Tor discrimination. Since we did not (and in many cases cannot) randomly assign these factors to websites or relays, these associations might not be causal.

Hosting Provider. Figure 8 shows the fraction of relays discriminated against by websites hosted on four of the six most-used hosting platforms. We find that Amazon- and Akamai-hosted websites show the most diversity in discrimination policy, which we take as indicative of websites deploying their own individual policies and blacklists. In contrast, CloudFlare has several clusters of websites, each employing a similar blacklisting policy. This pattern is consistent with CloudFlare's move to allow individual website administrators to choose from one of several blocking policies for Tor exit relays [1]. Finally, we see 80% of China169- and CloudFlare-hosted websites perform discrimination against at least 60% of our studied relays.

Relay Characteristics. Our analysis of the association between exit-relay characteristics and the discrimination faced by them found no significant correlations when accounting for relay openness (the fraction of ports for which the exit relay will service requests) or for the age of the relay. We found a small positive correlation (Pearson correlation coefficient: 0.147) between the relay bandwidth and degree of discrimination faced, but the result was not statistically significant (p-value: 0.152). Figure 9 presents these results graphically. We further analyze the impact of relay characteristics on discrimination performed by websites using popular hosting providers. We find that only Amazon has a statistically significant positive correlation between discrimination observed and relay bandwidth (Pearson correlation coefficient: 0.247, p-value: 0.015). These results are illustrated in Figure 10.

Service Category. We now analyze how aggressively
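A Pearson correlation like those reported above can be computed directly. The sketch below is a plain-Python implementation of the coefficient only; the significance test (p-value) the authors also report would additionally require a t-test on the coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Here `xs` could hold relay bandwidths and `ys` the fraction of sites discriminating against each relay.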
[Figure 9 plot: bandwidth (KBps, log scale) vs. discrimination faced; legend entry: Age PCC: -0.07, p-value: 0.51.]

Figure 9: Relationship between relay characteristics and discrimination faced. Each circle represents a single relay. Lighter colors indicate younger relays and larger circles indicate more open exit policies. The legend shows the Pearson correlation coefficient (PCC) and the p-value for each characteristic.

working and online shopping sites share similar blocking behavior. Websites in these categories are also observed to be the most aggressive—with 50% of them blocking over 60% of the chosen relays. Figure 11 illustrates the results.

Figure 11: Distribution of discrimination performed by websites in various categories.

The Evolution of Tor Discrimination. We now focus on discrimination changes over time. For this experiment, we conducted four crawls via our own ten exit relays to the Alexa Top 500 websites. Let Day 0 denote the day when we set the relay's exit flag. We conducted crawls on Day -1, Day 0, and once a week thereafter. Table 7 shows the fraction of websites found to discriminate against each exit set during each crawl. We observe increases in discrimination when the exit flag is assigned. We can attribute some of this to our improved crawling methodology deployed on Day 0 (the Day -1 crawl utilized the crawler from Khattak et al.; see below), although we note that the IP addresses used by our exit relays were never used by other Tor exit relays in the past, and did not appear in any of our studied com-

[Figure 12 plot: fraction of Amazon discriminating sites vs. fraction of relays discriminated against by site; legend: Our crawl, Khattak et al.]

Figure 12: Impact of methodological changes on measured discrimination from data generated by a single front-page crawl.

Measurement Methodology. We now measure the impact of changes in our discrimination identification methodology compared to previous work by Khattak et al. [2]. The key differences between the methodologies are: (1) The measurements conducted by Khattak et al.
[Figure 13 panels: fraction of HTTP error codes and fraction of incomplete TLS handshakes, by URL category S0–S14.]

(a) Fraction of HTTP error codes   (b) Fraction of incomplete TLS handshakes

Figure 13: Fraction of errors encountered by users visiting the Top 1M websites over time. The URL category S1 consists of the top (1–100) websites and Sn (n ≥ 2) consists of sites in the top 100 × 2^(n−2) + 1 to 100 × 2^(n−1) ranks.
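The category boundaries in the Figure 13 caption can be computed as follows; this is just the caption's formula transcribed, with the function name chosen for illustration.

```python
def category_ranks(n):
    """Alexa rank range covered by URL category S_n (n >= 1):
    S1 covers ranks 1-100; S_n (n >= 2) covers
    100 * 2**(n-2) + 1 through 100 * 2**(n-1)."""
    if n == 1:
        return (1, 100)
    return (100 * 2 ** (n - 2) + 1, 100 * 2 ** (n - 1))
```

So S2 covers ranks 101–200, S3 covers 201–400, and so on, with each category doubling the span of the previous one.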
Ultimately, we do not view IP-based blacklisting as a suitable long-term solution for the abuse problem. In addition to Tor aggregating together users' reputations, IPv4 address exhaustion has resulted in significant IP address sharing. IPv6 may introduce the opposite problem: the abundance of addresses may make it too easy for a single user to rapidly change addresses. Thus, in the long run, we believe that online service operators should shift to more advanced ways of curbing abuse; ideally, ones compatible with Tor.

Acknowledgements

The authors would like to thank Facebook Threat Exchange for providing IP blacklists and Tor exit operators: Moritz Bartl (Torservers.net), Kenan Sulayman (apx), Riccardo Mori (jahjah), and the operator of the exit relay TorLand1 for sharing the abuse complaints they received. We are grateful to David Fifield, Mobin Javed and the anonymous reviewers for helping us improve this work. We acknowledge funding support from the Open Technology Fund and NSF grants CNS-1237265, CNS-1518918, CNS-1406041, CNS-1350720, CNS-1518845, CNS-1422566. The opinions in this paper are those of the authors and do not necessarily reflect the opinions of a sponsor or the United States Government.

[3] Tor Project: Anonymity Online. Tor Metrics. Available at https://metrics.torproject.org.

[4] Peter Zavlaris. Cloudflare vs Tor: Is IP Blocking Causing More Harm than Good? Distil Networks' Blog. Available at https://resources.distilnetworks.com/all-blog-posts/cloudflare-vs-tor-is-ip-blocking-causing-more-harm-than-good.

[5] Christophe Cassa. Tor – the good, the bad, and the ugly. Sqreen blog, 2016. Available at https://blog.sqreen.io/tor-the-good-the-bad-and-the-ugly/.

[6] Akamai. Akamai's [state of the internet] / security, Q2 2015 report. https://www.stateoftheinternet.com/downloads/pdfs/2015-cloud-security-report-q2.pdf.

[7] Ben Herzberg. Is TOR/I2P traffic bad for your site? Security BSides London 2017. Available at https://www.youtube.com/watch?v=ykqN36hCsoA.

[8] IBM. IBM X-Force Threat Intelligence Quarterly, 3Q 2015. IBM website. Available at http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=WGL03086USEN.
[17] The Tor Project. Is there a list of default exit ports? Tor FAQ. Accessed Feb. 14, 2017. https://www.torproject.org/docs/faq.html.en#DefaultExitPorts.

[18] Contributors to the Tor Project. ReducedExitPolicy. Tor Wiki, 2016. Version 33 (May 8, 2016). https://trac.torproject.org/projects/tor/wiki/doc/ReducedExitPolicy?version=33.

[19] Details for: apx1. Atlas. Available at https://atlas.torproject.org/#details/51377C496818552E263583A44C796DF3FB0BC71B.

[20] Details for: apx2. Atlas. Available at https://atlas.torproject.org/#details/A6B0521C4C1FB91FB66398AAD523AD773E82E77E.

[21] Details for: apx3. Atlas. Available at https://atlas.torproject.org/#details/38A42B8D7C0E6346F4A4821617740AEE86EA885B.

[22] TorLand1 history. ExoneraTor. Available at https://exonerator.torproject.org/?ip=37.130.227.133&timestamp=2017-01-01&lang=en.

[23] Details for: jahjah. Atlas. Available at https://atlas.torproject.org/#details/2B72D043164D5036BC1087613830E2ED5C60695A.

[30] VPN Gate Academic Experiment Project at National University of Tsukuba, Japan. VPN Gate: Public VPN Relay Servers. http://www.vpngate.net/en/.

[31] Privax, Ltd. Free Proxy List – Public Proxy Servers (IP PORT) – Hide My Ass! http://proxylist.hidemyass.com.

[32] The Internet Archive. Internet Archive: Wayback Machine. https://archive.org/web/.

[33] Evan Klinger and David Starkweather. pHash – the open source perceptual hash library. pHash, 14(6), 2012.

[34] McAfee. Customer URL ticketing system. www.trustedsource.org/en/feedback/url?action=checklist.

[35] Distil Networks. Help Center: Third Party Browser Plugins That Block JavaScript. https://help.distilnetworks.com/hc/en-us/articles/212154438-Third-Party-Browser-Plugins-That-Block-JavaScript.

[36] Tariq Elahi, George Danezis, and Ian Goldberg. PrivEx: Private collection of traffic statistics for anonymous communication networks. In Proceedings of the 2014 ACM SIGSAC Conference on
Censorship A routing-capable adversary can censor traffic that enters its borders. Commonly, with Tor, this involves identifying the set of active Tor routers and simply dropping traffic to or from these hosts. The Tor Metrics project monitors several countries who appear to perform this kind of censorship [37].

Traffic deanonymization Consider an attacker who is able to observe the traffic on a circuit between the source and the entry node and between the exit node and the destination. The attacker can correlate these two seemingly independent flows of traffic in a handful of ways. For instance, a routing-capable adversary operating a router on the path between source and entry could introduce jitter between packets that it could detect in the packets between exit and destination. This works because Tor routers do not delay or shuffle their packets, but rather send them immediately in order to provide low latencies [6].

Website fingerprinting Even an attacker limited to observing only the traffic between the source and entry node can be capable of deanonymizing traffic. In particular, if the destination's traffic patterns (e.g., the number of bytes transferred in response to apparent requests)

AS-aware Tor variants The work perhaps closest to ours in terms of overall goals is a series of systems that try to avoid traversing particular networks once (or twice). To the best of our knowledge, these have focused almost exclusively on using autonomous system (AS)-level maps of the Internet [2, 19, 10]. Like DeTor, the idea is that if we can reason about and enforce where our packets may (or may not) go between hops in the circuit, we can address attacks such as censorship and certain forms of traffic deanonymization.

As described in §2.2, however, we assume in this paper an adversary who has the ability to manipulate an observer's map of the Internet, for instance by withholding some routing advertisements, withholding traceroute responses, and so on. Instead of relying on these manipulable data sources, we base our proofs on physical, natural limitations: the fact that information cannot travel faster than the speed of light. As an additional departure from this line of work, we focus predominantly on nation-state adversaries, which are easier to locate and geographically reason about than networks (which may have points of presence throughout the world).

DeTor's proofs of avoidance come at a cost that systems that do not offer provable security do not have to
(a) When Ce (yellow) and Cx (darker blue) do not intersect, double-traversal of any country is impossible. (b) When Ce and Cx do intersect (dark green), we must compute the shortest distances for both legs to go through each country (right). The dashed lines show the shortest distances through France, and the solid lines through Belgium.

Figure 2: Never-twice: To prove that a Tor circuit did not traverse any given country at the beginning and end of the circuit, we compute the set of countries Ce that could have been on the path of the entry leg and the countries Cx that could have been on the exit leg. This figure shows two example circuits with different exit nodes.
we have our proof of never-twice avoidance.

However, if the intersection is non-empty, as in Figure 2b, then we need to perform additional checks. For each country F ∈ Ce ∩ Cx, we ensure that the minimum RTT to go through the entry leg and F plus the minimum RTT to go through the exit leg and F is larger than the end-to-end RTT would allow:

∀F ∈ Ce ∩ Cx : (1 + δ) · Re2e < (3/c) · (Dmin(s, F, e) + D(e, m, x) + Dmin(x, F, t))    (5)

The subtle yet important difference between Eq. (5) and the previous equations is that the right-hand side need not minimize distance with respect to a single f ∈ F. Rather, there could be two distinct points in F: one on the entry leg's path and another on the exit leg's. This puts the attacker at a greater advantage; consider the above example wherein the entry leg was geographically isolated to western France and the exit leg was isolated to eastern France. When a single f ∈ F is required to be present on both legs, the packets would be required to traverse an extra distance of roughly twice the width of France. But with separate points in F, it could impose arbitrarily low additional delay.

What Eq. (5) does share in common with the other equations is that it can be computed purely locally, using only the knowledge of the circuit relays' locations and a database of countries' borders, which are readily available [16].

Never-twice avoidance of colluding countries Finally, we consider how to avoid deanonymization attacks from a group of countries who might coordinate their efforts. For example, the Five Eyes is an alliance of five countries (Australia, Canada, New Zealand, the United Kingdom, and the United States) who have agreed to share intelligence. Were such a group of countries to collude, then traversing one of them on the entry leg and another on the exit leg could result in a successful deanonymization attack.

Our above method for avoiding double-traversal of a country extends naturally to colluding nation-states. One can simply use a modified database of country borders to flag all those in an alliance as a single "country." That this would result in a set of disjoint geographic polygons is of no concern to our algorithm; in fact, many countries are already made up of disjoint regions (for instance, islands off of a country's coast).

5 DeTor Design

The previous section demonstrates how to prove, for a given circuit, whether a single round-trip of communication provably avoided a region (once or twice). Unfortunately, not all circuits can provide such proofs, even if they were to minimize latencies between all hosts. Trivially, any circuit with a Tor router in some region F cannot be used to avoid F. Subtler issues can also arise, such as when two consecutive hops on a circuit are in direct line-of-sight of a forbidden region.

In this section, we describe how DeTor identifies
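The per-country check of Eq. (5) can indeed be computed locally from relay coordinates and border data. The following is an illustrative sketch, not DeTor's implementation: the helper names, the sample border points, and all coordinates are hypothetical, and computing the candidate sets Ce and Cx is omitted (the `countries` argument is assumed to already be their intersection).

```python
import math

C_KM_PER_MS = 299.792458  # speed of light c, in km per millisecond

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def d_min_via(p, region_pts, q):
    """Dmin(p, F, q): shortest detour from p to q through some point of F.
    Minimized independently for each leg, as Eq. (5) permits."""
    return min(haversine_km(p, f) + haversine_km(f, q) for f in region_pts)

def never_twice_proof(s, e, m, x, t, r_e2e_ms, delta, countries):
    """Return True iff Eq. (5) holds for every country F in Ce ∩ Cx:
    (1 + delta) * Re2e < (3/c) * (Dmin(s,F,e) + D(e,m,x) + Dmin(x,F,t)).
    `countries` maps names to sample border points (hypothetical data)."""
    d_mid = haversine_km(e, m) + haversine_km(m, x)  # D(e, m, x)
    for pts in countries.values():
        detour = d_min_via(s, pts, e) + d_mid + d_min_via(x, pts, t)
        # Packets travel at no more than ~2/3 c, so a one-way detour of
        # `detour` km implies at least 3 * detour / c ms of round-trip time.
        if not (1 + delta) * r_e2e_ms < 3 * detour / C_KM_PER_MS:
            return False  # F could have been on both the entry and exit legs
    return True
```

Merging the borders of an alliance under one key in `countries` (e.g., all Five Eyes members as a single entry) extends the same check to colluding nation-states, as described above.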
[Figure 4 plot: for each country (China, India, Japan, PR Korea, Russia, Saudi Arabia, Syria, US), the fraction of src-dest pairs (y-axis, 0–0.8) that were Successful, Failed, No possible, No trusted, or In forbidden region, as δ varies from 0.00 to 1.00 (x-axis).]
Figure 4: DeTor’s success at never-once avoidance, and reasons for failure, across multiple choices of forbidden
regions and δ . Overall, DeTor is successful at avoiding all countries, even those prevalent on many paths, like the US.
tion of all source-destination pairs who (from bottom to top): (1) terminate in the forbidden region and therefore cannot possibly achieve avoidance, (2) do not have any trusted nodes, typically because they are too close to the forbidden region to ensure that anyone they are communicating with is not in it, (3) have trusted nodes but no circuits that could possibly provide provable avoidance, (4) have circuits that could theoretically avoid the forbidden region, but none that do with real RTTs, and (5) successfully avoid the forbidden region over at least one DeTor circuit.

The key takeaway from this figure is that DeTor is generally successful at finding at least one DeTor circuit for all countries and all values of δ. We note two exceptions to this: Russia can only be avoided by approximately 35% of all source-destination pairs when δ = 0.5. We believe this is due to the fact that Russia is close to the large cluster of European nodes in our dataset.

The US is another example of a somewhat lower success rate; this is due, again, to our dataset comprising many nodes from the US, and thus 45% of all pairs in our dataset cannot possibly avoid the US. However, of the remaining source-destination pairs who do not already terminate in the US, 75% can successfully, provably avoid the US. We find this to be a highly encouraging result, particularly given that the US is on very many global routes on the Internet. We note that this is a higher avoidance rate than Alibi Routing was able to achieve; we posit that this is because DeTor uses longer circuits, thereby allowing it to maneuver around even nearby countries by first "stepping away" from them. Investigating the quality of longer DeTor circuits is an interesting area of future work.

We also observe from Figure 4 that larger values of δ lead to lower likelihoods of avoidance, as expected. This is particularly pronounced with Russia, Syria, and Saudi Arabia; we believe that this, too, is because these countries are near the cluster of European nodes. Interestingly, this impact is least pronounced with the more routing-central adversaries we tested (Japan and the US). Some have proposed defense mechanisms that introduce packet forwarding delays in Tor [9, 5, 20]; these results lend insight into how these defenses would compose with DeTor. In particular, note that increasing δ in essence simulates greater end-to-end delays, which these defense mechanisms would introduce. Thus, with greater delay (intentional or not), DeTor experiences a lower likelihood of providing proofs of avoidance.

Number of DeTor circuits The above results show that we are successful at identifying at least one DeTor circuit for most source-destination pairs. We next look at how many DeTor circuits are available to each source-destination pair.

Figure 5 shows the distribution, across all source-destination pairs in our dataset, of the number of circuits that (1) offered successful never-once avoidance, (2) were estimated to be possible (but may not have achieved avoidance with real RTTs), and (3) were trusted, but not necessarily estimated to be possible. We look specifically at the number of circuits while attempting to avoid the US and China, with δ = 0.5.
[Figure 5 plots: cumulative fraction of src-dst pairs (y-axis, 0–0.8) vs. number of circuits (x-axis, 10^0–10^5, log scale), two panels.]
Figure 5: The distribution of the number of circuits that DeTor is able to find while avoiding (a) US and (b) China,
with δ = 0.5. Some source-destination pairs get only a single DeTor circuit, but the majority get 500 or more.
[Figure 6 plots: cumulative fraction of circuits (y-axis, 0–1) vs. fraction of src-dst pairs (x-axis, 0–0.02), two panels.]
Figure 6: The distribution of the fraction of source-destination pairs for which a given circuit successfully provides
provable avoidance. (δ = 0.5)
at its disposal; when avoiding China, this number is 500. Even the most successful source-destination pairs tend to have far fewer successful circuits than "trusted" Tor circuits. These results allow us to infer how well Tor's current policies work. Recall that, in today's Tor, users can specify a set of countries from which they wish not to choose relays on their circuit. This is similar to the "trusted" line in the plots of Figure 5 (in fact, because we actually test the ping times to verify that it is not in the country, Tor's policy is even more permissive). This means that roughly 88% of the time (comparing the successful median to the trusted median), Tor's approach to avoidance would in fact not be able to deliver a proof of avoidance. It is in this sense that we say that Tor offers its users merely the illusion of control.

Altogether, these results demonstrate the power of DeTor—simply relying on random chance is highly unlikely to result in a circuit with provable avoidance.

Given that there are source-destination pairs that have only a handful of DeTor circuits, we ask the converse: are there some circuits that offer avoidance for only a small set of source-destination pairs? If so, then this opens up potential attacks wherein knowing the circuit could uniquely identify the source-destination pair using that circuit. To evaluate this, we show in Figure 6 the distribution of the fraction of source-destination pairs for which a given circuit successfully provides proof of avoidance. The median circuit achieves provable avoidance for only 1.4% of source-destination pairs avoiding the US; 0.6% when avoiding China. These numbers are lower than desired (compare them to standard Tor routing, for which nearly 100% of circuits are viable), but we believe they would be more reasonable in practice, for two reasons: First, the Ting dataset we use is not representative of the kind of node density that exists in the Tor network; in collecting that dataset, the experimenters explicitly avoided picking many hosts that were very close to one another, yet proximal peers are common in Tor. Second, in our simulations, we choose our sources and destinations from the Tor nodes in our dataset; in practice, clients and destinations represent a far larger, more diverse set of hosts, and thus, we believe, would be much more difficult to deanonymize.

6.2.2 Circuit diversity

Having many circuits is not enough to be useful in Tor; it should also be the case that there is diversity among the set of hosts on the DeTor circuits. Otherwise,
[Figure 7 plots: CDF (y-axis, 0–0.8) vs. probability of being on a circuit (x-axis, 10^-6–10^0, log scale), two panels.]
Figure 7: Distribution of the 50th, 75th, and 90th percentile probabilities of a node being selected to be on a circuit,
taken across all DeTor circuits across source-destination pairs. Vertical lines denote these same percentiles across all
Tor circuits. DeTor introduces only a slight skew, preferring some nodes more frequently than usual. (δ = 0.5)
[Figure 8 plots: CDF (y-axis, 0–0.8) vs. round-trip time in msec (x-axis, 0–1400), two panels.]
Figure 8: The distribution of round-trip times for DeTor circuits (δ = 0.5) and regular Tor circuits. Because avoidance
becomes more difficult with higher-RTT circuits, DeTor’s successful circuits tend to have lower RTTs.
popular Tor routers may become overloaded, and it becomes easier to predict which Tor routers will be on a circuit, thereby potentially opening up avenues for attack. We next turn to the question of whether the set of circuits that DeTor makes available disproportionately favors some Tor routers over others.

To measure how equitably DeTor chooses available Tor relays to be on its circuits, we first compute, for each source-destination pair, the probability distribution of each Tor relay appearing on a successful DeTor circuit. Figure 7 shows the distribution of the 50th, 75th, and 90th percentiles across all source-destination pairs. As a point of reference, the vertical lines represent these same percentiles for Tor's standard circuit selection (recall that Tor does not choose nodes uniformly at random, but instead weights them by their bandwidth).

We find that DeTor's median probability of being chosen to be in a circuit is less than normal, as evidenced by the 50th percentile curve being almost completely less than the 50th percentile spike. When avoiding the US, there is a slight skew towards more popular nodes, as evidenced by the 75th percentile also being less than normal. When avoiding China, on the other hand, DeTor's 90th percentile is typically less than Tor's, indicating that DeTor more equitably chooses nodes to be on its circuits.

It is true that DeTor may result in load balancing issues, especially if Tor routers are not widely geographically dispersed – this is fundamental to DeTor: after all, if many users are avoiding the US, then all of this load would have to shift from the US to other routers. However, as shown in Figure 7, while DeTor does introduce some node selection bias, it is within the skew that Tor itself introduces.

6.2.3 Circuit performance

We investigate successful DeTor circuits by their latency and expected bandwidth. Figure 8 compares the distribution of end-to-end RTTs through successful DeTor circuits to the RTT distribution across all Tor circuits in our dataset. DeTor circuits have significantly lower RTTs—on the one hand, this is a nice improvement in performance. But another way to view these results is that DeTor precludes selection of many circuits, predominately those with longer RTTs. For some source-destination-forbidden region triples, this is a necessary byproduct of the fact that we are unlikely to be able
[Figure 9 plots: cumulative fraction of circuits (y-axis, 0–0.8) vs. bandwidth in kbps (x-axis, 1–100000, log scale), two panels.]
Figure 9: The distribution of minimum bandwidth for DeTor circuits (δ = 0.5) and regular Tor circuits.
[Figure 10 plots: fraction of circuits successful (y-axis, 0–0.8) vs. ratio of max acceptable distance to min possible distance (x-axis, 5–50), two panels.]
Figure 10: Success rates for never-once circuits as a function of the ratio between the maximum acceptable distance
(through the circuit but not through F) and the minimum distance (directly through the circuit). This shows a positive
correlation, indicating that it is feasible to predict which circuits will be successful. (δ = 0.5)
to get proofs of avoidance if we must traverse multiple trans-oceanic links. In these examples, China has access to some circuits with longer RTTs, because it is farther away from many of our simulated hosts than is the US.

Figure 9 compares bandwidths of DeTor and Tor circuits. For each circuit, we take the minimum bandwidth, as reported by Tor's consensus bandwidths. Here, we see largely similar distributions between DeTor and Tor, with Tor having more circuits with lower bandwidth. We suspect that those lower-bandwidth hosts that Tor makes use of may also have higher-latency links, therefore making them less likely to appear in DeTor circuits.

6.2.4 Which circuits are more likely to succeed?

As Figure 5 showed, it is not uncommon for there to be one to two orders of magnitude more circuits that meet the theoretical requirements for being a DeTor circuit than there are circuits that achieve avoidance in practice. In a deployed setting, a client would ideally be able to identify which circuits are more likely to work before actually going through the trouble of setting up the connection and attaching a transport stream to it.

As a predictor for a circuit's success for never-once avoidance, we take the ratio of the maximum acceptable distance (how far the packet could travel through the circuit without traversing the forbidden region) to the minimum possible distance (the direct great-circle distance through the circuit). Our insight is that the larger this ratio is, the more "room for error" the circuit has, and the more resilient it is to links whose RTTs deviate from two-thirds the speed of light.

Figure 10 shows that this ratio corresponds to the fraction of theoretically-possible circuits that achieve successful avoidance. As this ratio increases from 0 to 10, there is a clear positive correlation with success. However, with large ratio values, the relationship becomes less clear; this is largely due to the fact that large ratio values can be a result of very small denominators (the shortest physical distance).

These results lend encouragement that clients can largely determine a priori which circuits are likely to provide provable avoidance. Exploring more precise filters is an area of future work.
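The ratio predictor for never-once avoidance can be sketched as follows. This is an illustrative simplification, not DeTor's released code: we assume great-circle geometry, approximate the forbidden region by sample points, and approximate the "shortest distance through the circuit and the region" by inserting a single region point into each leg.

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def path_km(hops):
    """Length of a path (e.g., s -> e -> m -> x -> t) along great circles."""
    return sum(haversine_km(p, q) for p, q in zip(hops, hops[1:]))

def detour_km(hops, region_pts):
    """Shortest circuit length when forced through the forbidden region:
    try inserting each sampled region point into each leg of the circuit."""
    best = math.inf
    for i in range(len(hops) - 1):
        for f in region_pts:
            d = (path_km(hops[: i + 1]) + haversine_km(hops[i], f)
                 + haversine_km(f, hops[i + 1]) + path_km(hops[i + 1:]))
            best = min(best, d)
    return best

def room_for_error(hops, region_pts):
    """The Figure 10 predictor: maximum acceptable distance (through the
    circuit but not through F) over the minimum possible distance (directly
    through the circuit). Larger ratios leave more slack for links slower
    than two-thirds the speed of light."""
    return detour_km(hops, region_pts) / path_km(hops)
```

A ratio near 1 means the forbidden region sits essentially on the circuit's direct path, so almost no measured RTT can rule it out; a large ratio leaves ample room for real-world link slack.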
6.3.1 How often does never-twice work?

Recall that, unlike never-once, there are no forbidden regions explicitly stated a priori with never-twice. Therefore, to evaluate how well never-twice works, we measure the number of source-destination pairs that yield a successful DeTor circuit.

Ruling out the source-destination pairs who are in the same country (as these can never avoid a double-transit), we find that 98.6% of source-destination pairs can find at least one never-twice DeTor circuit. This is a very promising result, as it demonstrates that simple client-side RTT measurements may be enough to address a wide range of attacks. In the remainder of this section, we investigate the quality of the circuits that our never-twice avoidance scheme finds.

[Figure 11 plot: cumulative fraction of src-dst pairs (y-axis, 0–0.8) vs. number of circuits (x-axis, 10^0–10^5, log scale).]
Figure 11: The distribution of the number of circuits that DeTor is able to find with never-twice (δ = 0.5).

Turning once again to the number of circuits, Figure 11 compares the number of circuits that DeTor identified as possibly resulting in a proof of avoidance (as computed using Eq. (5)), and those that were successful given real RTTs. Never-twice circuits tend to succeed with approximately 5× the number of circuits that never-once receives. This demonstrates how fundamentally different these problems are, and that our novel approach of computing "forbidden" countries on the fly (as opposed to some a priori selection of countries to avoid with never-once) results in greater success rates.

6.3.2 Circuit diversity

We turn again to the question of how diverse the circuits are; are some Tor relays relied upon more often than others when achieving never-twice avoidance?

Figure 12 shows the percentile distribution across all successful never-twice DeTor circuits. Compared with never-once (Fig. 7), never-twice circuits fall even more squarely within the distribution of normal Tor routers (the vertical spikes in the figures). In particular, the top 10% most commonly picked nodes appear roughly as often as Tor's top 10% (the median 90th percentile is almost exactly equal to Tor's 90th percentile). The median node is slightly less likely to be selected than in Tor, indicating only a small skew to more popular nodes.

6.3.3 Circuit performance

We investigate successful DeTor never-twice circuits, once more turning to latency and expected bandwidth. Figure 13 compares successful never-twice DeTor circuits' RTTs to those of Tor. Compared to never-once (Fig. 8), there are never-twice DeTor circuits with greater RTTs. We conclude from this that never-twice avoidance has the ability to draw from a more diverse set of links—particularly when the entry and exit legs are farther from one another.

Figure 14 reinforces our finding that never-twice has access to a larger set of routers, as the distribution of successful never-twice bandwidths matches those of the full Tor network much more closely.

6.3.4 Which circuits are more likely to succeed?

We close by investigating what influences a never-twice DeTor circuit's success or failure. Figure 15 shows the fraction of possible never-twice circuits that were found to be successful, and plots them as a function of D(e, m, x)/D(s, e, m, x, t). This ratio represents how much of the overall circuit's length is attributable to the middle: that is, everything but the entry and exit legs.

There are several interesting modes in this figure that are worth noting. When this ratio on the x-axis is very low, it means that almost the entire circuit is made up of the entry and exit legs, and therefore they are very likely to intersect—as expected, few circuits succeed at this point. DeTor succeeds more frequently as the middle legs take on a larger fraction of the circuit's distance, but then begins to fail as the length of the middle legs approaches the combined length of the entry and exit legs. This is because, in our dataset, when the middle legs are approximately as long as the entry and exit legs, this tends to correspond to circuits made out of the two clusters of nodes: one in North America and the other in Europe. Because these clusters are tightly packed, the probability of intersecting entry and exit legs increases. This probability of intersection decreases when the circuits no longer come from such tightly packed groups.

When the middle legs dominate the circuit's distance (the ratio in the figure approaches one), we again enter a particular regime in our dataset: these very high ratio values correspond to circuits with source and entry node both in North America (or in Europe), and with middle legs that traverse the Atlantic (and then return).
Figure 12: Nodes' probability of being on a successful never-twice DeTor circuit. (δ = 0.5)

Figure 13: Round-trip times for never-twice DeTor circuits (δ = 0.5) and regular Tor circuits.

Figure 14: Bandwidth for never-twice DeTor circuits (δ = 0.5) and regular Tor circuits.
[17] N. Hopper, E. Y. Vasserman, and E. Chan-Tin. How much anonymity does network latency leak? ACM Transactions on Information and System Security (TISSEC), 13(2):13, 2010.

[29] S. J. Murdoch and P. Zieliński. Sampled traffic analysis by Internet-exchange-level adversaries. In Workshop on Privacy Enhancing Technologies (PET), 2007.
1 Carnegie Mellon University   2 Indiana University Bloomington   3 Samsung   4 University of Chicago
was overprivileged, using the current Samsung SmartThings interface, 48% of participants chose to install the overprivileged app in each of five tasks. With SmartAuth, however, this rate reduces to 16%, demonstrating the value of SmartAuth in avoiding overprivileged apps. We also generated automated patches to the 180 SmartApps to validate compatibility of our policy enforcement mechanism, and we found no apparent conflicts with SmartAuth. Given our observations of the effectiveness of the technique, the low performance cost, and the high compatibility with existing apps and platforms, we believe that SmartAuth can be utilized by IoT app marketplaces to vet submitted apps and enhance authorization mechanisms, thereby providing better user protection.

Our key contributions in this paper are as follows:

• We propose the SmartAuth authorization mechanism for protecting users under current and future smart home platforms, using insights from code analysis and NLP of app descriptions. We provide a new solution to the overprivilege problem and contribute to the process of human-centered secure computing.

• We design a new policy enforcement mechanism, compatible with current home automation frameworks, that enforces complicated, context-sensitive security policies with low overhead.

• We evaluate SmartAuth over real-world applications and human subjects, demonstrating the efficacy and usability of our approach to mitigate the security risks of overprivileged IoT apps.

The remainder of this paper is organized as follows. In Section 2, we present the models and assumptions for our work. In Section 4, we describe the high-level design of SmartAuth. We present the details of our design and implementation in Section 5, and our evaluation of SmartAuth follows in Section 6. We highlight relevant related work in Section 7 and conclusions in Section 9.

1 Our user studies were conducted with IRB approval.

Figure 2: Users install IoT apps through mobile devices, allowing the vendor's IoT cloud to interact with the user's locally deployed devices. IoT apps pair event handlers to devices, issue direct commands, and enable external interaction via the web.

2 BACKGROUND

2.1 Home Automation Systems

Home automation is growing with consumers, and many homeowners deploy cloud-connected devices such as thermostats, surveillance systems, and smart door locks. Recent studies predict home automation revenue of over $100 billion by 2020 [31], drawing even more vendors into the area. As representative examples, Samsung SmartThings and Vera MiOS [54] connect smart devices with a cloud-connected smart hub. Such vendors typically host third-party IoT apps in the cloud, allowing remote monitoring and control in a user's home environment. Figure 2 illustrates a typical home automation system architecture. We use Samsung SmartThings to exemplify key concepts and components of such a system.

IoT apps written by third-party developers can get access to the status of sensors and control devices within a user's home environment. Such access provides the basic building blocks of functionality to help users manage their home, for example turning on a heater only when the temperature falls below the set point. Figure 2 depicts cloud-based IoT apps BEACON CONTROL and SIMPLE CONTROL installed by a user from their mobile device and with access to the user's relevant IoT devices.

Current IoT platforms use capabilities [36] to describe app functionality and request access control and authorization decisions from users. Unlike permissions, capability schemes are not designed for security, but rather for device functionality. A smart lighting application, for example, would have capabilities to read or control the light switch, light level, and battery level. Due to complexity, capabilities in home automation platforms are often coarse grained. One capability might allow an app to check several device attributes (status variables) or issue a variety of commands. This functionality-oriented design creates potential privacy risks, as granting an app
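At its core, flagging an overprivileged app reduces to a set comparison between the capabilities the app requests and those its user-facing description implies. The capability names below are hypothetical stand-ins for SmartThings-style capabilities, and the code-analysis/NLP pipeline that would produce the two sets is out of scope for this sketch.

```python
def overprivileged(requested, implied_by_description):
    """Return the capabilities an app requests beyond what its user-facing
    description implies; a non-empty result flags the app for review."""
    return set(requested) - set(implied_by_description)

# Hypothetical smart-lighting app: its description explains reading or
# controlling the light switch, light level, and battery level, but the
# code also requests the door-lock capability.
extra = overprivileged(
    requested={"switch", "switchLevel", "battery", "lock"},
    implied_by_description={"switch", "switchLevel", "battery"},
)
```

Here `extra` would contain the unexplained `lock` capability, the kind of mismatch a marketplace vetting step could surface to the user.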
[Architecture diagram residue: Content Inspector (NLP), Program Analyzer, Consistency Checker, Policy Enforcer, Authorization Security Policy creator.]

[Plot residue: latency in milliseconds (y-axis, 250–300).]

Both SmartAuth and Samsung SmartThings participants focused on two factors in common: functionality and ease
[12] Demiris, G. Privacy and social implications of distinct sensing approaches to implementing smart homes for older adults. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society (2009), IEEE, pp. 4311–4314.

[13] Egelman, S., Jain, S., Portnoff, R. S., Liao, K., Consolvo, S., and Wagner, D. Are you ready to lock? In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (2014).

[14] Egelman, S., Kannavara, R., and Chow, R. Is this thing on?: Crowdsourcing privacy indicators for ubiquitous sensing platforms. In 33rd Annual ACM Conference on Human Factors in Computing Systems (2015), ACM, pp. 1669–1678.

[15] Fawaz, K., Kim, K.-H., and Shin, K. G. Protecting privacy of BLE device users. In 25th USENIX Security Symposium (2016), USENIX Association.

[16] Felt, A. P., Egelman, S., Finifter, M., Akhawe, D., and Wagner, D. How to ask for permission. In HotSec (2012).

[17] Felt, A. P., Egelman, S., and Wagner, D. I've got 99 problems, but vibration ain't one: A survey of smartphone users' concerns. In 2nd ACM Workshop on Security and Privacy in Smartphones and Mobile Devices (2012), ACM, pp. 33–44.

[18] Felt, A. P., Ha, E., Egelman, S., Haney, A., Chin, E., and Wagner, D. Android permissions: User attention, comprehension, and behavior. In 8th Symposium on Usable Privacy and Security (SOUPS'12) (2012).

[19] Fernandes, E., Jung, J., and Prakash, A. Security analysis of emerging smart home applications. In 36th IEEE Symposium on Security and Privacy (2016).

[30] Jia, Y. J., Chen, Q. A., Wang, S., Rahmati, A., Fernandes, E., Mao, Z. M., and Prakash, A. ContexIoT: Towards providing contextual integrity to appified IoT platforms. In Network and Distributed System Security Symposium (NDSS'17) (2017).

[31] Juniper Research. Smart Home Revenues to Reach $100 Billion by 2020. https://www.juniperresearch.com/press/press-releases/smart-home-revenues-to-reach-$100-billion-by-2020, 2008–2016.

[32] Kelley, P. G., Consolvo, S., Cranor, L. F., Jung, J., Sadeh, N., and Wetherall, D. A conundrum of permissions: Installing applications on an Android smartphone. In International Conference on Financial Cryptography and Data Security (FC'12) (2012).

[33] Kelley, P. G., Cranor, L. F., and Sadeh, N. Privacy as part of the app decision-making process. In ACM SIGCHI Conference on Human Factors in Computing Systems (CHI'13) (2013).

[34] Kim, T. H.-J., Bauer, L., Newsome, J., Perrig, A., and Walker, J. Challenges in access right assignment for secure home networks. In HotSec (2010).

[35] Kim, T. H.-J., Bauer, L., Newsome, J., Perrig, A., and Walker, J. Access right assignment mechanisms for secure home networks. Journal of Communications and Networks (2011).

[36] Levy, H. M. Capability-Based Computer Systems. Digital Press, 2014.

[37] Liu, B., Lin, J., and Sadeh, N. Reconciling mobile app privacy and usability on smartphones: Could user privacy profiles help? In 23rd International Conference on World Wide Web (2014), ACM, pp. 201–212.
capabilities-reference.html
We assume the presence of a trusted path for users User Initiation Every operation on privacy-sensitive
to receive unforgeable communications from the system sensors must be initiated by an authentic user input event.
and provide unforgeable user input events to the system. User Authorization Each operation on privacy-
We assume that trusted paths are protected by mandatory sensitive sensors requested by each application must be
access control [49, 50] as well, which ensures that only authorized by the user explicitly prior to that operation
trusted software can receive input events from trusted being performed.
system input devices to guarantee the authenticity (i.e., prevent forgery) of user input events.

Trusted path communication from the system to the user uses a trusted display area of the user interface, which we assume is available to display messages for users; applications do not have any control over the content displayed in this area, thus they cannot interfere with system communications or overlay content over the trusted display area.

These assumptions are in line with existing research that addresses the problem of designing and building trusted paths and trusted user interfaces for browsers [55], X window systems [47, 56], and mobile operating systems [26, 27]. The design of our prototype leverages mechanisms provided by the Android operating system satisfying the above assumptions, as described in Section 7.

Threat Model - We assume that applications may choose to present any user interface to users to obtain user input events, and applications may choose any operation requests upon any sensors. Applications may deploy user interfaces that are purposely designed to be similar to that of another application, and replay its user interface when another application is running to trick the user into interacting with such an interface to "steal" the resulting user input event. Applications may also submit any operation request at any time while running, even without a corresponding user input event. Applications may change the operation requests they make in response to user input events.

5 Research Overview

Our objective is to develop an authorization mechanism that eliminates ambiguity between user input events and the operations granted to untrusted applications via those events, while satisfying the following security, usability, and compatibility properties:

Limited User Effort - Ideally, only one explicit user authorization request should be necessary for any benign application to perform an operation targeting privacy-sensitive sensors while satisfying the properties above.

Application Compatibility - No application code should require modification to satisfy the properties above.

We aim to control access to privacy-sensitive sensors that operate in discrete time intervals initiated by the user, such as the cameras, microphone, and screen buffers. We believe the control of access to continuous sensors, such as GPS, gyroscope, and accelerometer, requires a different approach [34], but we leave this investigation as future work.

To achieve these objectives, we design the AWare authorization framework. The main insight of the AWare design is to extend the notion of an authorization tuple (i.e., subject, resource, operation), used to determine whether to authorize an application's operation request, to include the user interface configuration used to elicit the user input event. We call these extended authorization tuples operation bindings, and users explicitly authorize operation bindings before applications are allowed to access sensors. An operation binding may be reused to authorize subsequent operations as long as the application uses the same user interface configuration to elicit input events to request the same operation.

Approach Overview. Figure 6 summarizes the steps taken by AWare to authorize applications' operation requests targeting privacy-sensitive sensors. In a typical workflow, an application starts by specifying a set of user interface configurations, such as widgets and window features, to the trusted software (step 1) in charge of rendering such widgets within windows to elicit user input (step 2). An authentic user interaction with the application's widgets in a user interface configuration generates user input events (step 3), which are captured
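The operation-binding idea described above can be illustrated with a small sketch. All names here (OperationBinding, BindingAuthorizer, ask_user) are hypothetical, not AWare's actual implementation: the point is that authorizations are cached per (subject, resource, operation, UI configuration) rather than per classic (subject, resource, operation) tuple, so a changed user interface forces a fresh user decision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperationBinding:
    """Extended authorization tuple: the classic (subject, resource,
    operation) plus the UI configuration used to elicit the input event."""
    subject: str      # requesting application identity
    resource: str     # e.g. "microphone", "camera"
    operation: str    # e.g. "record", "capture"
    ui_config: str    # fingerprint of the widget/window features shown

class BindingAuthorizer:
    """Cache of user-approved operation bindings (illustrative only)."""
    def __init__(self):
        self._authorized = set()

    def request(self, binding, ask_user):
        # A binding is reused only when the app presents the *same*
        # UI configuration for the same operation on the same sensor.
        if binding in self._authorized:
            return True
        if ask_user(binding):  # one explicit user authorization
            self._authorized.add(binding)
            return True
        return False
```

Under this sketch, an app that keeps its widget and window features unchanged pays only one authorization prompt, while a changed `ui_config` yields a distinct binding and therefore a new prompt.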
8 AWare Evaluation

We investigated the following research questions.

To what degree does the AWare operation binding concept assist users in avoiding attacks? We performed a laboratory-based user study and found that the operation binding enforced by AWare significantly raised the bar for malicious apps trying to trick users into authorizing unintended operations, lowering the average attack success rate from 85% down to 7% with AWare.

What is the decision overhead imposed on users due to per-configuration access control? We performed a field-based user study and found that the number of decisions imposed on users by AWare remained confined to fewer than four decisions per app, on average, for the study period.

How many existing apps malfunction due to the integration of AWare? How many operations from legitimate apps are incorrectly blocked by AWare (i.e., false positives)? We used a well-known compatibility test suite to evaluate the compatibility of AWare with existing apps and found that, out of 1,000 apps analyzed, only five of them malfunctioned due to attempted operations that AWare blocked as potentially malicious. These malfunctioning instances were resolved by features developed in subsequent versions of the AWare prototype.

What is the performance overhead imposed by AWare for the operation binding construction and enforcement? We used a well-known software exerciser to measure the performance overhead imposed by AWare. We found that AWare introduced a negligible overhead, on the order of microseconds, that is likely not noticeable by users.

8.1 Preliminaries for the User Studies

We designed our user studies following suggested practices for human subject studies in security to avoid common pitfalls in conducting and writing about security and privacy human subject research [43]. Participants were informed that the study was about mobile systems security, with a focus on audio and video, and that the involved researchers study operating systems security. An Institutional Review Board (IRB) approval was obtained from our institution. We recruited user study participants via local mailing lists, Craigslist, and local groups on Facebook, and compensated them with a $10 gift card. We excluded friends and acquaintances from participating in the studies to avoid acquiescence bias. Participants were given the option to withdraw their consent to participate at any time after the purpose of the study was revealed.

8.1.1 Laboratory-Based User Study

We performed a laboratory-based user study to evaluate the effectiveness of AWare in supporting users in avoiding attacks by malicious apps and compared it with alternative approaches.

We divided the participants into six groups. Participants in Group1 interacted with a stock Android OS using install-time permissions. Participants in Group2 interacted with a stock Android OS using first-use permissions. Participants in Group3 interacted with a modified version of the Android OS implementing input-driven access control, which binds user input events to the operation requested by an app but does not prove the app's identity to the user. Participants in Group4 interacted with a modified version of the Android OS implementing first-use permissions and a security indicator that informs the users about the origin of the app (i.e., developer ID [6]). Participants in Group5 interacted with a modified version of the Android OS implementing the use of access control gadgets [41], including basic user interface configuration checks (i.e., no misleading text, the UI background and text must preserve contrast, no overlay of UI elements, and user events must occur in the correct location at the correct time [39]) and timing checks for implicit authorizations. Lastly, participants in Group6 interacted with a modified version of the Android OS integrating the AWare authorization framework.

Experimental Procedures: Before starting the experiment, all participants were informed that attacks targeting sensitive audio and video data were possible during the interaction with apps involved in the experimental tasks, but none of the participants were aware of the attack source. Further, the order of the experimental tasks was randomized to avoid ordering bias. All the instructions to perform the experimental tasks were provided to participants via a handout at the beginning of the user study. Participants were given the freedom to ignore task steps if they were suspicious about the resulting app activities.

We used two apps: a well-known voice note recording app called Google Keep, and a testing app (developed in our research laboratory) called SimpleFilters, which provides useful photo/video filtering functionality. However, SimpleFilters also attempts adversarial use of privacy-sensitive sensors, such as the microphone and the camera. We explicitly asked the participants to install these apps on the testing platforms10.

Before starting the experiment tasks, we asked the participants to familiarize themselves with Google Keep, by recording a voice note, and with SimpleFilters, by taking a picture and recording a video with the smartphone's front camera. During this phase, participants were presented with authorization requests at first use of any of the privacy-sensitive sensors.

All the user study participants in Groups1-6 were asked to perform the four experimental tasks reported in Table 1. We designed these tasks to test the four types of attacks discussed in Section 3.1. During the experiment, the researchers recorded whether the participants commented on noticing any suspicious activity in the apps' user interface, while a background service logged whether the designed attacks took place.

Experimental Results: 90 subjects participated in and completed our experimental tasks. We randomly assigned 15 participants to each group. The last column of Table 1 summarizes the results for the four experimental tasks used in the laboratory-based user study. The third column of Table 1 reports additional authorization requests prompted only to subjects in Group6 using the AWare system. Indeed, only AWare is able to identify the change in configuration (e.g., a widget in a different activity window, or a widget linked to a different operation or a different privacy-sensitive sensor) under which the experimental applications attempt access to the privacy-sensitive sensors (i.e., microphone and cameras).

Overall, we found that all the operation binding components used by AWare were useful in helping the users avoid the four types of attacks. Moreover, AWare conspicuously outperformed alternative approaches, while each experimental task revealed interesting facts.

In particular, the analysis of the subjects' responses to TASK 1 revealed that the operation performed by the app was not visible to users in the alternative approaches, thus leading them into making mistakes. The only exceptions were the subjects from Group5 (AC Gadgets), because the SimpleFilters app was not in control of the requested operation due to the use of a system-defined access control gadget. Furthermore, no subject from Group6 (AWare) authorized SimpleFilters to access the microphone. Thus, the binding request clearly identifying the operation requested by the app aided them in avoiding being tricked into granting an unintended operation.

The analysis of the subjects' responses to TASK2 and TASK3 revealed that the users were successfully tricked either by switching the user interface configuration within which a widget is presented, or by changing the widget presented within the same configuration, thus leading them into making mistakes. Interestingly, there was no noticeable improvement for subjects in Group5 (AC Gadgets), where the system put in place some basic user interface configuration checks [39] for the presentation of the access control gadgets. The reason was that such basic checks were insufficient to identify the user interface modifications made by the malicious app when performing the attacks described in Table 1. Furthermore, one subject from Group6 (AWare) had mistakenly authorized SimpleFilters to carry out an unintended operation

10 SimpleFilters provides interesting features to convince the users to install it and grant the required permissions.
the AWare authorization framework. During this period,

13 The Absolute 1,000 Top Apps for Android. http://bestapps
monkey.html
15 https://github.com/gxp18/AWare
[5] M. Backes, S. Bugiel, S. Gerling, and P. von Styp-Rekowsky. Android Security Framework: Extensible multi-layered access control on Android. In Proceedings of the 30th Annual Computer Security Applications Conference, pages 46-55. ACM, 2014.

[6] A. Bianchi, J. Corbetta, L. Invernizzi, Y. Fratantonio, C. Kruegel, and G. Vigna. What the app is that? Deception and countermeasures in the Android user interface. In 2015 IEEE Symposium on Security and Privacy, pages 931-948, May 2015.

[7] F. Chang, A. Itzkovitz, and V. Karamcheti. User-level resource-constrained sandboxing. In Proceedings of the 4th USENIX Windows Systems Symposium, volume 91, Seattle, WA, 2000.

[8] P. T. Cummings, D. Fullan, M. Goldstien, M. Gosse, J. Picciotto, J. L. Woodward, and J. Wynn. Compartmented model workstation: Results through prototyping. In 1987 IEEE Symposium on Security and Privacy, pages 2-2. IEEE, 1987.

[9] R. Dhamija, J. D. Tygar, and M. Hearst. Why phishing works. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '06, pages 581-590, New York, NY, USA, 2006. ACM.

[10] A. P. Felt, E. Ha, S. Egelman, A. Haney, E. Chin, and D. Wagner. Android permissions: User attention, comprehension, and behavior. In Proceedings of the Eighth Symposium on Usable Privacy and Security, SOUPS '12, pages 3:1-3:14, New York, NY, USA, 2012. ACM.

[11] A. P. Felt, E. Ha, S. Egelman, A. Haney, E. Chin, and D. Wagner. Android permissions: User attention, comprehension, and behavior. In Proceedings of the Eighth Symposium on Usable Privacy and Security, SOUPS '12, pages 3:1-3:14, New York, NY, USA, 2012. ACM.

[23] J. Howell and S. Schechter. What you see is what they get: Protecting users from unwanted use of microphones, cameras, and other sensors. In Web 2.0 Security and Privacy. IEEE, May 2010.

[24] J. Jung, S. Han, and D. Wetherall. Short paper: Enhancing mobile application permissions with runtime feedback and constraints. In Proceedings of the Second ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, pages 45-50. ACM, 2012.

[25] P. M. Kelly, T. M. Cannon, and D. R. Hush. Query by image example: The comparison algorithm for navigating digital image databases (CANDID) approach. In IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology, pages 238-248. International Society for Optics and Photonics, 1995.

[26] W. Li, M. Ma, J. Han, Y. Xia, B. Zang, C.-K. Chu, and T. Li. Building trusted path on untrusted device drivers for mobile devices. In Proceedings of the 5th Asia-Pacific Workshop on Systems, page 8. ACM, 2014.

[27] W. Li, M. Ma, J. Han, Y. Xia, B. Zang, C.-K. Chu, and T. Li. Building trusted path on untrusted device drivers for mobile devices. In Proceedings of the 5th Asia-Pacific Workshop on Systems, page 8. ACM, 2014.

[28] J. Lin, S. Amini, J. I. Hong, N. Sadeh, J. Lindqvist, and J. Zhang. Expectation and purpose: Understanding users' mental models of mobile app privacy through crowdsourcing. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pages 501-510. ACM, 2012.

[29] T. Luo, X. Jin, A. Ananthanarayanan, and W. Du. Touchjacking attacks on web in Android, iOS, and Windows Phone. In International Symposium on Foundations and Practice of Security, pages 227-243. Springer, 2012.
[31] B. McCarty. SELinux: NSA's Open Source Security Enhanced Linux. O'Reilly Media, Inc., 2004.

[32] D. S. McCrickard and C. M. Chewar. Attuning notification design to user goals and attention costs. Commun. ACM, 46(3):67-72, Mar. 2003.

[33] K. Onarlioglu, W. Robertson, and E. Kirda. Overhaul: Input-driven access control for better privacy on traditional operating systems. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 443-454, June 2016.

[34] G. Petracca, L. M. Marvel, A. Swami, and T. Jaeger. Agility maneuvers to mitigate inference attacks on sensed location data. In Military Communications Conference, MILCOM 2016, pages 259-264. IEEE, 2016.

[35] G. Petracca, Y. Sun, T. Jaeger, and A. Atamli. AuDroid: Preventing attacks on audio channels in mobile devices. In Proceedings of the 31st Annual Computer Security Applications Conference, pages 181-190. ACM, 2015.

[36] V. Prevelakis and D. Spinellis. Sandboxing applications. In USENIX Annual Technical Conference, FREENIX Track, pages 119-126, 2001.

[37] U. U. Rehman, W. A. Khan, N. A. Saqib, and M. Kaleem. On detection and prevention of clickjacking attack for OSNs. In Frontiers of Information Technology (FIT), 2013 11th International Conference on, pages 160-165. IEEE, 2013.

[38] C. Ren, Y. Zhang, H. Xue, T. Wei, and P. Liu. Towards discovering and understanding task hijacking in Android. In 24th USENIX Security Symposium (USENIX Security 15), pages 945-959, Washington, D.C., Aug. 2015. USENIX Association.

[39] T. Ringer, D. Grossman, and F. Roesner. Audacious: User-driven access control with unmodified operating systems. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 204-216, New York, NY, USA, 2016. ACM.

[40] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120-126, 1978.

[41] F. Roesner, T. Kohno, A. Moshchuk, B. Parno, H. J. Wang, and C. Cowan. User-driven access control: Rethinking permission granting in modern operating systems. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, SP '12, pages 224-238, Washington, DC, USA, 2012. IEEE Computer Society.

[42] J. Ruderman. The same origin policy, 2001.

[43] S. Schechter. Common pitfalls in writing about security and privacy human subjects experiments, and how to avoid them. Microsoft Technical Report, January 2013.

[44] S. E. Schechter, R. Dhamija, A. Ozment, and I. Fischer. The emperor's new security indicators. In 2007 IEEE Symposium on Security and Privacy (SP '07), pages 51-65, May 2007.

[45] R. Schlegel, K. Zhang, X. Zhou, M. Intwala, A. Kapadia, and X. Wang. Soundcomber: A stealthy and context-aware sound trojan for smartphones. In NDSS. The Internet Society, 2011.

[47] J. S. Shapiro, J. Vanderburgh, E. Northup, and D. Chizmadia. Design of the EROS trusted window system. In Proceedings of the 13th USENIX Security Symposium, pages 12-12. USENIX Association, 2004.

[48] M. Sheppard. Smartphone apps, permissions and privacy. Office of the Privacy Commissioner of Canada, 2013.

[49] S. Smalley and R. Craig. Security Enhanced (SE) Android: Bringing flexible MAC to Android. In NDSS, volume 310, pages 20-38, 2013.

[50] S. Smalley, C. Vance, and W. Salamon. Implementing SELinux as a Linux security module. NAI Labs Report, 1(43):139, 2001.

[51] R. Templeman, Z. Rahman, D. Crandall, and A. Kapadia. PlaceRaider: Virtual theft in physical spaces with smartphones. In The 20th Annual Network and Distributed System Security Symposium (NDSS), Feb 2013.

[52] G. S. Tuncay, S. Demetriou, and C. A. Gunter. Draco: A system for uniform and fine-grained access control for web code on Android. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 104-115, New York, NY, USA, 2016. ACM.

[53] T. Whalen and K. M. Inkpen. Gathering evidence: Use of visual security cues in web browsers. In Proceedings of Graphics Interface 2005, GI '05, pages 137-144, Waterloo, Ontario, Canada, 2005. Canadian Human-Computer Communications Society.

[54] P. Wijesekera, A. Baokar, A. Hosseini, S. Egelman, D. Wagner, and K. Beznosov. Android permissions remystified: A field study on contextual integrity. In 24th USENIX Security Symposium (USENIX Security 15), pages 499-514, 2015.

[55] Z. E. Ye, S. Smith, and D. Anthony. Trusted paths for browsers. ACM Transactions on Information and System Security (TISSEC), 8(2):153-186, 2005.

[56] Z. Zhou, V. D. Gligor, J. Newsome, and J. M. McCune. Building verifiable trusted path on commodity x86 computers. In 2012 IEEE Symposium on Security and Privacy, pages 616-630. IEEE, 2012.

Appendices

A Compatibility Discussion

Here, we discuss how AWare addresses special cases of applications' accesses to privacy-sensitive sensors.

Background Access: To enable background access, AWare still uses the explicit authorization mechanism via the creation of a binding request. However, as soon as the application goes into the background, any on-screen security message used to notify ongoing operations over privacy-sensitive sensors is replaced with a periodic distinctive sound or a small icon on the system status bar (Section 7.1), if the platform's screen is on, or a hardware sensor-use indicator LED when the platform's screen goes off. These periodic notifications remain active until the user explicitly terminates the background activity. Our
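The background-access notification fallback described in the appendix can be summarized as a small decision function. This is an illustrative sketch only (the function name and return labels are hypothetical, not AWare's code):

```python
def background_notifier(screen_on: bool, prefer_sound: bool = False) -> str:
    """Pick the user-notification channel while a backgrounded app keeps
    using a privacy-sensitive sensor, per the fallback policy above."""
    if screen_on:
        # The on-screen security message is unavailable once the app is in
        # the background: fall back to a periodic distinctive sound or a
        # small icon on the system status bar.
        return "periodic_sound" if prefer_sound else "status_bar_icon"
    # Screen off: only a hardware sensor-use indicator LED can still
    # reach the user.
    return "sensor_led"
```

The notification repeats until the user explicitly terminates the background activity, so each channel above is periodic rather than one-shot.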
Abstract

Sensors (e.g., light, gyroscope, accelerometer) and sensing-enabled applications on a smart device make the applications more user-friendly and efficient. However, the current permission-based sensor management systems of smart devices only focus on certain sensors, and any App can get access to other sensors by just accessing the generic sensor API. In this way, attackers can exploit these sensors in numerous ways: they can extract or leak users' sensitive information, transfer malware, or record or steal sensitive information from other nearby devices. In this paper, we propose 6thSense, a context-aware intrusion detection system which enhances the security of smart devices by observing changes in sensor data for different user tasks and creating a contextual model to distinguish benign and malicious behavior of sensors. 6thSense utilizes three different Machine Learning-based detection mechanisms (i.e., Markov Chain, Naive Bayes, and LMT) to detect malicious behavior associated with sensors. We implemented 6thSense on a sensor-rich Android smart device (i.e., a smartphone) and collected data from typical daily activities of 50 real users. Furthermore, we evaluated the performance of 6thSense against three sensor-based threats: (1) a malicious App that can be triggered via a sensor (e.g., light), (2) a malicious App that can leak information via a sensor, and (3) a malicious App that can steal data using sensors. Our extensive evaluations show that the 6thSense framework is an effective and practical approach to defeat growing sensor-based threats, with an accuracy above 96%, without compromising the normal functionality of the device. Moreover, our framework incurs minimal overhead.

1 Introduction

Smart devices such as smartphones and smartwatches have become omnipresent in every aspect of human life. Nowadays, the role of smart devices is not limited to making phone calls and messaging only. They are integrated into various applications from home security to health care to the military [18, 60]. Since smart devices seamlessly integrate the physical world with the cyber world via their sensors (e.g., light, accelerometer, gyroscope, etc.), they provide more efficient and user-friendly applications [37, 41, 85, 55, 48].

While the number of applications using different sensors [38] is increasing and new devices offer more sensors, the presence of sensors has opened novel ways to exploit smart devices [76]. Attackers can exploit the sensors in many different ways [76]: they can trigger existing malware on a device with a simple flashlight [28]; they can use a sensor (e.g., the light sensor) to leak sensitive information; and, using motion sensors such as the accelerometer and gyroscope, attackers can record or steal sensitive information from other nearby devices (e.g., computers, keyboards) or people [10, 87, 26, 42]. They can even transfer a specific malware using sensors as a communication channel [76]. Such sensor-based threats become more serious with the rapid growth of Apps utilizing many sensors [6, 2].

In fact, these sensor-based threats highlight the flaws of existing sensor management systems used by smart devices. Specifically, the Android sensor management system relies on permission-based access control, which considers only a few sensors (i.e., microphone, camera, and GPS)1. Android asks for access permission (i.e., with a list of permissions) only while an App is being installed for the first time. Once this permission is granted, the user has no control over how the listed sensors and other sensors (not listed) will be used by the specific App. Moreover, using some sensors is not considered a violation of security and privacy in Android. For instance, any App is permitted to access motion sensors by just accessing the sensor API. Access to motion sensors is not controlled in Android.

1 iOS, Windows, and Blackberry also have permission-based sensor management systems. In this work, we focus on Android.
\[ p(X \mid c) = \prod_{i=1}^{n} p(X_i \mid c) \tag{3} \]
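Equation (3) multiplies one conditional factor per sensor, so it is usually evaluated in log space to avoid numerical underflow. A minimal sketch for binary sensor conditions, with toy probabilities rather than 6thSense's trained values:

```python
import math

def log_likelihood(sample, cond_probs):
    """log p(X|c) = sum_i log p(X_i|c), the log form of Eq. (3).

    sample: per-sensor conditions X_i (1 = value changing, 0 = static);
    cond_probs: learned p(X_i = 1 | c) for one activity class c."""
    total = 0.0
    for x, p in zip(sample, cond_probs):
        total += math.log(p if x == 1 else 1.0 - p)
    return total
```

A session would then be assigned to the activity with the largest log-likelihood; if even the best value stays below a threshold, the session can be flagged as malicious, which is the decision rule used later in Section 7.5.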
consider the condition of the sensors. 6thSense observes the data collected by the aforementioned App and determines whether the condition of the sensors is changing or not. If the sensor value is changing from the previous value in time, 6thSense represents the sensor condition as 1, and 0 otherwise. The logic state information collected from the sensors needs to be reorganized as well, since these data are merged with the values collected from the other sensors to create an input matrix. The sampling frequency of the logical state detection is 0.2 Hz, which means that every five seconds the App generates one session of the dataset. We consider the condition of the sensors to be the same over this time period and organize the data accordingly. The reorganized data generated from the aforementioned App are then merged to create the training matrices.

6.3 Data Analysis Phase

In the third and final phase, 6thSense uses the different machine learning-based detection techniques introduced in the previous section to analyze the data matrices generated in the previous phase.

For the Markov Chain-based detection, we use 75% of the collected data to train 6thSense and generate the transition matrix. This transition matrix is used to determine whether the transition from one state to another is appropriate or not. Here, a state refers to the generic representation of all the sensors' conditions on a device. For testing purposes we have two different data sets: basic activities, or the trusted model, and malicious activities, or the threat model. The trusted model consists of 25% of the collected data for different user activities. We test the trusted model to ensure the accuracy of the 6thSense framework in detecting benign activities. The threat model is built from performing the attack scenarios mentioned in Section 4. We calculate the probability of a transition occurring between two states at a given time and accumulate the total probability to distinguish between normal and malicious activities.

To implement the Naive Bayes-based detection technique, we use the training sessions to define different user activities. In 6thSense, we have nine typical user activities in total, as listed in Table 2. We use ground-truth user data to define these activities. Using the theoretical foundation explained in Section 5, we calculate the probability that a test session belongs to any of these defined activities. As we consider one second of data in each computational cycle, we calculate the total probability up to a predefined configurable time interval (in this case, five minutes). This calculated probability is used to distinguish malicious activities from normal activities. If the computed probability for all the known benign activities is not over a predefined threshold, the session is detected as a malicious activity.

For the other alternative machine-learning-based detection techniques, we used WEKA, a data mining tool which offers data analysis using different machine learning approaches [64, 27]. Basically, WEKA is a collection of machine learning algorithms developed at the University of Waikato, New Zealand, which can be directly applied to a dataset or can be integrated with a framework using the Java platform [56]. WEKA offers different types of classifiers to analyze and build predictive models from a given dataset. We use the 10-fold cross-validation method to train and test 6thSense with different ML techniques in Section 7.

7 Performance Evaluation of 6thSense

In this section, we evaluate the efficiency of the proposed context-aware IDS framework, 6thSense, in detecting sensor-based threats on a smart device. We test 6thSense with the data collected from different users for the benign activities and adversary model described in Section 4. As discussed earlier, 6thSense considers three sensor-based threats: (1) a malicious App that can be triggered via light or motion sensors, (2) a malicious App that can leak information via the audio sensor, and (3) a malicious App that steals data via the camera. Furthermore, we measured the performance impact of 6thSense on the device and present detailed results for the efficiency of
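The Markov Chain-based training described in Section 6.3 amounts to estimating a transition matrix from observed state sequences and then scoring each test transition against it. A toy sketch with hypothetical state labels, not the actual 6thSense sensor-state encoding:

```python
from collections import defaultdict

def train_transition_matrix(sequences):
    """Estimate P(next_state | state) from training state sequences
    (e.g., the 75% training split mentioned above)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # Normalize each row of counts into conditional probabilities.
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def transition_prob(matrix, a, b):
    """Probability of moving from state a to state b (0 if never seen)."""
    return matrix.get(a, {}).get(b, 0.0)
```

Accumulating `transition_prob` over a test session and comparing the total against a threshold then separates the trusted model from the threat model, as the section describes.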
score reaches a peak value with the threshold of three consecutive malicious states on the device. From Figure 3, we can see that the FP rate remains zero while the TP rate increases at the beginning. The highest TP rate without introducing any FP case is over 98%. Beyond 98%, some FP cases are introduced into the system, which is considered a risk to the system. In summary, Markov Chain-based detection in 6thSense can achieve accuracy over 98% without introducing any FP cases.

7.5 Evaluation of Naive Bayes-based Detection

In the Naive Bayes-based detection technique, 6thSense calculates the probability of a session to match it with each activity defined in Section 7.1. Since all the activities are benign and there is no malicious activity (i.e., ground-truth data), 6thSense checks the calculated probability of an activity from the dataset against a threshold to determine the correct activity. If there is no match for a certain sensor condition with any of the activities, 6thSense detects the session as malicious. Table 4 shows the evaluation results.

For a threshold value of 55%, the FN rate is zero. However, the FPR is too high, which lowers the F-score of the framework. For a threshold of 60%, the FPR decreases while the FNR is still zero. In this case, accuracy is 95% and the F-score is 82%. If the threshold is increased over 65%, it reduces the recall rate, which affects accuracy and F-score. The evaluation indicates that the threshold value of 60% provides an accuracy of 95% and an F-score of 82%.

From Figure 4, one can observe the relation between the FPR and TPR of Naive Bayes-based detection. For FPR larger than 0.3, the TPR becomes 1.

Figure 4: ROC curve of Naive Bayes-based detection.

7.6 Evaluation of Alternative Detection Techniques

For the alternative detection techniques, we used other supervised machine learning techniques to train the 6thSense framework. For this, we utilized WEKA, which provides three types of analysis - split percentage analysis, cross-validation analysis, and supplied test set analysis. We chose 10-fold cross-validation analysis to ensure that all the data was used for both training and test. Thus, the
Table 5: Performance of other different machine learning based-detection techniques tested in 6thSense.
error rate of the predictive model would be minimized LMT performs better than the others.
in the cross validation. In Table 5, a detailed evalua-
tion of different machine learning algorithms is given for Performance Markov Naive
LMT
6thSense. For Rule Based Learning, 6thSense has the Metrics Chain Bayes
best result for the PART algorithm, which has an accuracy of 0.99 and an F-score of 0.7899. On the other hand, for Regression Analysis, we use the logistic function, which has a high FPR (0.7222) and a lower F-score (0.4348). The Multilayer Perceptron algorithm gives an accuracy of 0.9991 and an F-score of 0.8196, which is higher than the previously mentioned algorithms. However, its FPR is much higher (0.3056), which is a limitation for intrusion detection frameworks in general. Compared to these algorithms, Linear Model Tree (LMT) gives better results in detecting sensor-based attacks. This evaluation indicates that LMT provides an accuracy of 0.9997 and an F-score of 0.964.

                      Markov Chain   Naive Bayes   LMT
Recall rate           0.98           1             0.9998
False Negative Rate   0.02           0             0.0002
Precision rate        1              0.7           0.9306
False positive rate   0              0.3           0.0694
Accuracy              0.9833         0.9492        0.9997
F-Score               0.9899         0.8235        0.964
auPRC                 0.947          0.686         0.91

Table 6: Comparison of the different machine-learning-based approaches proposed for 6thSense (i.e., Markov Chain, Naive Bayes, and LMT).

7.7 Comparison

In this subsection, we compare the different machine-learning-based detection approaches tested in 6thSense for defending against sensor-based threats. For each approach, we select the best possible case and report its performance metrics in Table 6. For Markov Chain-based detection, we choose three consecutive malicious states as valid device conditions. In the Naive Bayes approach, the best performance is observed for a threshold of 60%. Among the other machine learning algorithms tested via WEKA, we choose LMT, as it gives the highest accuracy. These results indicate that LMT provides the highest accuracy and F-score of the three approaches.

The Naive Bayes model, on the other hand, displays a higher recall rate and lower FNR than the other approaches. However, the presence of FPR in an IDS is a serious security threat to the system, since FPR refers to a malicious attack that is identified as a valid state, which threatens user privacy and the security of the device. Both Markov Chain and LMT have lower FPRs. In summary, considering the F-score and accuracy of all these approaches, we conclude that

7.8 Performance Overhead

As previously mentioned, 6thSense collects data on an Android device from different sensors (permission-imposed and no-permission-imposed sensors). In this subsection, we measure the performance overhead introduced by 6thSense on the tested Android device in terms of CPU usage, RAM usage, file size, and power consumption; Table 7 gives the details. For no-permission-imposed sensors, the data collection phase logs all the values within a time interval, which causes increased RAM, CPU, and disk usage compared to permission-imposed or logic-oriented sensors. For power consumption, we observe that no-permission-imposed sensors use more power than permission-imposed sensors. This is mainly because logic-oriented sensors have a lower sampling rate, which reduces their resource needs.

The overall performance overhead is as low as 4% of CPU, less than 40 MB of RAM, and less than 15 MB of disk space. Thus, the overhead is minimal and acceptable for an IDS on current smartphones. One of the main concerns of implementing 6thSense on an Android device is the power consumption. Table 7 also shows the power consumption of the Android app used in 6thSense. For one minute, 6thSense consumes 16.62 mW of power, which increases
Abstract

In this work, we demonstrate a novel attack in SDN networks, Persona Hijacking, that breaks the bindings of all layers of the networking stack and fools the network infrastructure into believing that the attacker is the legitimate owner of the victim's identifiers, which significantly increases persistence. We then present a defense, SECUREBINDER, that prevents identifier binding attacks at all layers of the network by leveraging SDN's data and control plane separation, global network view, and programmatic control of the network, while building upon IEEE 802.1x as a root of trust. To evaluate its effectiveness, we both implement it in a testbed and use model checking to verify the guarantees it provides.

1 Introduction

Modern networks use various identifiers to specify entities at network stack layers. These identifiers include addresses, like IP addresses or MAC addresses, and domain names, as well as less explicitly known values such as the switch identifier and the physical switch port to which a machine is connected. Identifiers are used in modern networks not only to establish traffic flow and deliver packets, but also to enforce security policies, such as in firewalls or network access control systems [3]. In order to achieve proper operation and security guarantees, network infrastructure devices (e.g., switches, network servers, and SDN controllers) implicitly or explicitly associate various identifiers of the same entity with each other in a process we call identifier binding. For example, when a host acquires an IP address from a DHCP server, the server binds that IP address to the host's MAC address; when an ARP reply is sent in response to an ARP request, the source host binds the IP address in the ARP request to the MAC address in the ARP reply.

Given the importance of these identifier bindings, it is not surprising that numerous attacks against them have been developed, including DNS spoofing [48], ARP poisoning [29], DHCP forgery, and host location hijacking [24]. These attacks are facilitated by several network design characteristics: (1) reliance on insecure protocols that use techniques such as broadcast for requests and responses without any authentication mechanisms, (2) allowing binding changes without considering the network-wide impact on services relying on them, (3) allowing independent bindings across different layers without any attempt to check consistency, and (4) allowing high-level changes to identifiers that are designed and assumed to be unique. Numerous defenses have also been proposed to prevent identifier binding attacks of various types in traditional networks [18, 48, 2].

Software-Defined Networking (SDN) is a new networking paradigm that facilitates network management and administration by providing an interface to control network infrastructure devices (e.g., switches). In this paradigm, the system responsible for making traffic path decisions (the control plane) is separated from the switches responsible for delivering the traffic to the destination (the data plane). The SDN controller is the centralized system that manages the switches, installs forwarding rules, and presents an abstract view of the network to SDN applications. SDN provides flexibility, manageability, and programmability for network administrators. Although previous work has focused on various aspects of the intersection of security and SDNs [46, 27, 28, 45, 26, 36, 44, 5, 14, 52], there has been little work on studying identifier binding attacks and their implications in SDN systems.

In this paper, we first study identifier binding attacks in SDN systems. We show that the centralized control exacerbates the implications and consequences of weak identifier binding. This allows malicious hosts to poi-

∗ This work is sponsored by the Department of Defense under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
[Figure: Persona Hijacking attack steps, showing the attacker (IP 10.0.0.12, MAC :02, port 02), the victim (IP 10.0.0.11, MAC :03, port 03), the DHCP server (IP 10.0.0.10, MAC :01, port 01), the target server, an evil server, and the OpenFlow switch and controller, with the DHCP flood, DHCP RELEASE/DISCOVER/OFFER/ACK exchange, ARP probe and reply (TPA 10.0.0.11, SHA :03), the ping to the victim, and the resulting BLACKHOLE and Flow Poisoning steps.]

Figure 3: Timeline of the Flow Poisoning Attack
3.3 Attack Implementation

We have implemented both the IP takeover and Flow Poisoning phases as fully automated Python scripts running on an attacking end-host. We use Scapy¹ to forge a DHCP RELEASE message for the IP takeover phase and to request new IP addresses until the victim's address is offered. The Flow Poisoning phase simply consists of sending an ICMP ping with the DHCP server's MAC address to the victim as soon as the DHCP Offer is received by the attacker. The entire attack was tested in an emulated SDN environment using Mininet 2.2.1 [30], against both the ONOS and Ryu controllers. An analysis of the Floodlight and POX source code suggests that they are also vulnerable.

Because Flow Poisoning relies on a race condition, we measured the Persona Hijacking success rate over 10 trials, each of which took an average of 90.39 seconds to acquire the target IP address. For Ryu, which uses hard flow rule timeouts, the success rate was 90%. Failures corresponded to an ARP Reply that was sent by the victim to the DHCP server after the inconsistent flow rule expired. For ONOS, the Flow Poisoning phase was not

1 http://www.secdev.org/projects/scapy/

4.1 Design

The Persona Hijacking attack and other identifier attacks are possible because of several network design characteristics of identifier binding in traditional networks: (1) reliance on insecure protocols using techniques like broadcast for requests and responses without any authentication mechanisms, (2) allowing the changing of bindings without considering the network-wide impact on services relying on them, (3) allowing independent bindings across different layers without any attempt to check consistency, and (4) allowing high-level changes to identifiers that are designed and assumed to be unique. We design SECUREBINDER as a comprehensive solution to identifier binding attacks and the above attack-facilitating factors, not merely as another defense against a specific attack. In doing so, we dramatically reduce the number of network components that must be trusted, as shown in Figure 4.

SECUREBINDER leverages SDN and IEEE 802.1x to target the facilitating factors as follows:

• It leverages SDN functionality to separate the identifier binding control traffic from the regular data plane, isolating it from an attacker, and creating a binding mediator which can perform additional security checks on
When SECUREBINDER was not deployed, many invariant violations were found corresponding to existing

[Figure: trusted components (Active Directory, Proxy ARP, DHCP server, forwarding); plot of flow rules versus switch ports.]

Figure 6: Number of flow rules needed in each switch for SECUREBINDER, as a function of the number of switch ports. The dashed black line marks the minimum TCAM rule slots in a modern SDN switch.

…rules required per switch by SECUREBINDER as:

    26 + 13 × edge ports + internal ports

The first term relates to static flow rules installed globally in each switch to send 802.1x, ARP, and DHCP traffic to the controller and block DNS, mDNS, and Active Directory by default. The second term describes the flow rules installed for each edge switch port to enable egress filtering and allow non-spoofed Active Directory and DNS traffic destined for the legitimate servers. The final term includes the flow rules that are inserted in table 0 for internal network ports to send traffic directly to the normal forwarding rules in table 1.

We plot this equation for both edge and core switches with between 1 and 100 ports in Figure 6. We assume edge switches have one port connected to the core network, with all other ports connected to edge devices, while core switches are connected only to other switches. Keep in mind that most edge switches are in the 24–48 port range, with the very high degree switches in the core of the network. For a 48-port edge switch, SECUREBINDER would require 638 flow rules.

Determining the number of flow rules supported by modern SDN switches is surprisingly challenging. The number of TCAM slots in current SDN switches of about 48 ports varies from around 512 to 8,192 [12, 43, 7, 21], and many vendors claim to be able to support 65,536 flow rules or more [7, 21]. We use 2,048 TCAM entries (available on many switches) as a lower bound and denote it with a black dashed line in Figure 6. For this lower bound, SECUREBINDER would require 31% of the rules in a 48-port edge switch, 16% in a 24-port edge switch, and only 4% in a 48-port core switch. For edge switches, this is a significant, but still practical, overhead; assuming the devices connected to such a switch communicate evenly, each can talk to 29 other devices simultaneously

6 Limitations and Discussion

Although the Persona Hijacking attack is extremely powerful, it does have some limitations, as well as several possible partial mitigations that may not prevent the attack but can alert an attentive defender.

DHCP. Principal among these is that the target must be using DHCP for Persona Hijacking to be applicable. Another limitation is that DHCP starvation, despite being extremely transient, is easily detectable and likely to be monitored, because it can also indicate network malfunction. In a similar way, the large number of DHCP DISCOVERs needed to launch the attack would be readily noticed by a network anomaly detector. If such monitoring systems are deployed, Persona Hijacking would quickly be brought to the attention of the network administrators. However, even if detected, mitigation would require the involvement of a human operator, probably on a time scale of tens of minutes to hours.

Another factor that can complicate a Persona Hijacking attack is the use of static DHCP leases fixing IP addresses to specific MAC addresses. If the network protects against MAC spoofing, this will completely prevent Persona Hijacking. However, we are unaware of any SDN controller implementing any form of MAC spoofing protection. As a result, the attacker is free to launch the Persona Hijacking attack by spoofing DHCP DISCOVERs with the target's MAC address at a different network location. Interestingly, in this case DHCP starvation is not required.

A number of security features commonly found in traditional switches (e.g., port security [9] and DHCP snooping [8]) make it more difficult to launch a DHCP starvation attack by limiting the number of source MAC addresses originating from a single port; however, a recent starvation technique has been developed to bypass these defensive mechanisms [50]. This technique exploits the DHCP server's IP address conflict detection by answering all the probes used to check whether an address is in use, without spoofed MAC addresses.

Despite these limitations, Persona Hijacking remains a powerful attack that co-opts the network infrastructure to propagate a malicious identifier binding that will reliably last for a significant time, even in the presence of a vigilant system administrator running many monitoring tools. This level of persistence is unmatched among existing network identifier attacks and gives the attacker a reasonable window in which to achieve their goals. Fur-
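The per-switch rule-count formula, 26 + 13 × edge ports + internal ports, is easy to verify numerically. The sketch below (our own illustration, not SECUREBINDER code) evaluates it for the switch configurations discussed in the text:

```python
# Per-switch flow rules required by the formula in the text:
#   rules = 26 + 13 * edge_ports + internal_ports
def rules_needed(edge_ports: int, internal_ports: int) -> int:
    return 26 + 13 * edge_ports + internal_ports

TCAM_LOWER_BOUND = 2048  # conservative TCAM capacity assumed in the text

edge48 = rules_needed(47, 1)   # 48-port edge switch: 47 edge ports + 1 core uplink
edge24 = rules_needed(23, 1)   # 24-port edge switch: 23 edge ports + 1 core uplink
core48 = rules_needed(0, 48)   # 48-port core switch: all ports internal

print(edge48, edge24, core48)  # 638 326 74
for rules in (edge48, edge24, core48):
    print(round(100 * rules / TCAM_LOWER_BOUND), "% of the TCAM lower bound")
```

The printed percentages (31%, 16%, and 4%) match the fractions quoted in the text for the 2,048-entry lower bound.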
[4] Bruschi, D., Ornaghi, A., and Rosti, E. S-ARP: a secure address resolution protocol. In ACSAC (Dec 2003), pp. 66–74.

[5] Canini, M., Venzano, D., Peresini, P., Kostic, D., Rexford, J., et al. A NICE way to test OpenFlow applications. In NSDI (2012).

[6] Casado, M., Freedman, M. J., Pettit, J., Luo, J., McKeown, N., and Shenker, S. Ethane: taking control of the enterprise. In ACM Computer Communication Review (2007), vol. 37, ACM.

[7] Centec Networks. Centec networks – SDN/OpenFlow switch – v330, 2017. http://www.centecnetworks.com/en/SolutionList.asp?ID=42.

[24] Hong, S., Xu, L., Wang, H., and Gu, G. Poisoning network visibility in software-defined networks: New attacks and countermeasures. In NDSS (2015).

[25] Hu, H., Han, W., Ahn, G.-J., and Zhao, Z. FLOWGUARD: Building robust firewalls for software-defined networks. In Proceedings of the Third Workshop on Hot Topics in Software Defined Networking (2014), HotSDN '14, ACM, pp. 97–102.

[26] Katta, N. P., Rexford, J., and Walker, D. Logic programming for software-defined networks. In XLDI (2012).

[27] Kazemian, P., Chan, M., Zeng, H., Varghese, G., McKeown, N., and Whyte, S. Real time network policy checking using header space analysis. In NSDI (2013), pp. 99–111.
Figure 2: (a) Truth table for recovering [h(m′D)] from ([h(mD)] + mH)′, using s, and (b) an example of recovering [h(m′D)] from ([h(mD)] + mH)′.

the BS, it must hold that (a) all the ON slots indicated in s are also ON slots in [h(m′D)] + mH, and (b) the removal of mH during step 8 of HELP (see Section 4.2) leads to the decoding of [h(m′D)]. As mD follows in plaintext, the adversary can then replace mD with m′D.

We first show that if the adversary can identify the ON slots of the helper (this is equivalent to knowing mH), then it can modify the transmitted signal such that the desired value m′D is decoded at the BS. Consider the transmission of one MC ON-OFF bit bD and the superposition of an ON slot by H either during the ON or the OFF slot of the coded bD. The possible outcomes of this superposition are shown in the third column of Table 1.

Figure 6: MitM attack against the key-agreement phase of a HELP-enabled pairing protocol.

[Figure: probability of inference (PDH, PNDH, PH) for the candidate scenarios.]

…ting alone. These correspond to pI for any of the candidate scenarios. In the first experiment, we measured the detection probability as a function of the sampling window size used for computing the average RSS value for a

…faster than D. Practically, using time misalignment to distinguish the helper becomes a random guess.

[Figure: device and helper signal traces; probability of acceptance of a modified message versus the length of h(mD) (ℓ) and the number of helper ON slots (|s|).]

Figure 12: (a) Placement of D and H, (b) placement of the BS (RX1) and RX2, (c) probability of acceptance of a modified message at the BS in the absence of H, and (d) probability of acceptance of a modified message at the BS in the presence of H.
attempted to distinguish between D and H using the RSS sampling method discussed in Section 6.1.1. Also, the adversary canceled slots on which D's or H's signals were indistinguishable. Figure 12(d) shows the probability δ of accepting the adversary's modified message as a function of the number of active helper slots |s| when the message length is ℓ = 20. We observe that δ decreases drastically compared to Figure 12(c). Moreover, imperfect cancellation (pC < 1) leads to further deterioration of the adversary's performance. The results obtained support the analytical results provided in Section 5, which are computed assuming pC = 1.

Timing performance: The upper bound on the execution time of the DH protocol with HELP primarily depends on the communication time of the ON-OFF keyed message, since the rest of the messages are exchanged in the normal communication mode. Public key parameters for an EC-DH key agreement [58] can have values from 160–512 bits, depending on the security requirement. Assuming a hash length of 160 bits and a slot duration of 1 ms, the time required to transmit the HELP-protected DH public primitive varies between 0.6–1.4 s, which is acceptable.

7 Conclusion

We considered the problem of pairing two devices using in-band communications in the absence of prior shared secrets. We proposed a new PHY-layer integrity protection scheme called HELP that is resistant to signal cancellation attacks. Our scheme operates with the assistance of a helper device that has an authenticated channel to the BS. The helper is placed in close proximity to the legitimate device and simultaneously transmits at random times to allow the detection of cancellation attacks at the BS. We showed that a pairing protocol such as the DH key agreement protocol using HELP as an integrity protection primitive can resist MitM attacks without requiring an authenticated channel between D and the BS. This was not previously feasible by any of the pairing methods if signal cancellation is possible. We studied various implementation details of HELP and analyzed its security. Our protocol is aimed at alleviating the device pairing problem for IoT devices that may not have the appropriate interfaces for entering or pre-loading cryptographic primitives.

Acknowledgments

We thank our shepherd Manos Antonakakis and the anonymous reviewers for their insightful comments. This research was supported in part by the NSF under grants CNS-1409172 and CNS-1410000. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the author(s) and do not necessarily reflect the views of the NSF.

References

[1] Balfanz, D., Smetters, D. K., Stewart, P., and Wong, H. C. Talking to strangers: authentication in ad-hoc wireless networks. In Proc. of NDSS '02 (2002).

[2] Bellare, M., and Namprempre, C. Authenticated encryption: Relations among notions and analysis of the generic composition paradigm. In Proc. of the International Conference on the Theory and Application of Cryptology and Information Security (2000), Springer, pp. 531–545.
…tive, for two conditions with wrong inference (1 − pI): (a) first, the helper bit is zero, i.e., H injects energy on the t2i slot, the device bit is one, and the adversary bit is one; (b) second, the helper bit is one, i.e., H injects energy on the t2i−1 slot, the device bit is zero, and the adversary bit is zero.

Let pr denote the probability that the BS rejects the corresponding bit of [h(m′D)] at bit bi due to cases (a) and (b). This probability can be calculated as:

    pr = (Pr[bH_i = 0, bD_i = 1, bA_i = 1]
          + Pr[bH_i = 1, bD_i = 0, bA_i = 0]) · Pr[wrong inference]
       = (1/2 · 1/2 · 1/2 + 1/2 · 1/2 · 1/2)(1 − pI)
       = (1 − pI)/4.                                        (5)

In (5), bX_i denotes the transmitted value of device X at bit bi, and pI is the probability of inference of the helper's activity by A on a given bit. For (5), we have used the fact that a strictly universal hash function is part of HELP. For a strictly universal hash function, output …

    µ(|s|) = a^(−|s|) = (|s|)^(−|s| / log_a |s|).

Since |s| > n0, it follows that

    |s| / log_a |s| > n0 / log_a n0 > n0 / n0^(1/a) > c.

Therefore,

    µ(|s|) = a^(−|s|) = (|s|)^(−|s| / log_a |s|) < n^(−c).

This proves that (1 − pr)^|s| is a negligible function for a ≠ 1, or equivalently pr ≠ 0, thus concluding the proof of the negligibility of δ for pr ≠ 0.

Appendix B

Proposition. A legitimate device D pairs with a rogue BS with probability δ + ε, where

    δ = p′I^|s′|                                            (7)
      = (|s′|)^(−|s′| / log_a |s′|).

…whereas a bit one corresponds to (ON, OFF). With the superimposing of H's signal, the BS will also receive slot combinations of (ON, ON). The adversary can extract some information about mD from the (OFF, ON) and (ON, OFF) slots in [h(mD), mD] + mH, but to extract the information from (ON, ON) slots without knowledge of s, the adversary has to make intelligent guesses for the received (ON, ON) slots, which is parameterized as the probability of inferring the helper's activity by A.

Let p′I be the inference probability for detecting the presence of H's signal; this is discussed in detail in Section 6. Note that, if H's signal is wrongly inferred (with probability (1 − p′I)), A maps the received bit on which H was active to a wrong outcome. The adversary makes a wrong mapping when it receives (ON, ON) slots in the received [h(mD), mD] + mH; this happens when A cannot detect the presence of the helper's signal on a slot where D has injected no energy. Thus,

    pr = Pr[wrong inference] = 1 − p′I.                     (8)

In (8), p′I is the probability that A detects H's signal correctly on a particular bit.

We now compute the probability δ that A extracts the correct mD from the received signal [h(mD), mD] + mH. The adversary decodes the correct mD only if none of the bits are decoded wrongly. Each bit is wrongly mapped with probability pr, given by (8). As rejection on each slot occurs independently, the overall probability of correctly decoding mD from [h(mD), mD] + mH is computed via the Binomial distribution with parameter pr. That is,

    δ = 1 − Σ_{x=1}^{|s′|} B(x, |s′|, pr)
      = 1 − [Σ_{x=0}^{|s′|} B(x, |s′|, pr) − B(0, |s′|, pr)]
      = (1 − pr)^|s′|
      = (1 − (1 − p′I))^|s′|
      = p′I^|s′|.                                           (9)

where B(α, β, γ) is the Binomial probability mass function and s′ ⊆ s, which corresponds to the num…

Since |s′| > n0, it follows that

    |s′| / log_a |s′| > n0 / log_a n0 > n0 / n0^(1/a) > c.

Therefore,

    µ(|s′|) = a^(−|s′|) = (|s′|)^(−|s′| / log_a |s′|) < n^(−c).

This proves that (1 − pr)^|s′| is a negligible function for a ≠ 1, or equivalently pr ≠ 0.

After the attacker extracts mD, the rogue BS needs to pass the challenge-response authentication in the key confirmation phase. Assuming the use of a strongly universal hash function to compute the response h_{kD,BS′}(IDBS || CD || 0), he can only pass this authentication if he has the correct key kD,BS′. Otherwise, his success probability ε is negligible. But he can only obtain the correct key by extracting the correct mD value. Therefore, the success probability of the rogue BS to pair with the device is upper bounded by δ + ε, where ε is a negligible function (of the length of the hash function). Since δ is a negligible function of |s′|, which can be the same as the message length (and here mD is a DH public number, whose bit length is typically larger than or equal to the hash length), the overall probability is a negligible function. This concludes the proof.
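The collapse of the Binomial sum in (9) to (1 − pr)^|s′| is just the observation that decoding succeeds only when zero bits are mis-mapped. A small numeric check (our own sketch, using an arbitrary example value for pr) confirms the identity:

```python
# Verify the identity behind (9):
#   1 - sum_{x=1..n} B(x, n, p) = B(0, n, p) = (1 - p)^n,
# where B is the Binomial probability mass function.
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p_r = 20, 0.3  # example: |s'| = 20 helper slots, per-bit error p_r = 0.3
delta = 1 - sum(binom_pmf(x, n, p_r) for x in range(1, n + 1))

assert abs(delta - (1 - p_r)**n) < 1e-12
print(delta)  # equals (0.7)**20, a vanishingly small acceptance probability
```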
Lei Xu1 , Jeff Huang1 , Sungmin Hong1 , Jialong Zhang1,2 , and Guofei Gu1
1 Texas A&M University, {xray2012,jeffhuang,ghitsh,guofei}@tamu.edu
2 IBM Research, Jialong.Zhang@ibm.com
OFSwitchManager.switchDisconnected() {
    ...
    4:  this.switches.remove(dpid);
}

OFSwitchManager.getSwitch(dpid) {
    ...
    10: return this.switches.get(dpid);
}
        → Race Condition!

Figure 1: A harmful race condition in Floodlight v1.1.

…tions are common due to a massive number of network events on the shared network states. To meet the performance requirement, the event handlers in the SDN controller may run in parallel, which allows race conditions on the shared network states. By design, all such race conditions should be benign, since they are protected by mutual exclusion synchronizations and do not break the consistency of the network states. However, in practice, many of these race conditions become harmful races, because it is difficult for SDN developers to avoid logic flaws such as the one in Figure 1.

The key insight of the State Manipulation Attack is that we can leverage the existence of such harmful race conditions in SDN controllers to trigger inconsistent network states. Nevertheless, a successful attack requires tackling two challenging problems:

• First, how to locate such harmful race conditions in the SDN controller source code?

• Second, how to trigger the harmful race conditions as an external attacker who has no control of the controller schedule?

For the first problem, the key challenges are that it is generally unknown whether a race condition is harmful or not, and that detecting race conditions in a program is generally undecidable. Although many data race detectors have been developed for different domains [18, 32, 22, 19, 31, 36], there is no existing tool to detect race conditions in SDN controllers. We note that race conditions are different from data races but are a more general phenomenon; while data races concern whether accesses to shared variables are properly synchronized or not, race conditions concern the memory effect of high-level races, regardless of synchronization. For example, a

…tion is that harmful race conditions are commonly rooted in two conflicting operations upon shared network states that are not commutative, i.e., mutating their scheduling order leads to a different state even though the two operations can be well-synchronized (e.g., by using locks). Because there is no pre-defined order between the two conflicting operations, we can actively control the scheduler (e.g., by inserting delays) to run an adversarial schedule, which forces one operation to execute after the other. If we observe an erroneous state (e.g., an exception or a crash) in the adversarial schedule, we have found a harmful race condition.

For the second problem, the key challenge is that a harmful race condition occurs very rarely in normal operations and relies on a combination of a certain input and an unexpected thread schedule to manifest. As the adversary typically has no control of the machine or operating system running the SDN controllers, even if a harmful race condition is known, it is difficult for an adversary to create the input and schedule combination to trigger it.

Nevertheless, we show that an adversary can remotely exploit many harmful race conditions with a high success ratio by injecting the "right" external events into the SDN network. Because SDN controllers define an event handler to process each network event, a correlation between external network events and their corresponding event handlers can be established by analyzing the controller source code. By further mapping the event handlers to their operations, we can correlate the conflicting operations in a harmful race condition to their corresponding network events. An adversary can then generate many sequences of these network events repeatedly to increase the chance of hitting a right schedule to trigger the harmful race condition.

We have designed and implemented a framework called CONGUARD for exploiting concurrency vulnerabilities in the SDN control plane, and we have evaluated it on three mainstream open-source SDN controllers – Floodlight, ONOS, and OpenDaylight – with 34 applications in total. CONGUARD found 15 previously unknown harmful race conditions in these SDN controllers. We show that these harmful race conditions can incur serious
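The non-commutativity behind the Figure 1 race can be seen with a toy model (a simplified Python sketch of our own, not Floodlight's actual Java code): both schedules are individually well-synchronized, but only the adversarial ordering leaves the lookup with a vanished switch entry.

```python
# Toy model of the Figure 1 race: two conflicting, individually well-synchronized
# operations on a shared switch table whose *order* determines the final state.
switches = {"00:00:00:01": "switch-object"}

def get_switch(dpid):
    # stands in for OFSwitchManager.getSwitch() (line 10 in Figure 1)
    return switches.get(dpid)

def switch_disconnected(dpid):
    # stands in for OFSwitchManager.switchDisconnected() (line 4 in Figure 1)
    switches.pop(dpid, None)

# Benign schedule: the lookup runs before the disconnect handler.
found_before = get_switch("00:00:00:01") is not None

# Adversarial schedule: the disconnect handler is forced to run first, so the
# same lookup now yields None and any later dereference raises an exception.
switch_disconnected("00:00:00:01")
found_after = get_switch("00:00:00:01") is not None

print(found_before, found_after)  # True False
```

Swapping the order of the two operations changes the observable state, which is exactly the non-commutative conflict that an adversarial schedule exploits.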
…them are possible if the SDN network uses in-band con… network interfaces.

We do not assume that the adversary can compromise SDN controllers or switches, and we do not assume the adversary can compromise SDN applications or protocols. That is, we consider the operating systems of SDN controllers and switches to be well protected from the adversary, and the control channels between SDN controllers…

[Figure: event providers and event handlers in the control plane; a data-plane request/offer/response exchange in the Floodlight LoadBalancer triggering SWITCH_LEAVE and HOST_LEAVE events.]

Figure 3: SWITCH_LEAVE event generated by TCP Resets.

3 An OpenFlow switch reports all packets to the SDN control plane if those packets do not hit its existing flow rule table.
                        Trace Processing                  Race Detection Results
Name          Version   #RT       #OT      RE      Time    #Races   #RSVs
Floodlight    1.1       234,517   8,063    96.6%   43s     153      22
Floodlight    1.2       410,128   52,271   87.2%   101s    184      35
OpenDaylight  0.1.7     47,855    3,752    92.1%   5s      221      26
ONOS          1.2       69,214    1,292    98.1%   5s      13       5

Table 3: Race detection results for each SDN controller.
6.1 Detection Results lations in SDN networks. Because SDNRacer can also
work on the Floodlight controller, we directly compared
Table 3 summarizes our race detection results in Flood- their results with ours. In a single-switch topology,
light 1.1 and 1.2, ONOS 1.2 and OpenDaylight 0.1.7. In SDNRacer reported 2, 281 data races. However, we
total, our tool found 153 race conditions on 22 network find that none of those data races are relevant to our
state variables in Floodlight 1.1, 184 race conditions on detected harmful race conditions. The reason lies in
35 variables in Floodlight 1.2, 221 race conditions on 26 variables in OpenDaylight, and 13 race conditions on 5 variables in ONOS. The numbers of detected race operations and network state variables in ONOS are much smaller than those of the other two controllers because ONOS uses a centralized data store to manage the network states. In addition, our results show that our offline trace analysis is highly effective and efficient. The preprocessing step reduces the size of traces (by removing redundant events) by more than 87%. For all three controllers, the offline analysis finished in less than two minutes.

To evaluate the effectiveness of the SDN domain-specific happens-before rules, we compared the following two configurations when running race detection with ConGuard on Floodlight version 1.1: (1) enforcing only thread-based happens-before rules; (2) enforcing both thread-based and SDN-specific rules. Our results show that adopting the SDN-specific happens-before rules reduces the number of reported race conditions by 105 in total (153 vs. 258). We manually inspected all the race-condition warnings filtered out by the SDN-specific rules and found that all of them are false positives. We expect that the happens-before rules formulated in this work greatly complement existing thread-based rules for more precise concurrency defect detection in SDN controllers.

6.2 Comparing With Existing Techniques

To evaluate the effectiveness of our approach for identifying harmful race conditions, we also compared ConGuard with an SDN-specific race detector, SDNRacer [18], and a state-of-the-art general dynamic race detector, RV-Predict (version 1.7) [22].

Comparing with SDNRacer. SDNRacer is a dynamic race detector that also locates concurrency violations in SDNs. The key difference is that SDNRacer only models memory operations in SDN switches but ignores internal state operations in SDN controllers. In this sense, we consider our new detection solution orthogonal and complementary to SDNRacer.

Comparing with RV-Predict. RV-Predict is a state-of-the-art general-purpose data race detector that achieves maximal detection capability based on a program trace, but it does not consider harmful race conditions and does not have SDN-specific causality rules. We evaluated RV-Predict as a Java agent for Floodlight v1.1 with our implemented network event generator and REST test scripts. We found that RV-Predict reported a total of 29 data races. However, none of them was harmful and none of them was related to harmful race conditions.4 The reason is that all those harmful race conditions are caused by well-synchronized operations in Java concurrency libraries, which are not data races.

6.3 ConGuard Runtime Performance

We evaluated the runtime performance of ConGuard for trace collection using Cbench [8], an SDN controller performance benchmark. We use Cbench to generate a sequence of OFP_PACKET_IN events and measure the delay. To remove network latency, we run Cbench on the same physical machine as the SDN controllers and vary the testbed from 2 switches to 16 switches. Our results show that ConGuard incurs about 30X, 10X and 8X latency overhead for Floodlight, ONOS and OpenDaylight, respectively. The network functionalities work properly, and the instrumentation does not affect the collection of execution traces. The performance overhead mainly comes from instrumentation sites that frequently write event traces into the database. Although apparently

4 We manually backtracked the call graph information for every data race reported by RV-Predict and checked if it could lead to harmful race conditions.
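To make the role of happens-before rules concrete, the following minimal sketch (ours, not ConGuard's actual implementation) flags two accesses to the same state variable as a race when neither access is ordered before the other under the accumulated happens-before edges; SDN-specific rules would simply contribute additional ordering edges to the clocks and thereby filter more warnings:

```python
from collections import defaultdict

class VectorClock(dict):
    """Map thread-id -> logical time; absent entries are 0."""
    def happens_before(self, other):
        # self -> other iff self is componentwise <= other and they differ.
        return all(t <= other.get(tid, 0) for tid, t in self.items()) and self != other

def detect_races(events):
    """events: list of (thread, op, var, vclock) tuples, where each vclock
    already incorporates thread-based and (hypothetically) SDN-specific
    happens-before edges. Two accesses to the same variable race if at
    least one is a write and neither happens-before the other."""
    races = []
    by_var = defaultdict(list)
    for ev in events:
        thread, op, var, vc = ev
        for (t2, op2, _var2, vc2) in by_var[var]:
            if 'write' in (op, op2):
                if not vc2.happens_before(vc) and not vc.happens_before(vc2):
                    races.append((var, (t2, op2), (thread, op)))
        by_var[var].append(ev)
    return races
```

In this sketch, adding a domain-specific happens-before edge makes one access's clock dominate the other's, which suppresses the warning, mirroring how the SDN-specific rules filtered the 105 false positives above.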
[Table 4 fragment: 6∗ <OFP_PACKET_IN, SWITCH_LEAVE>; 9† <OFP_PACKET_IN, REST_REQUEST>]
could break the service chain for OFP_PACKET_IN event handlers. Similarly, we found 1 bug (Bug-14) in OpenDaylight, i.e., a concurrent HOST_LEAVE event can break the host event handling chain.

6.5 Remote Exploitation Analysis

We consider that all of the detected harmful race conditions can be triggered non-deterministically in normal operations of an SDN/OpenFlow network. In addition, we study the adversarial exploitation of those harmful race conditions by a remote attacker, as discussed in Section 3.1. We first investigate their external triggers, i.e., the trigger event and update event pairs, as shown in Table 4. For the 15 harmful race conditions we detected, we found that 9 of them can be exploited by external network events. An attacker with control of compromised hosts/virtual machines in SDN networks can easily trigger three harmful race conditions (i.e., Bug-11, Bug-12 and Bug-14) by generating OFP_PACKET_IN, HOST_JOIN, HOST_LEAVE, PORT_UP, and PORT_DOWN events. Moreover, the attacker can remotely exploit 6 more harmful race conditions (i.e., Bug-1, Bug-2, Bug-3, Bug-4, Bug-5 and Bug-6) by utilizing SWITCH_JOIN and SWITCH_LEAVE events when the SDN network uses in-band control messages. For the remaining 6 harmful race conditions (i.e., Bug-7, Bug-8, Bug-9, Bug-10, Bug-13, and Bug-15), we found that they correlate with REST API requests, which are administrative events and might be protected by TLS/SSL. We consider the exploitation of those 6 harmful race conditions out of scope for this paper, since we do not assume an attacker can generate authenticated administrative events. Also, we found that there might be multiple triggers for a specific harmful race condition, since SDN applications may reference the same network state variable in order to react to various network events.

Moreover, based on the results in Table 4, we evaluate the feasibility of an external attacker exploiting harmful race conditions. In particular, we utilize Mininet to inject ordered attack event sequences with proper timing and test how many trials an external attacker needs to trigger a harmful race condition. Table 6 shows the average number of injected event sequences, over 5 successful exploitations, for an attacker to exploit a harmful race condition in an SDN controller.6 Consequently, we found that an attacker can exploit 7 out of the 9 harmful race conditions within only hundreds of attempts.

Furthermore, Table 5 lists some feedback information that an attacker can use to infer the result of exploitation. For Bug-1, Bug-2, Bug-3, and Bug-4, the attacker can infer the failure of an exploitation by monitoring LLDP packets from the SDN controller to the active ports of the activated switch. For Bug-5 and Bug-6, the attacker can notice unsuccessful exploitations by receiving re-

6 Note that since some attack event sequences may trigger multiple harmful race conditions (e.g., <SWITCH_LEAVE, SWITCH_JOIN> can trigger Bug-1, Bug-2, Bug-3, and Bug-4), we only record the first bug exploitation, because an exploitation of a harmful race condition may disrupt the operation of the SDN controller.
Table 5: Indications of failed exploitation.

Bug #    Indications of Failed Exploitation
1,2,3,4  receipt of LLDP packets
5,6      receipt of responses from the service IP address
12       receipt of DHCP response/offer messages

Table 6: Remote exploitation result.

Bug #  Attack Case                      Trials (average)
1      (SWITCH_JOIN, SWITCH_LEAVE)      10.6
2      (SWITCH_JOIN, SWITCH_LEAVE)      78.4
3      (SWITCH_JOIN, SWITCH_LEAVE)      120
4      (SWITCH_JOIN, SWITCH_LEAVE)      10
5      (OFP_PACKET_IN, SWITCH_LEAVE)    67.6
6      (OFP_PACKET_IN, SWITCH_LEAVE)    106.8
11     (OFP_PACKET_IN, HOST_LEAVE)      -
12     (OFP_PACKET_IN, HOST_LEAVE)      1
14     (HOST_LEAVE, HOST_JOIN)          -

6.6 Case Studies

Here we detail two state manipulation attack examples as briefly introduced in Section 3.3.

Sniffing the Physical IP Address of a Service Replica. In order to exploit the harmful race condition remotely,

Figure 11: Privacy leakage in Floodlight LoadBalancer.

Disrupting the Packet Processing Service. We set up an attack experiment in Mininet (with a 500ms delay link between the DHCP server and its connected switch), where we injected ordered attack event sequences, i.e., <OFP_PACKET_IN, HOST_LEAVE>. In detail, we controlled a host to send out a DHCP request (to generate OFP_PACKET_IN) and turned off the network interface (to inject a HOST_LEAVE event) immediately after the transmission of the DHCP request. As a result, the harmful race condition was triggered by injecting an attack event sequence, which disrupts the packet processing service (as shown in Figure 12) that dispatches incoming packets to the OFP_PACKET_IN event handlers of the SDN controller/applications. The exploitation possibility of this harmful race condition is comparatively high for a remote attacker, since its vulnerable window is subject to the round-trip delay between the ONOS controller and the DHCP server. In this case, a tactical attacker can even pick a network congestion period to increase the success ratio of the exploitation.
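The trial counts in Table 6 can be understood with a toy timing model: each injected <trigger, update> pair succeeds only if the update event lands inside the controller's vulnerable window opened by the trigger event. The sketch below is a simulation with illustrative parameters (our illustration, not the Mininet harness used in the experiments):

```python
import random

def inject_until_triggered(vulnerable_window, rtt_jitter, max_trials=10000, seed=0):
    """Simulate an attacker repeatedly injecting the ordered pair
    <OFP_PACKET_IN, HOST_LEAVE> and count the trials until the update
    event lands inside the controller's vulnerable window.
    All timing parameters here are illustrative, not measured values."""
    rng = random.Random(seed)
    for trial in range(1, max_trials + 1):
        # The attacker sends a DHCP request (OFP_PACKET_IN), then
        # immediately downs the interface (HOST_LEAVE). The HOST_LEAVE
        # arrives after a jittered delay; the race triggers if it lands
        # while the controller is still inside the vulnerable window.
        arrival = rng.uniform(0, rtt_jitter)
        if arrival < vulnerable_window:
            return trial
    return None  # never triggered within the trial budget
```

A wider vulnerable window (e.g., due to the round-trip delay to the DHCP server, or deliberate congestion timing) lowers the expected trial count, matching the intuition behind the "hundreds of attempts" result above.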
[9] ASM. Java Bytecode Analysis Framework. http://asm.ow2.org/.

[10] Ball, T. The Concept of Dynamic Analysis. In FSE'99 (1999).

[11] Ball, T., Bjorner, N., Gember, A., Itzhaky, S., Karbyshev, A., Sagiv, M., Schapira, M., and Valadarsky, A. VeriCon: Towards Verifying Controller Programs in Software-Defined Networks. In PLDI'14 (2014).

[12] Borisov, N., Johnson, R., Sastry, N., and Wagner, D. Fixing Races for Fun and Profit: How to Abuse atime. In USENIX Security'05 (2005).

[13] Braun, W., and Menth, M. Software-Defined Networking Using OpenFlow: Protocols, Applications and Architectural Design Choices. In Future Internet (2014).

[14] Cai, X., Gui, Y., and Johnson, R. Exploiting Unix File-System Races via Algorithmic Complexity Attacks. In S&P'09 (2009).

[15] Canini, M., Venzano, D., Peresini, P., Kostic, D., and Rexford, J. A NICE Way to Test OpenFlow Applications. In NSDI'12 (2012).

[30] Mai, H., Khurshid, A., Agarwal, R., Caesar, M., Godfrey, P., and King, S. Debugging the Data Plane with Anteater. In SIGCOMM'11 (2011).

[31] Maiya, P., Kanade, A., and Majumdar, R. Race Detection for Android Applications. In PLDI'14 (2014).

[32] Miserez, J., Bielik, P., El-Hassany, A., Vanbever, L., and Vechev, M. SDNRacer: Detecting Concurrency Violations in Software-Defined Networks. In SOSR'15 (2015).

[33] Petrov, B., Vechev, M., Sridharan, M., and Dolby, J. Race Detection for Web Applications. In PLDI'12 (2012).

[34] Porras, P., Cheung, S., Fong, M., Skinner, K., and Yegneswaran, V. Securing the Software-Defined Network Control Layer. In NDSS'15 (2015).

[35] Porras, P., Shin, S., Yegneswaran, V., Fong, M., Tyson, M., and Gu, G. A Security Enforcement Kernel for OpenFlow Networks. In HotSDN'12 (2012).

[36] Raychev, V., Vechev, M., and Sridharan, M. Effective Race Detection for Event-Driven Programs. In OOPSLA'13 (2013).
Grant Ho†  Aashish Sharma‡  Mobin Javed†  Vern Paxson†∗  David Wagner†
† UC Berkeley   ‡ Lawrence Berkeley National Laboratory   ∗ International Computer Science Institute
[Figure 2 and Figure 3 plots omitted.]

Figure 2: Distribution of the number of emails sent per From name. Nearly 40% of all From names appear in only one email and over 60% of all From names appear in three or fewer emails.

Figure 3: Distribution of the total number of From addresses per From name (for names that send over 100 emails) across all emails sent by the From name. Over half (52%) of these From names sent email from two or more From addresses (i.e., have at least one new From address).
To address this gap, one might flag all emails with a new or previously unknown From name (e.g., any email whose From name has been seen in two or fewer emails leads to an alert). Unfortunately, this approach generates an overwhelming number of alerts in practice, because millions of From names are only ever seen in a few emails. Figure 2 shows the distribution of the number of emails per From name in our dataset. In particular, we find that over 60% of From names sent three or fewer emails and over 40% of From names sent exactly one email. Thus, even if one ran a detector retrospectively to alert on every email with a From name that had never been seen before and did not eventually become an active and engaged sender, it would produce over 1.1 million alerts: a false positive rate of less than 1% on our dataset of nearly 370 million emails, but still orders of magnitude more than our target. Even though spam might account for a proportion of these emails with new From names, LBNL's security staff investigated a random sample of these emails and found a spectrum of benign behavior: event/conference invitations, mailing list management notices, trial software advertisements, and help support emails. Thus, a detector that only leverages the traditional approach of searching for anomalies in header values faces a stifling range of anomalous but benign behavior.

4.2 Challenge 2: Churn in Header Values

Even if we were to give up on detecting attacks that come from previously unseen From names or addresses, a detector based on header anomalies still runs into yet another spectrum of diverse, benign behavior. Namely, header values for a sender often change for a variety of benign reasons. To illustrate this, we consider all From names that appear in at least 100 emails (our dataset contains 125,172 of them) and assess the frequency at which these names use a new From email address when sending email.

Figure 3 shows the cumulative distribution of the total number of From email addresses per From name. From this graph, we see that even among From names with substantial history (over 100 emails sent), there is considerable variability in header values: 52% of these From names send email from more than one From email address. We find that 1,347,744 emails contain a new From email address which has never been used in any of the From name's prior emails. Generating an alert for each of these emails would far exceed our target of 10 alerts per day.

This large number of new email addresses per From name stems from a variety of different sources: work vs. personal email addresses for a user, popular human names where each email address represents a different person in real life (e.g., multiple people named John Smith), professional society surveys, and functionality-specific email addresses (e.g., Foo <noreply@foo.com>, Foo <help@foo.com>, Foo <donate@foo.com>). While it might be tempting to leverage domain reputation or domain similarity between a new From address and the From name's prior addresses to filter out false positives, this fails in a number of different cases. For example, consider the case where Alice suddenly sends email from a new email address whose domain is a large email hosting provider; this could either correspond to Alice sending email from her personal email account, or it might rep-
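The bookkeeping behind these measurements is simple to sketch. The function below (our illustration, with a hypothetical record format) counts emails per From name and tallies emails whose From address is new for that name, i.e., the naive alerting strategy whose volume is quantified above:

```python
from collections import defaultdict

def header_churn_stats(emails):
    """emails: iterable of (from_name, from_addr) pairs in arrival order.
    Returns (emails_per_name, new_addr_alerts), where new_addr_alerts
    counts emails whose From address was never previously seen for that
    From name (the name's very first email is not counted as an alert)."""
    emails_per_name = defaultdict(int)
    addrs_per_name = defaultdict(set)
    new_addr_alerts = 0
    for name, addr in emails:
        emails_per_name[name] += 1
        if addrs_per_name[name] and addr not in addrs_per_name[name]:
            new_addr_alerts += 1
        addrs_per_name[name].add(addr)
    return dict(emails_per_name), new_addr_alerts
```

Run over the dataset described above, this scheme would fire on every one of the 1,347,744 emails with a new From address, which is what makes the naive approach untenable.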
ture counts the number of days between the first visit by any employee to a URL on the clicked link's FQDN and the time when the clicked link's email initially arrived at LBNL.

We chose to characterize a clicked link's reputation in terms of its FQDN, rather than the full URL, because over half of the clicked URLs in our dataset had never been visited prior to a click-in-email event. Consequently, operating at the granularity of the full URL would render the URL reputation features ineffective, because the majority of URLs would have the lowest possible feature values (i.e., never visited prior to the email recipient's click). Additionally, using a coarser granularity such as the URL's registered domain name or its effective second-level domain could allow attackers to acquire high-reputation attack URLs by hosting their phishing webpages on popular hosting sites (e.g., attacker.blogspot.com). By defining a URL's reputation in terms of its FQDN, we mitigate this risk.

5.2.2 Sender Reputation Features

Name Spoofer: As discussed earlier (§ 2.1.1), in this attacker model Mallory masquerades as a trusted entity by spoofing the name in the From header, but she does not spoof the name's true email address. Because the trusted user that Mallory impersonates does not send email from Mallory's spoofed address, the spearphishing email will have a From email address that does not match any of the historical email addresses for its From name. Therefore, the first sender reputation feature counts the number of previous days where we saw an email whose From header contains the same name and address as the email being scored.

Also, in this attacker model, the adversary spoofs the From name because the name corresponds to someone known and trusted. If that name did not correspond to someone trustworthy or authoritative, there would be no point in spoofing it, or it would manifest itself under our previously unseen attacker threat model. Thus, the second sender reputation feature for a clicked email reflects the trustworthiness of the name in its From header. We measure the trustworthiness of a name by counting the total number of weeks where this name sent at least one email on every weekday of the week. Intuitively, the idea is that From names that frequently and consistently send emails will be perceived as familiar and trustworthy.

Previously Unseen Attacker: In this threat model (§ 2.1.1), Mallory chooses a name and email address that resembles a known or authoritative entity, but where the name and email address do not exactly match any existing entity's values (e.g., IT Support Team <helpdesk@company.net>); if the name or address did exactly match an existing entity, the attack would instead fall under the name spoofer or address spoofer threat model. Compared to name-spoofing attacks, these attacks are more difficult to detect because we have no prior history to compare against; indeed, prior work does not attempt to detect attacks from this threat model. To deal with this obstacle, we rely on an assumption that the attacker will seek to avoid detection, and thus the spoofed identity will be infrequently used; each time Mallory uses the spoofed identity, she runs the risk that the employee she's interacting with might recognize that Mallory has forged the name or email address and report it. Accordingly, we use two features: the number of prior days that the From name has sent email, and the number of prior days that the From address has sent email.

Lateral Attacker: This sub-detector aims to catch spearphishing emails sent from a compromised user's account (without using any spoofing). To detect this pow-
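As a concrete illustration of the two previously-unseen-attacker features, the sketch below computes them from a hypothetical history of (day, From name, From address) records; the record format and helper names are ours, and the deployed detector's exact bookkeeping may differ:

```python
def sender_reputation(history, name, addr):
    """history: iterable of (day, from_name, from_addr) records preceding
    the email being scored. Returns the two features discussed above:
    the number of distinct prior days on which the From name sent email,
    and the number of distinct prior days on which the From address did.
    Low values indicate an infrequently used (more suspicious) identity."""
    name_days = {d for d, n, _ in history if n == name}
    addr_days = {d for d, _, a in history if a == addr}
    return len(name_days), len(addr_days)
```

A spoofed identity that Mallory uses sparingly to avoid detection yields (0, 0) or near-zero values, whereas a long-established sender scores high on both features.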
Table 3: Summary of our real-time detection results for emails in our test window from Sep 1, 2013 - Jan 14, 2017 (1,232 days).
Rows represent the type/classification of an alert following analysis by security staff members at LBNL. Columns 2–4 show alerts
broken down per attacker model (§ 5.2.2). Column 5 shows the total number of spearphishing campaigns identified by our real-time
detector in the numerator and the total number of spearphishing campaigns in the denominator. Out of 19 spearphishing attacks,
our detector failed to detect 2 attacks (one that successfully stole an employee’s credentials and one that did not); both of these
missed attacks fall under the previously unseen attacker threat model, where neither the username nor the email address matched
an existing entity.
7 known successful spearphishing attacks; this includes 1 spearphishing exercise, designed by an external security firm and conducted independently of our work, that successfully stole employee credentials. Additionally, members of LBNL's security team manually investigated and labeled 15,521 alerts. We generated these alerts from a combination of running (1) an older version of our detector that used manually chosen thresholds instead of the DAS algorithm; and (2) a batched version of our anomaly scoring detector, which ran the full DAS scoring procedure over the click-in-email events in our evaluation window (Sep. 2013 onward) and selected the highest scoring alerts within the cumulative budget for that timeframe.

From this procedure, we identified a total of 19 spearphishing campaigns: 9 which succeeded in stealing an employee's credentials and 10 where the employee clicked on the spearphishing link, but upon arriving at the phishing landing page, did not enter their credentials.5 We did not augment this dataset with simulated or injected attacks (e.g., from public blogposts) because the true distribution of feature values for spearphishing attacks is unknown. Even for specific public examples, without actual historical log data one can only speculate on what the values of our reputation features should be.

To evaluate our true positive rates, we ran our real-time detector (§ 5.5) on each attack date, with a budget of 10 alerts per day. We then computed whether or not the attack campaign was flagged in a real-time alert generated on those days. Table 3 summarizes our evaluation results. Overall, our real-time detector successfully identifies 17 out of 19 spearphishing campaigns, an 89% true positive rate.

Of these, LBNL's incident database contained 7 known, successful spearphishing campaigns (their incident database catalogues successful attacks, but not ones that fail). Although our detector missed one of these successful attacks, it identified 2 previously undiscovered attacks that successfully stole an employee's credentials. The missed attack used a now-deprecated feature from Dropbox [7] that allowed users to host static HTML pages under one of Dropbox's primary hostnames, which is both outside of LBNL's NIDS visibility because of HTTPS and inherits Dropbox's high reputation. This represents a limitation of our detector: if an attacker can successfully host the malicious phishing page on a high-reputation site or outside of the network monitor's visibility, then we will likely fail to detect it. However, Dropbox and many other major file sharing sites (e.g., Google Drive) have dropped these website-hosting features due to a number of security concerns, such as facilitating phishing. Ironically, in the specific case of Dropbox, industry reports mention a large increase in phishing attacks targeted against Dropbox users, where the phishing attack would itself be hosted via Dropbox's website hosting feature, and thus appear to victims under Dropbox's real hostname [11]. Among the attacks that our detector correctly identified, manual analysis by staff members at LBNL indicated that our sub-detectors aptly detected spearphish that fell under each of their respective threat models (outlined in Section 2.1).

Figure 6: Histogram of the total number of daily alerts generated by our real-time detector (cumulative across all three sub-detectors) on 100 randomly sampled days. The median is 7 alerts/day.

5 A campaign is identified by a unique triplet of <the attack URL, email subject, and email's From header>.
[Figure 7 plots omitted; x-axis: Logins-by-Sending-User from New IP Addr's City.]

Figure 7: Both plots show the sender reputation feature values (scaled between [0, 1]) of a random sample of 10,000 lateral attacker click-in-email events. Filled red points denote events that generated alerts within the daily budget by DAS (left-hand figure) and KDE (right-hand figure).
Algorithm    Detected  Daily Budget
kNN          3/19      10
kNN          17/19     2,455
GMM          4/19      10
GMM          17/19     147
KDE          4/19      10
KDE          17/19     91
DAS (§ 5.4)  17/19     10

Table 4: Comparing classical anomaly detection techniques to our real-time detector, on the same dataset and features. For each of the standard anomaly detection algorithms, the first row shows the number of attacks detected under the same daily budget as ours; the second row shows what the classical technique's budget would need to be to detect all 17 attacks that our real-time detector identified on a daily budget of 10 alerts per day.

of points in the upper-right corner, which illustrates one of the limitations of standard techniques discussed in Section 5.4: they do not take into account the directionality of feature values. Because extremely large feature values occur infrequently, KDE ranks those events as highly anomalous, even though they correspond to benign login sessions where the user happened to log in from a new IP address in a residential city near LBNL. Second, KDE selects a group of events in the bottom-right corner, which correspond to login sessions where an employee logged in from a city that they have frequently authenticated from in the past, but where few other employees have logged in from. KDE's selection of these benign logins illustrates another limitation of standard techniques: they often select events that are anomalous in just one dimension, without taking into account our domain knowledge that an attack will be anomalous in all dimensions. Even though the bottom-right corner represents employee logins where few other employees have logged in from the same city, they are not suspicious, because that employee has previously logged in many times from that location: they correspond to benign logins by remote employees who live and work in cities far from LBNL's main campus. Thus, DAS can significantly outperform standard unsupervised anomaly detection techniques because it allows us to incorporate domain knowledge of the features into DAS's decision making.

7 Discussion and Limitations

Detection systems operate in adversarial environments. While we have shown our approach can detect both known and previously undiscovered spearphishing attacks, there are limitations and evasion strategies that adversaries might pursue.

Limited Visibility: Our detection strategy hinges on identifying whether an email's recipient engaged in a potentially dangerous action. In the case of credential spearphishing, LBNL's network traffic logs allowed us to infer this behavior. However, our approach has two limitations: first, email and network activity conducted outside of LBNL's network borders will not get recorded in the NIDS logs. Second, LBNL made a conscious decision not to man-in-the-middle traffic served over HTTPS; thus, we will miss attacks where the email links to an HTTPS website. Both of these are typical challenges that network-level monitoring faces in practice. One strategy for alleviating this problem would be to use endpoint monitoring agents on employee machines. Alternatively, a detector could leverage SNI [23] to develop
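The directionality that DAS exploits can be sketched with a simple dominance count. Assuming features are oriented so that lower values are more suspicious (matching the ≤ comparators supplied to DAS), an event's score is the number of events it is at least as suspicious as in every dimension; this quadratic-time sketch is our illustration and omits the optimizations of the actual scoring procedure:

```python
def das_scores(events):
    """events: list of feature tuples, where lower is more suspicious in
    every dimension. Each event is scored by the number of events it
    dominates, i.e., is at least as suspicious as in all dimensions
    (including itself). Higher scores rank as more anomalous."""
    def dominates(e1, e2):
        return all(x <= y for x, y in zip(e1, e2))
    return [sum(1 for other in events if dominates(e, other)) for e in events]
```

Unlike KDE, an event that is extreme in only one dimension dominates few others and therefore scores low, which is exactly why the benign corner-points in Figure 7 escape the alert budget under DAS.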
[10] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.

[11] Nick Johnston. Dropbox users targeted by phishing scam hosted on Dropbox. https://www.symantec.com/connect/blogs/dropbox-users-targeted-phishing-scam-hosted-dropbox, Oct 2014.

[12] Mahmoud Khonji, Youssef Iraqi, and Andrew Jones. Mitigation of spear phishing attacks: A content-based authorship identification framework. In Internet Technology and Secured Transactions (ICITST), 2011 International Conference for, pages 416–421. IEEE, 2011.

[13] Aleksandar Lazarevic, Levent Ertoz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 25–36. SIAM, 2003.

[14] Stevens Le Blond, Cédric Gilbert, Utkarsh Upadhyay, Manuel Gomez Rodriguez, and David Choffnes. A broad view of the ecosystem of socially engineered exploit documents. In NDSS, 2017.

[19] Gianluca Stringhini and Olivier Thonnard. That ain't you: Blocking spearphishing through behavioral modelling. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 78–97. Springer, 2015.

[20] Colin Tankard. Advanced persistent threats and how to monitor and deter them. Network Security, 2011.

[21] Lisa Vaas. How hackers broke into John Podesta, DNC Gmail accounts. https://nakedsecurity.sophos.com/2016/10/25/how-hackers-broke-into-john-podesta-dnc-gmail-accounts/, October 2016.

[22] Colin Whittaker, Brian Ryner, and Marria Nazif. Large-scale automatic classification of phishing pages. In NDSS, volume 10, 2010.

[23] Wikipedia. Server name indication. https://en.wikipedia.org/wiki/Server_Name_Indication, June 2017.

[24] Mengchen Zhao, Bo An, and Christopher Kiekintveld. Optimizing personalized email filtering thresholds to mitigate sequential spear phishing attacks. In AAAI, pages 658–665, 2016.
Table 5: Summary of the feature vector for our name spoofer sub-detector and the "suspiciousness" comparator we provide to DAS for each feature.

Previously unseen attacker
Features                                                      Comparator for DAS
Host age of clicked URL (email ts − domain's 1st visit ts)    ≤
# visits to clicked URL's host prior to email ts              ≤
# days that From name has sent email                          ≤
# days that From addr has sent email                          ≤

Figure 8: Histogram of alerts per RCPT TO address for our detector, using an average budget of 10 alerts per day across the Sep 1, 2013 – Jan 14, 2017 timeframe.
Md Nahid Hossain1 , Sadegh M. Milajerdi2 , Junao Wang1 , Birhanu Eshete2 , Rigel Gjomemo2 ,
R. Sekar1 , Scott D. Stoller1 , and V.N. Venkatakrishnan2
1 Stony Brook University
2 University of Illinois at Chicago
[Figure 1: SLEUTH architecture. Audit streams from Linux, Windows, and FreeBSD feed tag-based dependence graph construction; tag- and policy-based attack detection over the tagged dependence graph raises alarms, which drive root-cause and impact analysis to produce a scenario graph.]
2. Prioritizing entities for analysis: How can we assist faster than on-disk representations, an important factor
the analyst, who is overwhelmed with the volume of in achieving real-time analysis capabilities. In our ex-
data, prioritize and quickly “zoom in” on the most periments, we were able to process 79 hours worth of
likely attack scenario? audit data from a FreeBSD system in 14 seconds, with
a main memory usage of 84MB. This performance rep-
3. Scenario reconstruction: How do we succinctly
resents an analysis rate that is 20K times faster than the
summarize the attack scenario, starting from the at-
rate at which the data was generated.
tacker’s entry point and identifying the impact of the
entire campaign on the system? The second major contribution of this paper is the de-
velopment of a tag-based approach for identifying sub-
4. Dealing with common usage scenarios: How does jects, objects and events that are most likely involved in
one cope with normal, benign activities that may attacks. Tags enable us to prioritize and focus our anal-
resemble activities commonly observed during at- ysis, thereby addressing the second challenge mentioned
tacks, e.g., software downloads? above. Tags encode an assessment of trustworthiness and
5. Fast, interactive reasoning: How can we provide sensitivity of data (i.e., objects) as well as processes (sub-
the analyst with the ability to efficiently reason jects). This assessment is based on data provenance de-
through the data, say, with an alternate hypothesis? rived from audit logs. In this sense, tags derived from
Below, we provide a brief overview of S LEUTH, and audit data are similar to coarse-grain information flow la-
summarize our contributions. S LEUTH assumes that at- bels. Our analysis can naturally support finer-granularity
tacks initially come from outside the enterprise. For ex- tags as well, e.g., fine-grained taint tags [42, 58], if they
ample, an adversary could start the attack by hijacking are available. Tags are described in more detail in Sec-
a web browser through externally supplied malicious in- tion 3, together with their application to attack detection.
put, by plugging in an infected USB memory stick, or A third contribution of this paper is the development of
by supplying a zero-day exploit to a network server run- novel algorithms that leverage tags for root-cause iden-
ning within the enterprise. We assume that the adversary has not implanted persistent malware on the host before SLEUTH started monitoring the system. We also assume that the OS kernel and audit systems are trustworthy.

1.1 Approach Overview and Contributions

Figure 1 provides an overview of our approach. SLEUTH is OS-neutral, and currently supports Microsoft Windows, Linux and FreeBSD. Audit data from these OSes is processed into a platform-neutral graph representation, where vertices represent subjects (processes) and objects (files, sockets), and edges denote audit events (e.g., operations such as read, write, execute, and connect). This graph serves as the basis for attack detection as well as causality analysis and scenario reconstruction.

The first contribution of this paper, which addresses the challenge of efficient event storage and analysis, is the development of a compact main-memory dependence graph representation (Section 2). Graph algorithms on main memory representation can be orders of magnitude

tification and impact analysis (Section 5). Starting from alerts produced by the attack detection component shown in Fig. 1, our backward analysis algorithm follows the dependencies in the graph to identify the sources of the attack. Starting from the sources, we perform a full impact analysis of the actions of the adversary using a forward search. We present several criteria for pruning these searches in order to produce a compact graph. We also present a number of transformations that further simplify this graph and produce a graph that visually captures the attack in a succinct and semantically meaningful way, e.g., the graph in Fig. 4. Experiments show that our tag-based approach is very effective: for instance, SLEUTH can analyze 38.5M events and produce an attack scenario graph with just 130 events, representing five orders of magnitude reduction in event volume.

The fourth contribution of this paper, aimed at tackling the last two challenges mentioned above, is a customizable policy framework (Section 4) for tag initialization and propagation. Our framework comes with sensible
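The dependence graph and the backward/forward searches described above can be pictured with a short sketch. This is a minimal illustration under our own simplifying assumptions, not SLEUTH's actual implementation: the event fields, operation names, and flow-direction rules here are ours.

```python
from collections import defaultdict, deque

class DependenceGraph:
    """Platform-neutral dependence graph: vertices are subjects (processes)
    and objects (files, sockets); edges are audit events."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # source -> [(seq, op, target)]
        self.in_edges = defaultdict(list)   # target -> [(seq, op, source)]

    def add_event(self, seq, subject, op, obj):
        # Illustrative direction rule: writes/sends flow subject -> object;
        # reads, loads, executes, and receives flow object -> subject.
        if op in ("write", "send"):
            src, dst = subject, obj
        else:
            src, dst = obj, subject
        self.out_edges[src].append((seq, op, dst))
        self.in_edges[dst].append((seq, op, src))

    def backward(self, alert_entity):
        """Follow dependencies backward from an alert to candidate sources."""
        seen, work = {alert_entity}, deque([alert_entity])
        while work:
            node = work.popleft()
            for _, _, src in self.in_edges[node]:
                if src not in seen:
                    seen.add(src)
                    work.append(src)
        return seen

    def forward(self, sources):
        """Full impact analysis: everything reachable from the sources."""
        seen, work = set(sources), deque(sources)
        while work:
            node = work.popleft()
            for _, _, dst in self.out_edges[node]:
                if dst not in seen:
                    seen.add(dst)
                    work.append(dst)
        return seen
```

For example, a chain socket → browser → dropped file → dropper process → exfiltration target can be traced back to the socket with `backward(...)`, and the attacker's reach computed with `forward(...)`. Real systems would additionally prune these searches, as the paper describes.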
Instead of focusing on application behaviors that tend to be variable, we focus our detection techniques on the high-level objectives of most attackers, such as backdoor insertion and data exfiltration. Specifically, we combine reasoning about an attacker's motive and means. If

so that reads are treated the same as loads.
4 Binary code injection attacks on today's OSes ultimately involve a call to change the permission of a writable memory page so that it becomes executable. To the extent that such memory permission change operations are included in the audit data, this policy can spot them.
5 Our implementation can identify mprotect operations that occur
Table 3: Dataset for each campaign with duration, distribution of different system calls and total number of events.
paigns carried out by a red team as part of the DARPA Transparent Computing (TC) program. This set spans a period of 358 hours, and contains about 73 million events. The last row corresponds to benign data collected over a period of 3 to 5 days across four Linux servers in our research laboratory.

Attack data sets were collected on Windows (W-1 and W-2), Linux (L-1 through L-3) and FreeBSD (F-1 through F-3) by three research teams that are also part of the DARPA TC program. The goal of these research teams is to provide fine-grained provenance information that goes far beyond what is found in typical audit data. However, at the time of the evaluation, these advanced features had not been implemented in the Windows and FreeBSD data sets. The Linux data set did incorporate finer-granularity provenance (using the unit abstraction developed in [31]), but the implementation was not mature enough to provide consistent results in our tests. For this reason, we omitted any fine-grained provenance included in their dataset, falling back to the data they collected from the built-in auditing system of Linux. The FreeBSD team built their capabilities over DTrace. Their data also corresponded to roughly the same level as Linux audit logs. The Windows team's data was roughly at the level of Windows event logs. All of the teams converted their data into a common representation to facilitate analysis.

The "duration" column in Table 3 refers to the length of time for which audit data was emitted from a host. Note that this period covers both benign activities and attack-related activities on a host. The next several columns provide a breakdown of audit log events into different types of operations. File open and close operations were not included in the W-1 and W-2 data sets. Note that the "read" and "write" columns include not only file reads/writes, but also network reads and writes on Linux. However, on Windows, only file reads and writes were reported. Operations to load libraries were reported on Windows, but memory mapping operations weren't. On Linux and FreeBSD, there are no load operations, but most of the mmap calls are related to loading. So, the mmap count is a loose approximation of the number of loads on these two OSes. The "Others" column includes all the remaining audit operations, including rename, link, rm, unlink, chmod, setuid, and so on. The last column in the table identifies the scenario graph constructed by SLEUTH for each campaign. Due to space limitations, we have omitted scenario graphs for campaign L-2.

6.3 Engagement Setup

The attack scenarios in our evaluation are set up as follows. Five of the campaigns (i.e., W-2, L-2, L-3, F-2, and F-3) ran in parallel for 4 days, while the remaining three (W-1, L-1, and F-1) were run in parallel for 2 days. During each campaign, the red team carried out a series of attacks on the target hosts. The campaigns are aimed at achieving varying adversarial objectives, which include dropping and execution of an executable, gathering intelligence about a target host, backdoor injection, privilege escalation, and data exfiltration.

Being an adversarial engagement, we had no prior knowledge of the attacks planned by the red team. We were only told the broad range of attacker objectives described in the previous paragraph. It is worth noting that, while the red team was carrying out attacks on the target hosts, benign background activities were also being carried out on the hosts. These include activities such as browsing and downloading files, reading and writing emails, document processing, and so on. On average, more than 99.9% of the events corresponded to benign activity. Hence, SLEUTH had to automatically detect and reconstruct the attacks from a set of events including both benign and malicious activities.

We present our results in comparison with the ground truth data released by the red team. Before the release of ground truth data, we had to provide a report of our findings to the red team. The findings we report in this paper match the findings we submitted to the red team. A summary of our detection and reconstruction results is provided in tabular form in Table 7. Below, we first present reconstructed scenarios for selected datasets before proceeding to a discussion of these summary results.
[Fig. 4: Scenario graph reconstructed from campaign F-3; processes include sshd, scp, bsdtar, dropbearkey, and dropbear.]
6.4 Selected Reconstruction Results

Of the 8 attack scenarios successfully reconstructed by SLEUTH, we discuss campaigns W-2 (Windows) and F-3 (FreeBSD) in this section, while deferring the rest to Section 6.10. To make it easier to follow the scenario graph, we provide a narrative that explains how the attack unfolded. This narrative requires manual interpretation of the graph, but the graph generation itself is automated. In these graphs, edge labels include the event name and a sequence number that indicates the global order in which that event was performed. Ovals, diamonds and rectangles represent processes, sockets and files, respectively.

Campaign W-2. Figure 5 shows the graph reconstructed by SLEUTH from Windows audit data. Although the actual attack campaign lasted half an hour, the host was running benign background activities for 20 hours. These background activities corresponded to more than 99.8% of the events in the corresponding audit log.

Entry: The initial entry point for the attack is Firefox, which is compromised on visiting the web server 129.55.12.167.

Backdoor insertion: Once Firefox is compromised, a malicious program called dropper is downloaded and executed. Dropper seems to provide a remote interactive shell, connecting to ports 443 and then 4430 on the attack host, and executing received commands using cmd.exe.

Intelligence gathering: Dropper then invokes cmd.exe multiple times, using it to perform various data gathering tasks. The programs whoami, hostname and netstat are being used as stand-ins for these data gathering applications. The collected data is written to C:\Users\User1\Documents\Thumbs\thumbit\test\thumbs.db.

Data exfiltration: Then the collected intelligence is exfiltrated to 129.55.12.51:9418 using git.

Clean-up: Dropper downloads a batch file called burnout.bat. This file contains commands to clean up the attack footprint, which are executed by cmd.exe (see edges 11, 12, 31-33).

Campaign F-3. (Figure 4). Under the command of an attacker who uses stolen ssh credentials, sshd forks a bash process. Note that though there is no direct evidence from the audit data about the stolen ssh credentials, because of the subsequent events (scp) from this shell, we conclude this as a sign of an attacker that uses stolen ssh credentials.

[Fig. 5: Scenario graph reconstructed from campaign W-2.]
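The drawing conventions described above (ovals for processes, diamonds for sockets, rectangles for files, and edge labels carrying an event name plus a global sequence number) map naturally onto Graphviz. A minimal sketch of emitting such a scenario graph in DOT format; the function name and example entities are illustrative assumptions, not part of SLEUTH:

```python
def to_dot(entities, events):
    """entities: {name: kind}, kind in {"process", "socket", "file"};
    events: [(seq, op, src, dst)] listed in global order."""
    shape = {"process": "oval", "socket": "diamond", "file": "box"}
    lines = ["digraph scenario {"]
    for name, kind in entities.items():
        # Node shape encodes the entity kind, as in the paper's figures.
        lines.append(f'  "{name}" [shape={shape[kind]}];')
    for seq, op, src, dst in events:
        # Edge label: global sequence number followed by the event name.
        lines.append(f'  "{src}" -> "{dst}" [label="{seq}. {op}"];')
    lines.append("}")
    return "\n".join(lines)
```

Feeding the resulting string to `dot -Tpdf` would render a graph in the same visual style as Figs. 4 and 5 (node names containing backslashes would additionally need DOT escaping).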
Table 8: False alarms in a benign environment with software upgrades and updates. No alerts were triggered during this period.

team, we incorrectly identified 21 entities in F-1 that were not part of an attack. Subsequent investigation showed that the auditing system had not been shut down at the end of the F-1 campaign, and all of these false positives correspond to testing/administration steps carried out after the end of the engagement, when the auditing system should not have been running.

6.6 False Alarms in a Benign Environment

In order to study SLEUTH's performance in a benign environment, we collected audit data from four Ubuntu Linux servers over a period of 3 to 5 days. One of these is a mail server, another is a web server, and a third is an NFS/SSH/SVN server. Our focus was on software updates and upgrades during this period, since these updates can download code from the network, thereby raising the possibility of untrusted code execution alarms. There were four security updates (including kernel updates) performed over this period. In addition, on a fourth server, we collected data when a software upgrade was performed, resulting in changes to 110 packages. Several thousand binary and script files were updated during this period, and the audit logs contained over 30M events. All of this information is summarized in Table 8.

As noted before, policies should be configured to permit software updates and upgrades using standard means approved in an enterprise. For Ubuntu Linux, we had one policy rule for this: when dpkg was executed by apt-commands, or by unattended-upgrades, the process is not downgraded even when reading from files with untrusted labels. This is because both apt and unattended-upgrades verify and authenticate the hash on the downloaded packages, and only after these verifications do they invoke dpkg to extract the contents and write to various directories containing binaries and libraries. Because of this policy, all of the 10K+ files downloaded were marked benign. As a result, no alarms were generated from their execution by SLEUTH.

6.7 Runtime and Memory Use

Table 9 shows the runtime and memory used by SLEUTH for analyzing various scenarios. The measurements were made on an Ubuntu 16.04 server with a 2.8GHz AMD Opteron 62xx processor and 48GB main memory. Only a single core of a single processor was used. The first column shows the campaign name, while the second shows the total duration of the data set.

Dataset   Duration    Memory    Runtime    Speed-up
L-1       07:59:26     26 MB      8.71 s      3.3 K
L-2       79:06:39    329 MB    114.14 s      2.5 K
L-3       79:05:13    175 MB     74.14 s      3.9 K
L-Mean                177 MB                  3.2 K
F-1       08:17:30      8 MB      1.86 s     16 K
F-2       78:56:48     84 MB     14.02 s     20.2 K
F-3       79:04:54     95 MB     15.75 s     18.1 K
F-Mean               62.3 MB                 18.1 K

Table 9: Memory use and runtime for scenario reconstruction.

The third column shows the memory used for the dependence graph. As described in Section 2, we have designed a main memory representation that is very compact. This compact representation enables SLEUTH to store data spanning very long periods of time. As an example, consider campaign L-2, whose data were the most dense. SLEUTH used approximately 329MB to store 38.5M events spanning about 3.5 days. Across all data sets, SLEUTH needed about 8 bytes of memory per event on the larger data sets, and about 20 bytes per event on the smaller data sets.

The fourth column shows the total run time, including the times for consuming the dataset, constructing the dependence graph, detecting attacks, and reconstructing the scenario. We note that this time was measured after the engagement, when all the data sets were available. During the engagement, SLEUTH was consuming these data as they were being produced. Although the data typically covers a duration of several hours to a few days, the analysis itself is very fast, taking just seconds to a couple of minutes. Because of our use of tags, most information needed for the analysis is locally available. This is the principal reason for the performance we achieve.

The "speed-up" column illustrates the performance benefits of SLEUTH. It can be thought of as the number of simultaneous data streams that could be handled by SLEUTH, if CPU use were the only constraint.

In summary, SLEUTH is able to consume and analyze COTS audit data from several OSes in real time while having a small memory footprint.

6.8 Benefit of split tags for code and data

As described earlier, we maintain two trustworthiness tags for each subject, one corresponding to its code, and another corresponding to its data. By prioritizing detection and forward analysis on code trustworthiness, we cut down vast numbers of alarms, while greatly decreasing
the size of forward analysis output.

Table 10 shows the difference between the number of alarms generated by our four detection policies with a single trustworthiness tag and with the split trustworthiness (code and integrity) tags. Note that the split reduces the alarms by a factor of 100 to over 1000 in some cases.

Table 11 shows the improvement achieved in forward analysis as a result of this split. In particular, the increased selectivity reported in column 5 of this table comes from splitting the tag. Note that often, there is a 100x to 1000x reduction in the size of the graph.

6.9 Analysis Selectivity

Table 11 shows the data reduction pipeline of the analyses in SLEUTH. The second column shows the number of original events in each campaign. These events include all the events in the system (benign and malicious) over several days, with an overwhelming majority having a benign nature, unrelated to the attack.

The third column shows the final number of events that go into the attack scenario graph.

The fourth column shows the reduction factor when a naive forward analysis with a single trustworthiness tag (single t-tag) is used from the entry points identified by our backward analysis. Note that the graph size is very large in most cases. The fifth column shows the reduction factor using the forward analysis of SLEUTH, which is based on split (code and data) trustworthiness tags. As can be seen from the table, SLEUTH achieved two to three orders of magnitude reduction with respect to single t-tag based analysis.

The output of forward analysis is then fed into the simplification engine. The sixth column shows the reduction factor achieved by the simplifications over the output of our forward analysis. The last column shows the overall reduction we get over the original events using split (code and data) trustworthiness tags and performing the simplification.

Overall, the combined effect of all of these steps is very substantial: data sets consisting of tens of millions of edges are reduced into graphs with perhaps a hundred edges, representing five orders of magnitude reduction in the case of the L-2 and L-3 data sets, and four orders of magnitude reduction on other data.

                  Initial    Final           Reduction Factor
Dataset           # of       # of     Single   Split   SLEUTH     Total
                  Events     Events   t-tag    t-tag   Simplif.
W-1               100 K      51       4.4x     1394x   1.4x       1951x
W-2               401 K      28       3.6x     552x    26x        14352x
L-1               2.68 M     36       8.9x     15931x  4.7x       74875x
L-2               38.5 M     130      7.3x     2971x   100x       297100x
L-3               19.3 M     45       7.6x     1208x   356x       430048x
F-1               701 K      45       2.3x     376x    41x        15416x
F-2               5.86 M     39       1.9x     689x    218x       150202x
F-3               5.68 M     45       6.7x     740x    170x       125800x
Average Reduction                     4.68x    1305x   41.8x      54517x

[Figure 12: Scenario graph reconstructed from attack L-1.]
[Figure 13: Scenario graph reconstructed from attack F-1.]
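The effect of split tags discussed in Sections 6.8 and 6.9 can be illustrated with a small sketch. The tag values and update rules below are simplifying assumptions of ours (SLEUTH's actual policy framework is richer): reads lower only the data tag, execute/load operations lower the code tag, and alarms are prioritized on code trustworthiness.

```python
from dataclasses import dataclass

@dataclass
class Subject:
    code_tag: str = "benign"   # trustworthiness of the code the process runs
    data_tag: str = "benign"   # trustworthiness of the data it has consumed

def on_event(subj, op, obj_tag):
    """Illustrative propagation rule for one audit event."""
    if op in ("execute", "load"):
        if obj_tag == "untrusted":
            # Running untrusted code lowers the code tag (the strong signal).
            subj.code_tag = "untrusted"
            subj.data_tag = "untrusted"
    elif op == "read":
        if obj_tag == "untrusted":
            # Merely reading untrusted data lowers only the data tag, so a
            # benign process parsing untrusted input does not, by itself,
            # raise a code-level alarm.
            subj.data_tag = "untrusted"

def raises_alarm(subj):
    # Detection and forward analysis are prioritized on code trustworthiness.
    return subj.code_tag == "untrusted"
```

With a single trustworthiness tag, the `read` case above would already flag the subject; keeping the two tags separate is what eliminates those alarms while still catching untrusted-code execution.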
6.10 Discussion of Additional Attacks

In this section, we provide graphs that reconstruct attack campaigns that weren't discussed in Section 6.4. Specifically, we discuss attacks L-1, F-1, F-2, W-1, and L-3.

Attack L-1. In this attack (Figure 12), firefox is exploited to drop and execute via a shell the file mozillanightly. The process mozillanightly first downloads and executes mozillaautoup, then starts a shell, which spawns several other processes. Next, the information gathered in file netrecon.log is exfiltrated and the file removed.

Attack F-1. In this attack (Figure 13), the nginx server is exploited to drop and execute via shell the file dropper. Upon execution, the dropper process forks a shell that spawns several processes, which write to a file and read and write to sensitive files. In addition, dropper communicates with the IP of the attacker. We report in the figure the graph related to the restoration and administration carried out after the engagement, as discussed in Section 6.5.

Attack F-2. The start of this attack (Figure 14) is similar to F-1. However, upon execution, the dropper process downloads three files named recon, sysman, and mailman. Later, these files are executed and used to exfiltrate data gathered from the system.

Attack W-1. In this attack (Figure 15), firefox is exploited twice to drop and execute a file mozillanightly. The first mozillanightly process downloads and executes the file photosnap.exe, which takes a screenshot of the victim's screen and saves it to a png file. Subsequently, the image file is exfiltrated by mozillanightly. The second mozillanightly process downloads and executes two files: 1) burnout.bat, which is read, and later used to issue commands to cmd.exe to gather data about the system; 2) mnsend.exe, which is executed by cmd.exe to exfiltrate the data gathered previously.

Attack L-3. In this attack (Figure 16), the file dropbearLINUX.tar is downloaded and extracted. Next, the program dropbearkey is executed to create three keys, which are read by a program dropbear, which subsequently performs exfiltration.

7 Related Work

In this section, we compare SLEUTH with efforts from academia and open source industry tools. We omit comparison to proprietary products from the industry as there is scarce technical documentation available for an in-depth comparison.

Provenance tracking and Forensics. Several logging and provenance tracking systems have been built to monitor the activities of a system [21, 41, 23, 22, 13, 45, 9] and build provenance graphs. Among these, Backtracker [25, 26] is one of the first works that used dependence graphs to trace back to the root causes of intrusions. These graphs are built by correlating events collected by a logging system and by determining the causality among system entities, to help in forensic analysis after an attack is detected.

SLEUTH improves on the techniques of Backtracker in two important ways. First, Backtracker was meant to operate in a forensic setting, whereas our analysis and data representation techniques are designed towards real-time detection. Setting aside hardware comparisons, we note that Backtracker took 3 hours for analyzing au-
[Figure 14: Scenario graph reconstructed from attack F-2.]
[Figure 15: Scenario graph reconstructed from attack W-1.]
dit data from a 24-hour period, whereas SLEUTH was able to process 358 hours of logs in a little less than 3 minutes. Secondly, Backtracker relies on alarms generated by external tools; therefore its forensic search and pruning cannot leverage the reasons that generated those alarms. In contrast, our analysis procedures leverage the results from our principled tag-based detection methods and therefore are inherently more precise. For example, if an attack deliberately writes into a well-known log file, Backtracker's search heuristics may remove the log file from the final graph, whereas our tag-based analysis will prevent that node from being pruned away.

In a similar spirit, BEEP [31] and its evolution ProTracer [37] build dependence graphs that are used for forensic analysis. In contrast, SLEUTH builds dependence graphs for real-time detection, from which scenario subgraphs are extracted during a forensic analysis. The forensic analysis of [31, 37] ensures more precision than Backtracker [25] by heuristically dividing the execution of the program into execution units, where each unit represents one iteration of the main loop in the program. The instrumentation required to produce units is not always automated, making the scalability of their approach a challenge. SLEUTH can make use of the additional precision afforded by [31] in real-time detection, when such information is available.

While the majority of the aforementioned systems operate at the system call level, several other systems track information flows at finer granularities [24, 8, 31]. They typically instrument applications (e.g., using Pin [35]) to
[Figure 16: Scenario graph reconstructed from attack L-3.]
track information flows through a program. Such fine-grained tainting can provide much more precise provenance information, at the cost of higher overhead. Our approach can take advantage of finer-granularity provenance, when available, to further improve accuracy.

Attack Detection. A number of recent research efforts on attack detection/prevention focus on "inline" techniques that are incorporated into the protected system, e.g., address space randomization, control-flow integrity, taint-based defenses, and so on. Offline intrusion detection using logs has been studied for a much longer period [15, 36, 19]. In particular, host-based IDS using system-call monitoring and/or audit logs has been investigated by numerous research efforts [57, 32, 47, 55, 18, 29].

Host-based intrusion detection techniques mainly fall into three categories: (1) misuse-based, which rely on specifications of bad behaviors associated with known attacks; (2) anomaly-based [19, 32, 47, 20, 30, 11, 48], which rely on learning a model of benign behavior and detecting deviations from this behavior; and (3) specification-based [27, 54], which rely on specifications (or policies) specified by an expert. The main drawback of misuse-based techniques is that their signature-based approach is not amenable to detection of previously unseen attacks. Anomaly detection techniques avoid this drawback, but their false positive rates deter widespread deployment. Specification/policy-based techniques can reduce these false positives, but they require application-specific policies that are time-consuming to develop and/or rely on expert knowledge. Unlike these approaches, SLEUTH relies on application-independent policies. We develop such policies by exploiting provenance information computed from audit data. In particular, an audit event

Information Flow Control (IFC). IFC techniques assign security labels and propagate them in a manner similar to our tags. Early works, such as Bell-LaPadula [10] and Biba [12], relied on strict policies. These strict policies impact usability and hence have not found favor among contemporary OSes. Although IFC is available in SELinux [34], it is not often used, as users prefer its access control framework based on domain-and-type enforcement. While most of the above works centralize IFC, decentralized IFC (DIFC) techniques [59, 17, 28] emphasize the ability of principals to define and create new labels. This flexibility comes with the cost of nontrivial changes to application and/or OS code.

Although our tags are conceptually similar to those in IFC systems, the central research challenges faced in these systems are very different from those in SLEUTH. In par-
In this section, we provide specific background on the Panama Papers project (unless otherwise noted, details here are sourced from [12], published by ICIJ). Additional related work is discussed in Section 7.

The International Consortium of Investigative Journalists (ICIJ) is a non-profit, selective-membership organization founded in 1997. Comprised of just under 200 investigative journalists in more than 65 countries, since 2012 ICIJ has obtained several caches of leaked documents that have led to collaborative investigations across news organizations around the world (e.g., [28-30]). Yet, in the words of one ICIJ staffer interviewed for this paper, the Panama Papers project [31], which lasted from approximately May 2015 to April 2016, was where the organization's collaborative and analytical systems "all came together."

Consisting of over 11.5 million documents in dozens of formats occupying 2.6 TB of disk space, the Panama

All survey participants are investigative journalists who actively participated in the Panama Papers project. All interview participants currently work full-time for the ICIJ and/or had a significant role in determining the security features and requirements of the collaboration systems used throughout the project by the journalists surveyed. In the results presented here, participants completed either a survey or an interview.

Survey. Survey participants were 118 journalists working in 58 different countries, representing every continent except Antarctica. No other demographic data was collected. This sample represents approximately 33% (118 of 354) of all non-ICIJ staff who worked on the project.

Interview. ICIJ consists of only twelve full-time employees. For this study we interviewed all five of the ICIJ personnel with significant editorial or technical input on the systems used during the Panama Papers project. Interview participants were two technical and two editorial management staff of ICIJ, as well as the journalist who received the original Panama Papers materials and worked closely with ICIJ on the system requirements. Of these five participants, two were women and three were men. To maximize the insight gained from these interviews, we designed the interview script using information from a careful review of public information available about the systems (e.g., [10, 36]), as well as insight from an IRB-approved background (pilot) interview with an individual member of the Panama Papers project who had intimate knowledge of the systems involved. The team then collected and iteratively refined the major themes for the interviews, customizing their content based on the individual's primary (self-identified) role in the project as either an editorial (E) or technical (IT) leader.

Table 1: Familiarity with and Usage of Security Practices Prior to Project (N=118). Scale items were "Never heard of it before" (Unaware); "Knew about it, hadn't used" (Never); "Had used a few times" (Few); "Used occasionally" (Occasionally); and "Used frequently" (Frequently).

3.2 Materials

Materials consisted of a survey and two interview scripts, described here and reproduced in Appendices A and B.

Survey Instrument. The survey was created by ICIJ to investigate collaborating journalists' use of the Blacklight, I-Hub, and Linkurious systems used during the Panama Papers project, as well as their experiences with the security of these systems. In this paper, we focus on the 10 survey questions related to the use and security of the systems provided to journalists by ICIJ (see Appendix A). In addition to these security-related questions, the survey also captured information about the value to journalists of other services provided by ICIJ.

Interview Scripts. We created two distinct, but mostly overlapping, interview scripts for the editorial and technical interview participants. Topics for both groups included questions about the participants' background, their experience with the overall system, system functionality, any training they offered as part of the project, any breaches or failures they were aware of, and the potential scalability of the system. Additionally, we asked editors about how they selected and recruited journalists for project participation. Please see Appendix B for the complete interview scripts.

3.3 Procedure

Survey. The survey was conducted between July 28th and August 15th, 2016 by the ICIJ. Participants completed the survey via a Google form; it took around 10 minutes to complete. Participants could choose to answer the survey anonymously or provide their name if they wished. ICIJ provided us with the survey responses as a de-identified dataset. Participants were not provided an incentive to take the survey.

Interview. We interviewed participants between December 2016 and January 2017. Interviews typically lasted about one hour and were conducted via telephone/online video/voice conference (four), with one taking place in person. All participants spoke fluent English and were interviewed in English. Participants were not provided an incentive to participate in the interview.

3.4 Data Preparation and Analysis

Once all interviews were complete, we transcribed the audio recordings, producing 96 pages of text. Using an inductive process, we completed an initial round of qualitative coding to identify key themes, as substantive categories emerged from the data via grounded theory analysis [19]. These themes were then evaluated and refined through group discussion among all researchers, with a goal of capturing the core variables constituting our participants' experiences.

3.5 Ethical Considerations

Our entire protocol was IRB approved. Furthermore, because of the sensitive nature of our interview topic, we took extra precautions to maintain the privacy and anonymity of research participants. We explicitly did not request information about or publish details about security protocols that could compromise source identities, sensitive information, or future work.

All interview participants agreed to be audio recorded during the interview and answered all of the questions in the interview script. We stored and transmitted audio recordings only in encrypted form and used de-identified transcripts for the majority of the data analysis.
4.1.4 Frequency of Use of ICIJ Technologies

In order to assess how well contributors' reported usefulness of the ICIJ systems matched their actual behaviors, we analyzed survey data on how frequently journalists used Blacklight, I-Hub, and Linkurious (see Figures 1-3). These results are summarized in Table 4.

Table 4: Frequency of Use of ICIJ Technologies (N=118). Frequency of use during the three months preceding publication; "monthly" includes responses "every now and then."

Service      Never  Daily  Weekly  Monthly
Blacklight     0%    64%     33%       3%
I-Hub          3%    41%     48%       8%
Linkurious    19%     4%     45%      31%

The majority of respondents (64%) indicated that they used Blacklight—the document-search platform where contributing journalists could access the leaked documents—at least daily during the three months prior to the project launch date in April 2016. One third (33%) used Blacklight at least weekly, and only 3% used it monthly.

The vast majority of respondents (89%) used I-Hub—the collaboration and communication platform with forum and chat features—daily or weekly. Only 8% used I-Hub monthly, while just 3% reported never having used it.

By contrast, a significant portion (19%) of respondents indicated they had never used Linkurious—the system that provided visual graphs of the relationships between entities mentioned in the leaked documents. About a third (31%) said they used it monthly, and nearly half said they used it weekly. Only 4% used it daily.

4.1.5 Collaboration Outside Home Organization

A key objective for ICIJ in facilitating the Panama Papers project was to encourage inter-organizational collaboration among participating journalists, to maximize the quality and impact of the resulting publications. The degree of collaboration therefore offers insight into both the utility and usability of these systems. Given the global distribution of the journalist-contributors, collaborative data management strategies like using local-only servers or in-person meetings were not feasible. These circumstances therefore also gave rise to specific technical security requirements for inter-organizational collaboration.

Survey participants were asked to rate how much they collaborated outside of their own organization on a scale of 1-7 (where 1 indicated "I worked independently" and 7 indicated "I've collaborated more than ever"). Nearly one third (32%) of participants indicated they had collaborated with journalists outside their organization "more than ever" during the Panama Papers project, and the vast majority of participants (74%) responded on the positive side of the scale (5, 6, or 7), with a mean rating of 5.33. Only 13% indicated lower levels of inter-organizational collaboration by responding on the negative side of the scale (1, 2, or 3). This data is summarized in Table 5.

4.1.6 Contributor Suggestions about Security

The survey data we analyzed also included one open-ended question: "Do you have any suggestions or comments about the digital security tools and requirements for this project?" Fifty-seven contributors offered open-ended feedback. While the themes of these comments varied, the most frequent theme was a feature request (14% total, of which more than half were requests for additional security features). The second most common themes were compliments (13%), statements affirming the need for security (5%), and requests for additional training (4%). Notably, only 3% of comments described the project's security requirements as a barrier to work. For example, several participants (5) specifically mentioned issues around phone-based authentication.

    The Google Authentificator [sic] tool... when I changed my phone (twice during the investigation) I had to communicate with the support team to reboot the passwords. (J118)

    At certain times security turned into a barrier into getting more done... Every time a cellphone died or went missing (frequently) I needed to reconfigure authentication. (J68)

However, another participant noted that while security was a barrier, it was worth the slow-down:

    It's always a pain and even slowed us down. But this work is important and anything to keep it secure is fine. (J78)

Finally, others explicitly called out the need for security and even praised ICIJ's focus on it:

    I like the fact that ICIJ considers security as a
    priority. Maybe ICIJ can explore other ways to
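The derived figures quoted in Section 4.1.4 (e.g., 89% of respondents using I-Hub daily or weekly) follow directly from the rows of Table 4. A minimal consistency check, with the percentages hard-coded from the table (the Linkurious row sums to 99% due to rounding):

```c
#include <assert.h>

/* Rows of Table 4 (percent of N=118 respondents). */
struct usage { const char *service; int never, daily, weekly, monthly; };

static const struct usage TABLE4[] = {
    { "Blacklight",  0, 64, 33,  3 },
    { "I-Hub",       3, 41, 48,  8 },
    { "Linkurious", 19,  4, 45, 31 },
};

/* Each row should cover (about) all respondents, modulo rounding. */
static int row_total(const struct usage *u) {
    return u->never + u->daily + u->weekly + u->monthly;
}

/* The "daily or weekly" figure quoted in the text for I-Hub. */
static int daily_or_weekly(const struct usage *u) {
    return u->daily + u->weekly;
}
```

For I-Hub, `daily_or_weekly` gives 41 + 48 = 89, matching the 89% stated in the text.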
Jaehyuk Lee† Jinsoo Jang† Yeongjin Jang⋆ Nohyun Kwak† Yeseul Choi† Changho Choi†
Taesoo Kim⋆ Marcus Peinado∤ Brent Byunghoon Kang†
(Figure: enclave memory layout—code region 0xF7500000-0xF752b000 (r-x), data regions (rw-), and the planted non-executable probe addresses uint64_t PF_R[10] = {0xF7741000, 0xF7742000, 0xF7743000, 0xF7744000, ...}—together with the AEX_handler(unsigned long CR2, pt_regs *regs) hook in the page fault handler, which marks a probed address as CRASH or as a gadget, and the step "③ Return to non-executable area".)

attack stack as in Figure 3. In essence, by exploiting the memory corruption bug, we set the return address to be the address that we want to probe whether it is a pop gadget or not. The probing will scan through the entire executable address space of the enclave memory. At the same time, we put several non-executable addresses, all of which reside in the address space of the enclave, on the
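The classification performed by the AEX hook in the page fault handler can be sketched as follows. This is a loose reconstruction of the figure's logic, not the authors' code: the flag names follow the Linux x86 page-fault error-code bits, and the exact marker values (e.g., the value left in rax) are illustrative assumptions.

```c
#include <stdint.h>

/* x86 page-fault error-code bits, as defined by the Linux kernel. */
#define PF_PROT  0x1UL   /* protection violation on a present page */
#define PF_USER  0x4UL   /* fault originated in user mode */
#define PF_INSTR 0x10UL  /* fault on an instruction fetch */

enum probe_result { CRASH, POP_GADGET_CANDIDATE, EEXIT_DETECTED };

/* Sketch of the AEX page-fault hook: classify an enclave exit by the
 * faulting address (CR2), the error code, and the rax value left by
 * the ROP chain. rax == 0x4 is the EEXIT leaf index; a user-mode
 * instruction-fetch fault at the bogus exit address means ENCLU/EEXIT
 * really executed. */
enum probe_result classify_aex(uint64_t cr2, uint64_t error_code,
                               uint64_t rax)
{
    if (rax == 0x4 && (error_code & PF_USER) && (error_code & PF_INSTR))
        return EEXIT_DETECTED;
    if (cr2 == 0)
        return CRASH;              /* probed address was not a gadget */
    return POP_GADGET_CANDIDATE;   /* fault at a planted marker address */
}
```

The key observable is the pair (error code, rax): only a genuine EEXIT produces a user-mode instruction fetch at the attacker-chosen exit address with rax still holding the leaf index.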
Figure 4: An overview of searching for an ENCLU gadget and the behavior of EEXIT. (1) The attacker chains multiple pop gadgets found in Figure 3, as many as possible, and puts the value 0x4 on the stack once per pop in the gadget. (2) If the probing address (the last return address) contains the ENCLU instruction, then it will invoke EEXIT and jump to the address specified in rbx (i.e., 0x4 because of the pop gadgets). (3) The execution of EEXIT generates a page fault because the exit address in rbx (0x4) does not belong to a valid address region. (4) At the page fault handler, the attacker can be notified that EEXIT was invoked by examining the error code and the value of the rax register. The error code indicates the cause of the page fault; in this case, the page fault is generated by using the invalid address 0x4 as the jump address (i.e., the value of the rbx register). So if the error code contains the flags PF_PROT (un-allocated), PF_USER (userspace memory), and PF_INSTR (fault on execution), and the value of rax is 0x4 (the value for the EEXIT leaf function), then the attacker can assume the probed address is where the ENCLU instruction is located.

Figure 5: An overview of finding the memcpy() gadget. (1) The attacker exploits a memory corruption bug inside the enclave and overwrites the stack with a gadget chain. (2) The gadgets in the chain set the arguments (rdi, rsi, rdx): the destination address (0x80000000) in rdi, the source address (0x75000000) in rsi, and the size (0x8) in rdx, to discover the memcpy() gadget. (3) On the probing, if the final return address points to the memcpy() gadget, then it will copy the 8 bytes of enclave code (0xf7500000) to the pre-allocated address in application memory (0x80000000), which was initialized with all zeros. (4) To check if the memcpy() gadget was found, the attacker (application code) compares the contents of the memory region (0x80000000) with zero after each probing. Any non-zero value in the compared area indicates the discovery of the memcpy() gadget.

the attacker can identify the values of the registers that were changed by the pop gadget that is executed prior to EEXIT. This helps the attacker identify the pop gadgets among the candidates and the registers that are popped by the gadgets.

To build this oracle, we need to find the ENCLU instruction first, because the EEXIT leaf function can only be invoked by that instruction, by supplying the index 0x4 in the rax register. Then, at the EEXIT handler, we identify the pop gadgets and the registers popped by each gadget. To find the ENCLU instruction, we take the following strategy. First, for all of the pop gadget candidates, we set them as return addresses of a ROP chain. Second, we put 0x4, the index of the EEXIT leaf function, as the value to be popped by those gadgets. For example, if the gadget has three pops, we put the same number (three) of 0x4 values on the stack right after the gadget address. Finally, we put the address to scan at the end, to probe whether that address is an ENCLU gadget.

The mechanism behind the scenes is as follows. The value 0x4 is the index of the EEXIT leaf function, and the register we aim to change is rax, because it is the selector of the EEXIT leaf function. Across the combinations of pop gadget candidates and probing addresses, the enclave will trigger EEXIT whenever a gadget that changes rax and the address of an ENCLU instruction sit on the stack together. The attacker can catch this with a SIGSEGV handler, because the return address of EEXIT (stored in the rbx register) is invalid and will therefore generate an exception. If the handler is invoked and the value of rax is 0x4, then the return address placed at the end of the attack stack points to the ENCLU instruction.

After we find the method to invoke EEXIT, we exploit the EEXIT gadget to identify which registers are popped by the pop gadget. This is possible because, unlike AEX, the processor will not automatically clear the register values on running the EEXIT leaf function. Thus, if we put a pop gadget, and put some distinguishable values as its items to
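The probe-stack construction described above—the candidate pop gadget's address, one 0x4 value per pop, then the address being probed—can be sketched as follows. This is a simulation of the stack layout only; the addresses used in the usage below are illustrative, not taken from a real enclave.

```c
#include <stdint.h>
#include <stddef.h>

/* Build the fake stack for one ENCLU probe: the candidate pop gadget,
 * n_pops copies of 0x4 (the EEXIT leaf index, so rax becomes 0x4 if
 * the candidate pops rax), and finally the address being probed for
 * ENCLU. Returns the number of 8-byte slots written. */
size_t build_enclu_probe(uint64_t *stack, uint64_t gadget_addr,
                         size_t n_pops, uint64_t probe_addr)
{
    size_t i = 0;
    stack[i++] = gadget_addr;      /* overwritten return address */
    for (size_t p = 0; p < n_pops; p++)
        stack[i++] = 0x4;          /* popped into the candidate's registers */
    stack[i++] = probe_addr;       /* "return" into the probed address */
    return i;
}
```

A candidate with three pops thus yields a five-slot chain, matching the "three 0x4 values right after the gadget address" construction in the text.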
the remote attestation of the enclave and establishing a secure communication channel between the remote server and the enclave. First, (1) the untrusted part of the application deployed by an Independent Software Vendor (ISV, i.e., software distributor), called the untrusted program isv_app, launches the enclave program (we call this trusted program isv_enclave). On launching isv_enclave, isv_app requests the generation of an Elliptic-Curve Diffie-Hellman (ECDH) public/private key pair from the enclave. The ECDH key pair will be used for sharing a secret with the remote server. Then, isv_enclave generates the key pair, securely stores the private key in the enclave memory, and returns the public key to isv_app. This public key will be sent to the remote server for later use in sharing the secret for establishing a secure communication channel.

Second, on receiving the "hello" message from isv_enclave, (2) the remote server generates its own ECDH key pair that the server will use.

Third, (3) the server sends a quote request to isv_app, to verify that the public key the server received is from isv_enclave. Also, the server sends back its own public key to isv_enclave. To process the request, isv_app will invoke the function named Compute_DH_Key in isv_enclave to generate the shared secret and the measurement report (we refer to this as REPORT). It contains the ECDH public key that isv_enclave uses as one of the parameters to bind the public key with the REPORT. Inside the enclave, isv_enclave uses the EREPORT leaf function to generate the REPORT. On calling the leaf function, isv_enclave sets the REPORTDATA, an object passed as an argument to the EREPORT leaf function, to bind the generated ECDH public key to the REPORT. After isv_enclave generates the REPORT, the untrusted isv_app delivers it to a Quoting Enclave (QE), a separate (trusted) enclave that verifies the REPORT and then signs it securely with Intel EPID. As a result, the REPORT generated by isv_enclave contains the information for the ECDH public key that the enclave uses, and this information is signed by the QE.

Fourth, (4) the signed REPORT will be delivered to the remote server. The remote server can ensure that isv_enclave runs correctly at the client side, and then use the ECDH public key received at step (1) if the signed REPORT verifies correctly.

Finally, (5) the server runs Compute_DH_Key to generate the shared secret, and the remote server and isv_enclave can communicate securely because they securely shared the secret through the ECDH key exchange (with mutual authentication).

Controlling the REPORT generation. To defeat the remote attestation, and ultimately the secure communication channel between the remote server and isv_enclave, the SGX malware aims to generate the REPORT from isv_enclave with an arbitrary ECDH public key. For this, we especially focus on part (3): how isv_enclave binds the generated ECDH public key with the REPORT when calling the EREPORT leaf function.

The Dark-ROP attack gives the SGX malware the power of invoking the EREPORT leaf function with any parameters. Thus, we can alter the parameters to generate a REPORT that contains the ECDH public key that we chose, instead of the key that was generated by isv_enclave. On generating the REPORT, we prepare a REPORTDATA in the untrusted space using the chosen ECDH public key, and then chain the ROP gadgets to copy the REPORTDATA to the enclave space. Note that EREPORT requires its parameters to be located in the enclave space. After copying the REPORTDATA, we call the EREPORT leaf function with the copied data to generate the REPORT inside isv_enclave. After this, we copy the generated REPORT from isv_enclave to isv_app and deliver the REPORT to the QE to sign it.

As a result, in the untrusted space, the attacker can retrieve a REPORT that contains the ECDH parameter of his/her own choice, and the REPORT is signed correctly.

Figure 7: The man-in-the-middle (MitM) attack of the SGX malware for hijacking the remote attestation in SGX.

Hijacking remote attestation. The full steps of hijacking the remote attestation of an enclave are as follows (see Figure 7).

First, (1) instead of isv_enclave, the SGX malware generates an ECDH public/private key pair and owns the private key. (2) The SGX malware sends the generated public key to the remote server.

Then, (3) on receiving the quote request from the server, the SGX malware calculates the shared secret corresponding to the parameters received from the remote server. Also, the SGX malware prepares TARGETINFO and REPORTDATA at isv_app. The TARGETINFO contains the information about the QE that enables the QE to cryptographically verify and sign the generated REPORT. The REPORTDATA is generated with the chosen public key as a key parameter to run EREPORT in isv_enclave. After that, the SGX malware launches the Dark-ROP attack (3-1, 3-2, and 3-3) to copy the prepared parameters (TARGETINFO and REPORTDATA) from the untrusted app to the enclave and generate a REPORT with the ECDH public key that the SGX malware generated in the first step. Moreover (3-4), the generated REPORT will be copied out from isv_enclave to the SGX malware, and the SGX malware sends the generated REPORT to the Quoting Enclave to sign it with the correct key. Because the REPORT was generated by the enclave correctly, the QE will sign it and return it to the attacker.

Finally, (4) the SGX malware sends this signed REPORT to the remote server. Now, the remote server shares the secret; however, it is shared not with isv_enclave but with the SGX malware, so the secure communication channel is hijacked by the SGX malware. Note that the remote server cannot detect the hijacking, because all parameters and the signature are correct and verified.

6 Implementation

We implemented both the proof-of-concept attack and the SGX malware on real SGX hardware. For the hardware setup, we use an Intel Core i7-6700 Skylake processor, which supports the first and only available specification of SGX, SGXv1. For the software, we run the attack on Ubuntu 14.04 LTS, running Linux kernel 4.4.0. Additionally, we use the standard Intel SGX SDK and compiler (gcc-5) to compile the code for the enclave for both attacks.

To launch the Dark-ROP attack on the real SGX hardware, we use the RemoteAttestation example binary in the Intel SGX SDK, which is a minimal program that only runs the remote attestation protocol, with a slight modification to inject an entry point that has a buffer overflow vulnerability, as mentioned in Figure 2.

Because the example is a very minimal one, we believe that if the Dark-ROP attack is successful against the RemoteAttestation example, then any other enclave program that utilizes remote attestation is exploitable by Dark-ROP if the program has memory corruption bugs.

Finding gadgets from standard SGX libraries. First, we search for gadgets from the example binary. To show the generality of finding gadgets, we find gadgets from the standard SGX libraries that are essential to run enclave
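The hijacking flow can be miniaturized as follows. This is a toy simulation, not Intel SGX SDK code: report_bind stands in for EREPORT binding a REPORTDATA to a report, and qe_sign stands in for the Quoting Enclave's EPID signature (here a trivial XOR, purely for illustration). The point is only that the server's verification passes even though the bound ECDH key is the attacker's.

```c
#include <stdint.h>

/* Toy stand-ins for EREPORT and the Quoting Enclave (not real SGX APIs). */
typedef struct { uint64_t reportdata; uint64_t sig; } report_t;

/* Fake "EPID signature": the QE signs any correctly-formed report. */
static uint64_t qe_sign(uint64_t data) { return data ^ 0x5157ULL; }

/* Fake EREPORT: bind a REPORTDATA (here, a public key) into a report. */
static report_t report_bind(uint64_t reportdata) {
    report_t r = { reportdata, 0 };
    return r;
}

/* Server side: check the QE signature and accept the bound public key. */
static int server_verify(report_t r) {
    return qe_sign(r.reportdata) == r.sig;
}

/* Dark-ROP lets the attacker call EREPORT with arbitrary REPORTDATA,
 * so the report binds the malware's key instead of the enclave's; the
 * QE then signs it, and server_verify() succeeds. */
static report_t hijack(uint64_t malware_pub) {
    report_t r = report_bind(malware_pub);
    r.sig = qe_sign(r.reportdata);
    return r;
}
```

In this model, server_verify(hijack(malware_pub)) succeeds while the bound key differs from the one isv_enclave generated, which is exactly the undetectable substitution described above.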
[4] Barnett, R. Ghost gethostbyname() heap overflow in glibc (CVE-2015-0235), January 2015.

[5] Bauman, E., and Lin, Z. A case for protecting computer games with SGX. In Proceedings of the 1st Workshop on System Software for Trusted Execution (2016), ACM, p. 4.

[6] Baumann, A., Peinado, M., and Hunt, G. Shielding applications from an untrusted cloud with Haven. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Broomfield, Colorado, Oct. 2014), pp. 267–283.

[7] Bittau, A., Belay, A., Mashtizadeh, A., Mazières, D., and Boneh, D. Hacking blind. In 2014 IEEE Symposium on Security and Privacy (2014), IEEE, pp. 227–242.

[8] Bletsch, T., Jiang, X., Freeh, V. W., and Liang, Z. Jump-oriented programming: a new class of code-reuse attack. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security (2011), ACM, pp. 30–40.

[9] Buchanan, E., Roemer, R., Shacham, H., and Savage, S. When good instructions go bad: Generalizing return-oriented programming to RISC. In Proceedings of the 15th ACM Conference on Computer and Communications Security (2008), ACM, pp. 27–38.

[10] Checkoway, S., and Shacham, H. Iago attacks: Why the system call API is a bad untrusted RPC interface. In Proceedings of the 18th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (Houston, TX, Mar. 2013), pp. 253–264.

[11] Chhabra, S., Savagaonkar, U., Long, M., Borrayo, E., Trivedi, A., and Ornelas, C. Memory encryption engine integration, June 23, 2016. US Patent App. 14/581,928.

[12] Durumeric, Z., Kasten, J., Adrian, D., Halderman, J. A., Bailey, M., Li, F., Weaver, N., Amann, J., Beekman, J., Payer, M., et al. The matter of Heartbleed. In Proceedings of the 2014 Internet Measurement Conference (2014), ACM, pp. 475–488.

[13] Google. glibc getaddrinfo() stack-based buffer overflow (CVE-2015-7547), February 2016.

[14] Greene, J. Intel Trusted Execution Technology. Intel Technology White Paper (2012).

[15] Gueron, S. A memory encryption engine suitable for general purpose processors. Cryptology ePrint Archive, Report 2016/204, 2016. http://eprint.iacr.org/.

[20] Intel Corporation. Intel Software Guard Extensions Programming Reference (rev2), Oct. 2014. 329298-002US.

[21] Intel Corporation. Intel SGX Enclave Writers Guide (rev1.02), 2015. https://software.intel.com/sites/default/files/managed/ae/48/Software-Guard-Extensions-Enclave-Writers-Guide.pdf.

[22] Intel Corporation. Intel SGX SDK for Windows* User Guide (rev1.1.1), 2016. https://software.intel.com/sites/default/files/managed/d5/e7/Intel-SGX-SDK-Users-Guide-for-Windows-OS.pdf.

[23] Johnson, S., Savagaonkar, U., Scarlata, V., McKeen, F., and Rozas, C. Technique for supporting multiple secure enclaves, June 21, 2012. US Patent App. 12/972,406.

[24] Aumasson, J.-P., and Merino, L. SGX secure enclaves in practice: security and crypto review, 2016. [Online; accessed 16-August-2016].

[25] Kim, S., Shin, Y., Ha, J., Kim, T., and Han, D. A first step towards leveraging commodity trusted execution environments for network applications. In Proceedings of the 14th ACM Workshop on Hot Topics in Networks (HotNets) (Philadelphia, PA, Nov. 2015).

[26] Lee, S., Shih, M.-W., Gera, P., Kim, T., Kim, H., and Peinado, M. Inferring fine-grained control flow inside SGX enclaves with branch shadowing (to appear). In Proceedings of the 26th USENIX Security Symposium (Security) (Vancouver, Canada, Aug. 2017).

[27] Ohrimenko, O., Schuster, F., Fournet, C., Mehta, A., Nowozin, S., Vaswani, K., and Costa, M. Oblivious multi-party machine learning on trusted processors. In USENIX Security Symposium (2016), pp. 619–636.

[28] Pappas, V., Polychronakis, M., and Keromytis, A. D. Smashing the gadgets: Hindering return-oriented programming using in-place code randomization. In 2012 IEEE Symposium on Security and Privacy (2012), IEEE, pp. 601–615.

[29] Rutkowska, J. Thoughts on Intel's upcoming Software Guard Extensions (Part 2), Sept. 2013. http://theinvisiblethings.blogspot.com/2013/09/thoughts-on-intels-upcoming-software.html.

[30] Schuster, F., Costa, M., Fournet, C., Gkantsidis, C., Peinado, M., Mainar-Ruiz, G., and Russinovich, M. VC3: Trustworthy data analytics in the cloud using SGX. In Proceedings of the 36th IEEE Symposium on Security and Privacy (Oakland) (San Jose, CA, May 2015).

[34] Sinha, R., Costa, M., Lal, A., Lopes, N., Seshia, S., Rajamani, S., and Vaswani, K. A design and verification methodology for secure isolated regions. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (2016), ACM.

[36] Tsai, C.-C., Arora, K. S., Bandi, N., Jain, B., Jannen, W., John, J., Kalodner, H. A., Kulkarni, V., Oliveira, D., and Porter, D. E. Cooperation and security isolation of library OSes for multi-process applications. In Proceedings of the Ninth European Conference on Computer Systems (2014), ACM, p. 9.

[39] Zhang, F., Cecchetti, E., Croman, K., Juels, A., and Shi, E. Town Crier: An authenticated data feed for smart contracts. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), ACM, pp. 270–282.
Table 2: Gadgets used for launching the Dark-ROP attack against the RemoteAttestation example code in the Intel SGX SDK. We note that we found all gadgets in the standard library files that are usually linked to the enclave program. First, all of the objects in libsgx_trts.a must be linked to the enclave binary, because the library contains the code for controlling the enclave and for communication between the untrusted app and the enclave, which is essential for the enclave to function. Second, we found the memcpy() gadget in the standard C library for SGX (libsgx_tstdc.a).

Gadget in sgx_register_exception_handler (libsgx_trts.a):
    mov rax, rbx
    pop rbx
    pop rbp
    pop r12
    ret
A gadget for manipulating the rax register. Since the attacker can control the rbx register with the gadget above, the attacker can set rax to an arbitrary value. This is used for setting the index of the leaf function for the ENCLU instruction.

Gadgets in relocate_enclave (libsgx_trts.a):
    pop rsi
    pop r15
    ret

    pop rdi
    ret
Gadgets for manipulating the rsi and rdi registers to set arguments for invoking memcpy and the other library functions.

Memcpy gadget, memcpy (libsgx_tstdc.a):
A gadget for copying enclave code and data to untrusted memory, and for copying in the reverse direction.

ENCLU gadget in do_ereport (sgx_trts.lib):
    enclu
    pop rax
    ret
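The gadgets in Table 2 end in short instruction sequences with fixed encodings, so a scanner can locate candidates in a library's code bytes by pattern. A minimal sketch (illustrative only: it matches just the exact two-byte encoding of pop rax; ret, 0x58 0xc3, whereas a real gadget finder would also consider other pop/ret variants and misaligned decodings):

```c
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the first "pop rax; ret" sequence (0x58 0xc3)
 * in a code buffer, or -1 if it is absent. */
ptrdiff_t find_pop_rax_ret(const uint8_t *code, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++)
        if (code[i] == 0x58 && code[i + 1] == 0xc3)
            return (ptrdiff_t)i;
    return -1;
}
```

The offset returned, added to the library's load address inside the enclave, gives the candidate gadget address to place on the attack stack.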
Zhichao Hua12 Jinyu Gu12 Yubin Xia12 Haibo Chen12 Binyu Zang1
Haibing Guan2
1 Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
2 Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University
{huazhichao123,gujinyu,xiayubin,haibochen,byzang,hbguan}@sjtu.edu.cn
(Figure: TrustZone architecture—the normal world (kernel mode: OS, hypervisor) and the secure world (TEE-kernel; monitor mode: Secure Monitor Mode) switch via smc, connected over the AXI bus and AXI-APB bridge to the GIC, RNG, and peripherals.)

of all VMs as well as the hypervisor, which makes the

One straightforward way to virtualize TrustZone would be using a hypervisor in the normal world to sim-
Table 1: Security properties provided by physical TrustZone, how a malicious hypervisor can violate each, and the consequence. S/W and N/W denote the secure world and the normal world, respectively.

Secure Boot
  P-1.1: S/W must boot before N/W. | Violate boot order. → Secure configuration bypass.
  P-1.2: Boot image of S/W must be checked. | Violate integrity check of boot image. → Code injection in guest TEE.
  P-1.3: S/W cannot be replaced. | Replace a guest TEE with another one. → Providing malicious TEE.

CPU States Protection
  P-2.1: smc must switch to the correct world. | Switch to a wrong guest TEE. → Providing malicious TEE.
  P-2.2: Protect the integrity of N/W CPU states during switching. | Tamper CPU states during switching. → Controlling execution of guest TEE.
  P-2.3: Protect S/W CPU states. | Tamper guest TEE's CPU states. → Controlling execution of guest TEE.

Memory Isolation
  P-3.1: Only S/W can access secure memory. | Let arbitrary VM access guest secure memory. → Info leakage.
  P-3.2: Only S/W can configure memory partition. | Let arbitrary VM configure guest memory partition. → Reconfigure secure memory as normal.

Peripheral Assignment
  P-4.1: Secure interrupts must be injected into S/W. | Forbid interrupt being injected into guest TEE. → Disturbing the execution of guest TEE.
  P-4.2: N/W cannot access secure peripherals. | Let guest N/W access secure peripherals. → Info leakage of secure peripherals.
  P-4.3: Secure peripherals are trusted for S/W. | Provide malicious peripherals for guest TEE. → Info leakage of guest TEE.
  P-4.4: Only S/W can partition interrupts/peripherals. | Let arbitrary VM configure guest interrupts/peripherals. → Reconfigure secure peripheral as normal.
signs. We then show our threat model, assumptions, and the design of vTZ.

3.1 Goals and Challenges

The goal of vTZ is to embrace both strong security and high performance. We analyze a set of security properties that physical TrustZone provides, as shown in Table 1, which should be kept after TrustZone virtualization. To enforce these properties, vTZ needs to address the following challenges:

• Challenge 1: A compromised hypervisor may violate the booting sequence or boot a compromised or even malicious guest TEE-kernel.

• Challenge 2: A compromised hypervisor may hijack a guest TEE's execution by tampering with its CPU states. It may even switch to a malicious TEE when performing world switching.

• Challenge 3: Once the hypervisor is compromised, there is no confidentiality and integrity guarantee for the secure memory owned by a guest TEE.

• Challenge 4: Guest TEEs trust peripherals that are configured as secure. A malicious hypervisor may provide a malicious virtual peripheral to a guest TEE.

3.2 Alternative Designs

Design-1: Dual-Hypervisor. This design uses a full-featured hypervisor to virtualize the secure world, which can be called a TEE hypervisor, as shown in Figure 2(a). Like the hypervisor in the normal world, the TEE hypervisor is in charge of multiplexing the secure world and offers a virtualized interface of TrustZone to the normal world. It also needs to associate a guest VM with its corresponding guest TEE. However, this design has several issues.

The first issue is its large TCB. A TEE hypervisor needs to virtualize a full-featured execution environment as the physical TrustZone provides, which requires non-trivial implementation complexity due to the lack of virtualization support in the secure world. This leads to a large code base for the TEE hypervisor. Further, since the TEE hypervisor needs to work together with the REE hypervisor to bind one guest TEE to each guest, the TCB of such a design includes not only the TEE hypervisor but also the REE hypervisor. Otherwise, a malicious REE hypervisor may let one guest VM in the normal world switch to another guest's TEE.

The second issue is poor compatibility with existing TEE-kernels. The TEE-kernel cannot run in kernel mode, since only the TEE hypervisor can reside at the highest privilege level. While "trap and emulate" may be a viable solution, there are some sensitive but unprivileged instructions, which will silently fail instead of trapping into the hypervisor when executed in user mode. For example, modifying some privileged bits of the CPSR register in user mode will simply be ignored. Para-virtualization is also possible, but it may lead to compatibility and security problems, since existing TEE-kernels have to be modified and the TEE hypervisor has to weaken isolation by exposing more states and interfaces to guest TEE-kernels.

The third issue is the large overhead due to costly world switching, which involves two hypervisors to trap
Figure 2: Different designs of TrustZone virtualization. VMn and VMs denote the normal world and secure world of a VM, respectively. T-K means TEE-kernel. (The panels show the normal and secure worlds in user and kernel mode; labels include "Isolated Environment", "Control Flow Locking", "Secured Modules", and "Trusted Computing Base".)
and emulates the world switching. An smc (secure monitor call) will first be trapped into the REE hypervisor and then transferred to the TEE hypervisor, which in turn transfers the call to the designated guest TEE. This results in a long call path and several privilege level crossings.

Design-2: Full Simulation of TrustZone. This design fully simulates multiple guest TEEs by using the hypervisor running in the normal world to virtualize the guest TEEs along with the normal world guest VMs and to handle their interaction, as shown in Figure 2(b). All switches between the two virtualized worlds are simulated by switching between two different VMs (called VM_n and VM_s). This also means that the physical TrustZone is not essential (not used).

Compared with the first design, this design achieves better performance, lower complexity, and better compatibility. Its main problem is the large TCB of a commodity hypervisor, which includes not only the hypervisor itself but also the management VM (for Xen) or the host OS (for KVM). Either contains millions of lines of code (LoC). There have been 236 and 103 security vulnerabilities uncovered in Xen [19] and KVM [18], respectively, not to mention those in Linux itself.

3.3 Threat Model and Assumptions

vTZ assumes an ARM-based platform that implements the TrustZone extension together with the virtualization extension. All the hardware implementations are correct and trustworthy. vTZ also assumes that the whole system is loaded securely, including both the secure world's and the normal world's code, which is ensured by the secure boot technology of TrustZone. Intuitively, secure boot only guarantees the integrity of the system during the boot-up process, but not the integrity thereafter. We do not consider side channel attacks or physical attacks on the memory and SoC, like cold boot or bus snooping. We also do not consider vulnerabilities within a guest TEE.

We assume that any guest VM or guest TEE can be malicious. Like prior work using same-privilege-level protection [30, 21, 31], we assume the hypervisor itself is not malicious and is trusted during system booting to do initialization correctly. After booting, the hypervisor can be compromised.

3.4 Our Design: vTZ

vTZ adopts the principle of separating functionality from protection [68]. Specifically, vTZ relies on the normal world hypervisor to virtualize the functionality of guest TEEs but leverages the physical TrustZone to enforce protection, as shown in Figure 2(c).

To enforce booting integrity, vTZ uses the secure world to check the booting sequence as well as to perform integrity checking (Section 5.1). To provide efficient memory protection, vTZ uses Secured Memory Mapping (SMM), a module in the secure world, to control all the stage-2 translation tables as well as the hypervisor's translation table (Section 4.1). Based on that, vTZ can set security policies to, e.g., ensure that a guest TEE's secure memory will never be mapped to other VMs or the hypervisor, and that the hypervisor cannot map any new executable pages in hyp mode after booting up.

To protect the CPU states during context switching and guest TEE execution, Secured World Switching (SWS) hooks and checks all the switches in the secure world, and saves and restores the guest CPU states for each guest TEE (Section 4.3). The virtual peripherals are also isolated by securely virtualizing the resource partitioning of peripherals and interrupts (Section 5.3).
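The mapping policy sketched above can be illustrated with a small check that a secure-world module might apply to each proposed mapping. This is a toy model, not vTZ's actual interface; the structures and names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an SMM-style mapping policy check (illustrative only). */
enum owner { OWNER_NONE, OWNER_GUEST_TEE, OWNER_VM, OWNER_HYP };

struct page_info {
    enum owner owner;   /* who currently owns the physical page */
    int owner_id;       /* which guest owns it (if a guest TEE page) */
};

static bool boot_done = false;

/* Approve or deny a proposed mapping of physical page `pg`
 * into the address space of `requester` (with id `req_id`). */
bool smm_check_map(const struct page_info *pg, enum owner requester,
                   int req_id, bool executable)
{
    /* A guest TEE's secure memory may only be mapped by its own guest TEE. */
    if (pg->owner == OWNER_GUEST_TEE &&
        !(requester == OWNER_GUEST_TEE && req_id == pg->owner_id))
        return false;
    /* No new executable pages in hyp mode after booting. */
    if (requester == OWNER_HYP && executable && boot_done)
        return false;
    return true;
}
```

The point of the sketch is that the policy is a pure check over each mapping request, so it can live entirely in the secure world while the hypervisor keeps all scheduling and management functionality.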
Figure 3: System Architecture. Two secured modules (Secured World Switching (SWS) and Secured Memory Mapping (SMM)) are located in the secure world. Constrained Isolated Execution Environments (CIEEs) run in hyp mode in the normal world. Control-flow Lock (CFLock) provides a non-bypassable hook for CIEEs and the secured modules when an exception happens.

The key to achieving the above interposition and protection efficiently is that vTZ extends prior work on same-privilege isolation [30, 21, 31] to virtualized environments. Specifically, vTZ provides a set of Constrained Isolated Execution Environments (CIEEs) in hyp mode (Section 4.4). Each CIEE is isolated from the others as well as from the hypervisor. To prevent sophisticated attacks like ROP [53], vTZ ensures control-flow and data-flow integrity for the code snippets within each CIEE. Control Flow Locking (CFLock) (Section 4.2) is used to provide a non-bypassable hook for CIEEs and the secured modules (SMM and SWS) in the secure world. CIEEs are not part of the TCB since a compromised CIEE cannot affect the security of other CIEEs or guest TEEs.

The TCB of vTZ contains only the secured modules running in the secure world, which have less than 2k LoC. In contrast, the TCB for design-1 contains the huge REE hypervisor as well as a smaller TEE hypervisor containing tens of thousands of LoC (e.g., NOVA [57] contains a 9k LoC microhypervisor and a 29k LoC VMM, and OKL4 [33] contains 9.8k LoC). The TCB for design-2 includes a whole commodity hypervisor, which contains several millions of LoC.

4 Protection Mechanisms

As mentioned in Section 3.4, vTZ leverages four mechanisms, SMM, CFLock, SWS, and CIEE, to enforce the security properties for guest TEEs, as shown in Figure 3. This section describes these mechanisms in detail.

4.1 SMM: Secured Memory Mapping

The Secured Memory Mapping (SMM) module runs in the secure world for memory isolation. SMM controls the VA-to-PA mapping in hyp mode as well as the IPA-to-PA mappings of guest VMs and CIEEs, and enforces different security policies on every memory mapping operation. It provides only one interface, to some specific CIEEs that contain the virtual partitioning controller emulators (introduced in Section 5.3), for configuring security policies.

Exclusive Control of Memory Mapping: To ensure that only the SMM can load or change a memory mapping, vTZ enforces that the hypervisor does not contain any instruction to do so. There are three ways for a hypervisor to modify a memory mapping: changing the base register to a new translation table, changing the entries of a translation table, or disabling the address translation. In vTZ, the hypervisor is modified so that there is no instruction that loads a new translation table or disables translation. Meanwhile, all the page tables are set as read-only to the hypervisor. The corresponding operations are replaced with invocations to SMM.

SMM maps the code pages of the hypervisor as read-only so that the code cannot be changed at runtime. It also controls the swapping of hypervisor code and ensures the integrity of the code during swapping in. To prevent return-to-guest attacks, SMM ensures that no guest memory will be mapped as executable in the hypervisor's space. Moreover, SMM forbids mapping any new page as executable in the hypervisor's address space after system booting. vTZ also ensures that there is no ROP gadget that can be used to form new instructions that operate on the translation tables (which is relatively easy to check on the ARM platform since instruction alignment is required). We also consider all the ISAs with different instruction lengths.

4.2 CFLock: Control Flow Locking

Locking the control flow means enforcing a control flow transfer to specific code on a certain event, so that this code cannot be bypassed when handling the event. CFLock can "lock" the control flow when any exception happens, which is used to ensure the non-bypassability of the SWS and CIEEs (described in the following sections). We force the control flow at the entry of the different exception handlers. The ARM architecture uses a special register to point to the base address of the exception table, and each table entry corresponds to one exception handler. We deprive the hypervisor of the ability to modify this base register by replacing all the modifying instructions with invocations to the secure world, similar to Section 4.1. The exception table is also marked as read-only by SMM, to enforce that each exception will eventually reach one specific handler. After that, certain control flow transfer instructions (e.g., an uncondi-
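The base-register locking used by both SMM and CFLock can be sketched as a small state machine, in C. This is a toy model under assumed names; in vTZ the check runs in the secure world on a real register write:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: the exception vector base register may only be set
 * through the secure world, and only to the table registered at boot. */
static uint64_t vbar;                 /* modeled vector base register */
static uint64_t registered_table;     /* address fixed during secure boot */
static bool locked = false;

void secure_boot_register_vectors(uint64_t table)
{
    registered_table = table;
    vbar = table;
    locked = true;                    /* later changes must be denied */
}

/* All hypervisor writes to the base register are rewritten into this
 * secure-world call, which refuses any redirection after boot. */
bool secure_set_vbar(uint64_t table)
{
    if (locked && table != registered_table)
        return false;                 /* attempt to redirect the handlers */
    vbar = table;
    return true;
}
```

Combined with the read-only exception table, a model like this guarantees that every exception still reaches one of the registered handlers.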
• P-2.2: Protect the integrity of N/W CPU states during switching.

• P-2.3: Protect S/W CPU states.

SWS intercepts all the switching between a guest VM and the hypervisor. A switch includes saving the states of the current VM, finding the next VM, and restoring its states. The state saving and restoring are done by SWS in the secure world, while the finding of the next VM is done by the hypervisor, as shown in the bottom half of Figure 5 (schedule, check, and restore of the VM_n states). Then SWS can check the restored target VM

5.1 Virtualizing Secure Boot

Secure boot is used to ensure the integrity of booting. The booting process of a TrustZone-enabled device includes the following steps: 1) Loading a bootloader from ROM, which is tamper resistant. 2) The bootloader initializes the secure world and loads a TEE-kernel into memory. 3) The TEE-kernel does a full initialization of the secure world, including the secure world translation table, vector table and so on. 4) The TEE-kernel switches to the normal world and executes a kernel-loader. 5) The kernel-loader loads a non-secure OS and runs it.

…peripherals.

• Interrupt partitioning, which is configured by the General Interrupt Controller (GIC).

Once set as secure, a resource can only be accessed by the secure world. A secure interrupt must be injected into the secure world and will lead to a world switch if it happens in the normal world. All three controllers can be used to repartition the resources, and only by the secure world.

It is necessary for a guest to dynamically partition some critical virtual devices such as a virtual Random Number Generator (vRNG), the vGIC and so on. For example, a guest may only allow its guest TEE to configure the vGIC, or first initialize a vRNG in VM_s and then use it in VM_n.

We have implemented the commonly used secure devices for existing TEE-kernels, like the TZASC, TZPC, GIC, UART, RTIC and so on. We virtualize these devices for each guest by "trap and emulate". Since all ARM devices use memory-mapped I/O, access to a virtual secure device can easily be trapped by controlling the stage-2 translation table. When a virtual secure device is accessed, the CPU gets trapped. The trap is handled by a corresponding emulator, which runs in a CIEE. The device states are stored as secure objects belonging to the CIEE. vTZ keeps different copies of the device state for different guests, so that the emulator can virtualize the secure device for each of them. If the device is marked as a secure peripheral by one guest, the emulator will enforce

Figure 6: TEE suspension and resumption. During the suspension operation, all of VM_s's memory and CPU states are encrypted in the secure world, and a hash value is computed as well. During the resumption operation, such states are decrypted in the secure world and SWS will then verify the integrity.
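The suspend/resume flow of Figure 6 can be sketched with a toy cipher and checksum. These are deliberately trivial stand-ins for the real encryption and cryptographic hash that a deployment would use; the flow (encrypt and record a digest on suspend, verify before decrypting on resume) is the point:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy suspend/resume integrity model: XOR "cipher" + weighted checksum
 * as illustrative stand-ins for real encryption and hashing. */
#define KEY 0x5a

static uint32_t checksum(const uint8_t *buf, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * 31 + buf[i];
    return h;
}

/* Encrypt the VM state in place; return the digest recorded at suspend. */
uint32_t tee_suspend(uint8_t *state, size_t n)
{
    for (size_t i = 0; i < n; i++)
        state[i] ^= KEY;
    return checksum(state, n);
}

/* Verify the recorded digest, then decrypt; returns 0 on tampering. */
int tee_resume(uint8_t *state, size_t n, uint32_t recorded)
{
    if (checksum(state, n) != recorded)
        return 0;                     /* integrity check failed */
    for (size_t i = 0; i < n; i++)
        state[i] ^= KEY;
    return 1;
}
```

Because the digest is taken and checked inside the secure world, a hypervisor that modifies the suspended state cannot produce a matching digest.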
6.2 Application Overhead

Single Guest: We test four real applications (ccrypt, mcrypt, GnuPG and GoHttp) and compare them with original Xen on the ARMv7 and ARMv8 platforms. We use these applications to encrypt/transfer a file of about 1KB, and protect the encryption logic in the real secure world/guest TEE. The application/guest VM is pinned to one physical core in the native environment/virtualization environment, respectively. Figure 7(b) and Figure 7(c) show the overhead. Our system has little overhead compared with original Xen on both the ARMv7 and ARMv8 platforms.

Multi-Guest: We compare the concurrent performance of vTZ with the native environment, real TrustZone, and the original virtualization environment (Xen). The GoHttp server is used for the evaluation, and we protect its encryption logic in the secure world/guest TEE. In the native environment (including protection with TrustZone), each GoHttp server runs as a normal process. In the virtualization environment (including protection by vTZ), each of them runs in one guest VM. The client, which sends the https requests, runs in the same guest as the server to bypass the network overhead. Each client downloads a 20MB file from the server. We do not evaluate more applications concurrently because the memory on the board limits the number of VMs. The results are shown in Figure 8.

On the ARMv7 platform, which only has one core enabled, the virtualization (Xen hypervisor) itself brings a non-negligible (about 40%) performance slowdown. On the ARMv8 platform, the overhead is mitigated by the benefit of 8 enabled cores. The overhead of virtualization becomes larger compared with the single-guest case because the GoHttp server transfers a big file (20MB) for each request. Such a transfer takes longer than the Xen scheduler's time slice (30ms), so the scheduler influences the performance. Finally, vTZ has about a 5% performance slowdown compared with original Xen on the ARMv8 implementation, and less than a 30% performance slowdown compared with the native environment.

6.3 Server Application Overhead

We also evaluate two widely used server applications, MongoDB [5] and Apache [3], on the Hikey board. As in the multi-guest evaluation, we run the applications on four different environments. Again, the clients are executed together with the server to bypass the network overhead. One difference is that the guest has eight virtual cores instead of one.

Figure 9(a) shows the insert operation throughput of MongoDB. The client continually inserts objects into the server. We evaluate the throughput with different sizes of objects. The result shows that vTZ has little overhead compared with the virtualization environment. For Apache (shown in Figure 9(b)), we evaluate the download throughput by downloading a file (of size 100MB) from the server over the https protocol. The result shows that using virtual TrustZone causes less than 5% overhead in the virtualization environment.

7 Security Analysis

In this section, we assume a strong adversary who can directly boot a malicious guest (including a malicious guest TEE) and even compromise the hypervisor.

7.1 Breaking TrustZone Properties

Booting Protection: A compromised hypervisor can tamper with the system image loaded in a guest TEE. The Secured World Switching (SWS) module will check the integrity of the image before execution, and only allow the hypervisor to enter a registered guest TEE. Meanwhile, the attacker may want to boot a guest VM_n before
Figures 7 and 8 (charts not shown): normalized performance of ccrypt, the https server, mcrypt, and GnuPG under the native environment (N), TrustZone (TZ), Xen (X), and vTZ, with 1, 2, 4, and 6 concurrent guests.
Figure 9: Throughput of MongoDB and Apache (higher is better), under Native, TZ, Xen, and vTZ. Figure (a) shows the throughput of MongoDB's insert operation with different sizes of objects (x-axis: size of item value in bytes). Figure (b) shows the performance of the Apache server with different TCP buffer sizes (y-axis: throughput of Apache in MB/s).

its guest VM_s, and then try to bypass the security configuration. SWS defends against this by ensuring that the guest TEE must be executed first.

CPU States Protection: When a guest switches between its two worlds, a compromised hypervisor may switch to another, malicious world instead. SWS checks every world switch operation issued by a guest and ensures that the target world must be executed thereafter. This also forbids the hypervisor from ignoring the world switch operation or making one guest's two worlds execute at the same time. During the world switch, the attacker may also try to tamper with the CPU states. Since SWS is responsible for saving and restoring each VM's CPU states and synchronizing the general registers between a guest's two worlds, it will check and refuse such tampering.

Memory Protection: A compromised hypervisor may want to map one guest's secure memory to a compromised VM. The Secured Memory Mapping (SMM) prevents these malicious behaviors by controlling and checking all mappings to physical addresses and ensuring that one guest's secure memory can only be mapped to its guest TEE.

vTZ allows a guest TEE to dynamically repartition its memory through a virtual memory configuration device, which invokes SMM to update the security policy. A compromised hypervisor may try to reconfigure a guest's secure memory as normal memory by sending a faked request to SMM or to the configuration device. In vTZ, SMM will only handle requests from the virtual configuration device, which can be identified by its CIEE entry address, and will deny fake requests from the hypervisor. Meanwhile, the configuration device also verifies whether the request comes from a guest TEE.

Peripheral Protection: vTZ enables a guest TEE to configure interrupts as secure or normal. A compromised hypervisor may try to inject a secure interrupt into a compromised VM. CFLock ensures that all interrupts will first be handled in a CIEE, which identifies the type of the interrupt and injects it into the guest TEE. The interrupt will later be handled by the hypervisor if it is non-secure. The hypervisor may also provide malicious virtual devices to a guest TEE. A guest TEE only trusts the virtual devices provided by vTZ, and all other devices must be treated as untrusted.

7.2 Hacking vTZ

Tampering with System Code: During system initialization, the attacker may try to modify the code of the hypervisor
[21] Azab, A. M., Ning, P., Shah, J., Chen, Q., Bhutkar, R., Ganesh, G., Ma, J., and Shen, W. Hypervision across worlds: Real-time kernel protection from the ARM TrustZone secure world. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (2014), ACM, pp. 90–102.

[22] Azab, A. M., Swidowski, K., Bhutkar, J. M., Shen, W., Wang, R., and Ning, P. SKEE: A lightweight secure kernel-level execution environment for ARM. In Network & Distributed System Security Symposium (NDSS) (2016).

[23] Baumann, A., Peinado, M., and Hunt, G. Shielding applications from an untrusted cloud with Haven. ACM Transactions on Computer Systems (TOCS) 33, 3 (2015), 8.

[24] Behrang. A software level analysis of TrustZone OS and trustlets in Samsung Galaxy phone. https://www.sensepost.com/blog/2013/a-software-level-analysis-of-trustzone-os-and-trustlets-in-samsung-galaxy-phone/, 2013.

[25] Boggs, D., Brown, G., Tuck, N., and Venkatraman, K. Denver: Nvidia's first 64-bit ARM processor. IEEE Micro 35, 2 (2015), 46–55.

[26] Chen, X., Garfinkel, T., Lewis, E., Subrahmanyam, P., Waldspurger, C., Boneh, D., Dwoskin, J., and Ports, D. Overshadow: A virtualization-based approach to retrofitting protection in commodity operating systems. In Proc. ASPLOS (2008), ACM, pp. 2–13.

[27] Chen, Y., Reymondjohnson, S., Sun, Z., and Lu, L. Shreds: Fine-grained execution units with private memory. In Security and Privacy (SP), 2016 IEEE Symposium on (2016), IEEE, pp. 56–71.

[28] Dall, C., and Nieh, J. KVM/ARM: The design and implementation of the Linux ARM hypervisor. In ASPLOS (2014).

[29] Criswell, J., Dautenhahn, N., and Adve, V. Virtual Ghost: Protecting applications from hostile operating systems. ACM SIGARCH Computer Architecture News 42, 1 (2014), 81–96.

[30] Criswell, J., Geoffray, N., and Adve, V. S. Memory safety for low-level software/hardware interactions. In USENIX Security Symposium (2009), pp. 83–100.

[31] Dautenhahn, N., Kasampalis, T., Dietz, W., Criswell, J., and Adve, V. Nested Kernel: An operating system architecture for intra-kernel privilege separation. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (2015), ACM, pp. 191–206.

[32] Ge, X., Vijayakumar, H., and Jaeger, T. SPROBES: Enforcing kernel code integrity on the TrustZone architecture. arXiv preprint arXiv:1410.7747 (2014).

[37] Hwang, J., Suh, S., Heo, S., Park, C., Ryu, J., Park, S., and Kim, C. Xen on ARM: System virtualization using Xen hypervisor for ARM-based secure mobile phones. In IEEE CCNC (2008).

[38] Jang, J., Kong, S., Kim, M., Kim, D., and Kang, B. B. SeCReT: Secure channel between rich execution environment and trusted execution environment. In NDSS (2015).

[39] Klein, G., Andronick, J., Elphinstone, K., Murray, T., Sewell, T., Kolanski, R., and Heiser, G. Comprehensive formal verification of an OS microkernel. ACM Transactions on Computer Systems (TOCS) 32, 1 (2014), 2.

[40] Klein, G., Elphinstone, K., Heiser, G., Andronick, J., Cock, D., Derrin, P., Elkaduwe, D., Engelhardt, K., Kolanski, R., Norrish, M., et al. seL4: Formal verification of an OS kernel. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009), ACM, pp. 207–220.

[41] Kwon, Y., Dunn, A. M., Lee, M. Z., Hofmann, O. S., Xu, Y., and Witchel, E. Sego: Pervasive trusted metadata for efficiently verified untrusted system services. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (2016), ACM, pp. 277–290.

[42] Li, W., Li, H., Chen, H., and Xia, Y. AdAttester: Secure online mobile advertisement attestation using TrustZone. In MobiSys (2015).

[43] Li, W., Ma, M., Han, J., Xia, Y., Zang, B., Chu, C.-K., and Li, T. Building trusted path on untrusted device drivers for mobile devices. In Proceedings of the 5th Asia-Pacific Workshop on Systems (2014), ACM, p. 8.

[44] Li, Y., McCune, J., Newsome, J., Perrig, A., Baker, B., and Drewry, W. MiniBox: A two-way sandbox for x86 native code. In 2014 USENIX Annual Technical Conference (USENIX ATC 14) (2014), pp. 409–420.

[45] Liu, D., and Cox, L. P. VeriUI: Attested login for mobile devices. In Proceedings of the 15th Workshop on Mobile Computing Systems and Applications (2014), ACM, p. 7.

[46] McCune, J. M., Li, Y., Qu, N., Zhou, Z., Datta, A., Gligor, V., and Perrig, A. TrustVisor: Efficient TCB reduction and attestation. In Security and Privacy (SP), 2010 IEEE Symposium on (2010), IEEE, pp. 143–158.

[47] Morgan, T. P. ARM servers: Cavium is a contender with ThunderX. https://www.nextplatform.com/2015/12/09/arm-servers-cavium-is-a-contender-with-thunderx/, 2015.

[48] Perez, R., Sailer, R., van Doorn, L., et al. vTPM: Virtualizing the trusted platform module. In Proc. 15th Conf. on USENIX Security Symposium (2006), pp. 305–320.
Sangho Lee† Ming-Wei Shih† Prasun Gera† Taesoo Kim† Hyesoon Kim† Marcus Peinado∗
...
provided by mbed TLS for testing. By default, mbed TLS's RSA implementation uses the Chinese Remainder Theorem (CRT) technique to speed up computation. Thus, we observed two executions of mbedtls_mpi_exp_mod with two different 512-bit CRT exponents in each iteration. The sliding-window size was five.

On average, the branch shadowing attack recovered approximately 66% of the bits of each of the two CRT exponents from a single run of the victim (averaged over 1,000 executions). The remaining bits (34%) correspond to loop iterations in which the two shadowed branches returned different results (i.e., predicted versus mispredicted). We discarded those measurements, as they were impacted by platform noise, and marked the corresponding bits as unknown. The remaining 66% of the bits were inferred correctly with an accuracy of 99.8%, where the standard deviation was 0.003.

The events that cause the attack to miss about 34% of the key bits appear to occur at random times. Different runs reveal different subsets of the key bits. After at most 10 runs of the victim, the attack recovers virtually the entire key. This number of runs is small compared to existing cache-timing attacks, which demand several hundreds to several tens of thousands of runs to reliably recover keys [20, 35, 65].

Timing-based branch shadowing. Instead of using the LBR, we measured how long it takes to execute the shadow branches using RDTSCP while maintaining the other techniques, including the modified local APIC timer and victim isolation. When the two target branches were taken, the shadow branches took 55.51 cycles on average, with a standard deviation of 48.21 cycles (1,000 iterations). When the two target branches were not taken, the shadow branches took 93.89 cycles on average, with a standard deviation of 188.49 cycles. Because of the high variance, finding a good decision boundary was challenging, so we built a support vector machine classifier using LIBSVM (with an RBF kernel and default parameters). Its accuracy was 0.947 (10-fold cross validation)—i.e., we need to run this attack at least two times more than the LBR-based attack to achieve the same level of accuracy.

Controlled-channel attack. We also evaluated the controlled-channel attack against Figure 5. We found that mbedtls_mpi_exp_mod conditionally called mpi_montmul (marked with +) according to the value of e_i, and both functions were located on different code pages. Thus, by carefully unmapping these pages, an attacker can monitor when mpi_montmul is called. However, as Figure 6 shows, because of the sliding-window technique, the controlled-channel attack cannot identify every bit unless it knows W[wbits]—i.e., this attack can only know the first bit of each window (always one) and the skipped bits (always zero).

Figure 6: Controlled-channel attack against sliding-window exponentiation (window size: 5). It only knows the first bit of each window (always one) and the skipped bits (always zero).

The number of recognizable bits completely depends on how the bits of the exponent are distributed. Against the default RSA-1024 private key of mbed TLS, this attack identified 334 bits (32.6%). Thus, we conclude that the branch shadowing attack is better than the controlled-channel attack for obtaining fine-grained information.

4.2 Case Study

We also studied other sensitive applications that branch shadowing can attack. Specifically, we focused on examples in which the controlled-channel attack cannot extract any information, e.g., control flows within a single page. We attacked three more applications: 1) two libc functions (strtol and vfprintf) in the Linux SGX SDK, 2) LibSVM, ported to Intel SGX, and 3) some Apache modules ported to Intel SGX. We achieved interesting results, such as how long an input number is (strtol), what the input format string looks like (vfprintf), and what kind of HTTP request an Apache server gets (lookup_builtin_method), as summarized in Table 3. Note that the controlled-channel attack cannot obtain the same information because those functions do not call outside functions, at least in the target basic blocks. Detailed analysis with source code is in Appendix C.

5 Countermeasures

We introduce our hardware-based and software-based countermeasures against the branch shadowing attack.

5.1 Flushing Branch State

A fundamental countermeasure against the branch shadowing attack is to flush all branch states generated inside an enclave by modifying hardware or updating microcode.
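For intuition, the branch-history state that such flushing must clear can be sketched with a gshare-style predictor, the same family of predictor used later in the simulation. This is an illustration, not the proprietary BPU of a real CPU; the table size and shift amount are assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of gshare indexing with a 16-bit global history register.
 * The branch PC is XORed with the history to index the pattern table;
 * flushing the BPU corresponds to clearing this state. */
#define HISTORY_BITS 16
#define TABLE_SIZE   (1u << HISTORY_BITS)

static uint16_t ghr;   /* global history of the last 16 branch outcomes */

uint32_t gshare_index(uint64_t pc)
{
    return ((uint32_t)(pc >> 2) ^ ghr) & (TABLE_SIZE - 1);
}

/* Shift each branch outcome (1 = taken) into the history register. */
void gshare_update(int taken) { ghr = (uint16_t)((ghr << 1) | (taken & 1)); }

/* What an enclave exit would do under this countermeasure. */
void gshare_flush(void)       { ghr = 0; }
```

Because the index depends on the accumulated history, a shadow branch observing predictor state after a flush learns nothing about the branches the enclave executed before it.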
Figure 7: Instructions per cycle (normalized IPC) of the SPEC benchmarks as a function of BTB and BPU flushing frequency (no flushes; flush per 100, 1k, 10k, 100k, 1M, and 10M cycles).
Whenever an enclave context switch (via the EENTER, EEXIT, or ERESUME instructions or an AEX) occurs, the processor needs to flush the BTB and BPU states. Since the BTB and BPU benefit from local and global

We estimate the performance overhead of our countermeasure at different enclave context switching frequencies using a cycle-level out-of-order microarchitecture simulator, MacSim [30]. To simulate branch history flushing for every enclave context switch, we modified MacSim to flush the BTB and BPU every 100 to 10 million cycles; this resembles enclave context switching every 100 to 10 million cycles. The details of our simulation parameters are listed in Table 4. The BTB is modeled after the BTB in Intel Skylake processors. We used a method similar to that in [1, 58] to reverse engineer the BTB parameters. From our experiments, we found that the BTB is organized as a 4-way set-associative structure with a total of 4,096 entries. We model a simple branch predictor, gshare [37], for the simulation. We use traces

Table 4: Simulation parameters.
CPU: 4 GHz out-of-order core, 4-issue width, 256-entry ROB
L1 cache: 8-way 32 KB I-cache + 8-way 32 KB D-cache
L2 cache: 8-way 128 KB
L3 cache: 32-way 8 MB
BTB: 4-way, 1,024 sets
BPU: gshare, branch history length 16

Figure 9: Average BTB statistics (correct, mispredict, misfetch) according to the frequency of BTB and BPU flushing.
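The reverse-engineered BTB organization can be illustrated with a simple set-index computation. The real index hashing in Skylake is undocumented and more involved; here the low bits of the branch address are assumed to select the set:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a 4-way, 1,024-set BTB (4,096 entries total), matching the
 * organization used in the simulation. The indexing scheme here is an
 * assumption for illustration, not the measured Skylake hash. */
#define BTB_WAYS 4
#define BTB_SETS 1024

uint32_t btb_set_index(uint64_t pc)
{
    return (uint32_t)(pc & (BTB_SETS - 1));  /* low bits select the set */
}
```

With 4 ways per set, two branches that collide on the same set index compete for the same 4 entries, which is the structural property the attack's shadow branches exploit.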
ates less frequent system calls, which would be desired to avoid the Iago attack [9], flushing branch states at every enclave context switch will introduce negligible overhead.

5.2 Obfuscating Branch

Branch state flushing can effectively prevent the branch shadowing attack, but we cannot be sure when and whether such hardware changes will be realized. In particular, if such changes cannot be made through microcode updates, we cannot protect the Intel CPUs already deployed in the market.

Possible software-based countermeasures against the branch shadowing attack are to remove branches [39] or to use the state-of-the-art ORAM technique, Raccoon [44]. The data-oblivious machine learning algorithms of [39] eliminate all branches by using a conditional move instruction, CMOV. However, this approach is algorithm-specific, i.e., it is not applicable to general applications. Raccoon [44] always executes both paths of a conditional branch, such that no branch history will be leaked, but its performance overhead is high (21.8×).

Zigzagger. We propose a practical, compiler-based mitigation against branch shadowing, called Zigzagger. It obfuscates a set of branch instructions into a single indirect branch, as inferring the state of an indirect branch is more difficult than inferring those of conditional and unconditional branches (§3.5). However, it is not straightforward to compute the target block of each branch without relying on conditional jumps, because conditional expressions can become complex due to nested branches. In Zigzagger, we solved this problem by using CMOV instructions [39, 44] and introducing a sequence of unconditional jump instructions in lieu of each branch.

Figure 10: An example of Zigzagger transformation. (b) The code snippet protected by Zigzagger: all branch instructions are executed regardless of the a and b variables. An indirect branch in the trampoline and CMOVs in the translated code are used to obfuscate the final target address; r15 is reserved to store the target address.

Figure 10 shows how Zigzagger transforms an example code snippet having if, else-if, and else blocks. It converts all conditional and unconditional branches into unconditional branches targeting Zigzagger's trampoline, which jumps back and forth with the converted branches. The trampoline finally jumps to the real target address stored in a reserved register, r15. Note that reserving a register is only for improving performance; we can use memory to store the target address when an application needs to use a large number of registers. To emulate conditional execution, the CMOV instructions in Figure 10b update the target address in r15 only when a or b is zero. Otherwise, they are treated as NOP instructions. Since all of the unconditional branches are executed almost simultaneously in sequence, recognizing the current instruction pointer is difficult. Further, since the trampoline now has five different target addresses, inferring the real targets among them is not straightforward.

Zigzagger's approach has several benefits: 1) security: it provides a first line of protection for each branch block in an enclave program; 2) performance: its overhead is at most 2.19× (Table 5); 3) practicality: its transformation demands neither complex analysis of code semantics nor heavy code changes. However, it does not ensure perfect security, so we still need ORAM-like techniques to protect very sensitive functions.

Implementation. We implemented Zigzagger in LLVM 4.0 as an LLVM pass that converts the branches in each function and constructs the required trampoline. We also mod-
distributed analytics platform. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (Boston, MA, Mar. 2017).

[51] Shinde, S., Tien, D. L., Tople, S., and Saxena, P. Panoply: Low-TCB Linux applications with SGX enclaves. In Proceedings of the 2017 Annual Network and Distributed System Security Symposium (NDSS) (San Diego, CA, Feb.–Mar. 2017).

[52] Sinha, R., Costa, M., Lal, A., Lopes, N. P., Rajamani, S., Seshia, S. A., and Vaswani, K. A design and verification methodology for secure isolated regions. In Proceedings of the 2016 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (Santa Barbara, CA, June 2016).

[53] Sinha, R., Rajamani, S., Seshia, S., and Vaswani, K. Moat: Verifying confidentiality of enclave programs. In Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS) (Denver, Colorado, Oct. 2015).

[54] Song, C., Lee, B., Lu, K., Harris, W. R., Kim, T., and Lee, W. Enforcing kernel security invariants with data flow integrity. In Proceedings of the 23rd Annual Network and Distributed System Security Symposium (NDSS) (San Diego, CA, Feb. 2016).

[55] Strackx, R., and Piessens, F. Ariadne: A minimal approach to state continuity. In Proceedings of the 25th USENIX Security Symposium (Security) (Austin, TX, Aug. 2016).

[56] Trusted Computing Group. Trusted platform module (TPM) summary. http://www.trustedcomputinggroup.org/trusted-platform-module-tpm-summary/.

[57] Tsai, C.-C., Porter, D. E., and Vij, M. Graphene-SGX: A practical library OS for unmodified applications on SGX. In Proceedings of the 2017 USENIX Annual Technical Conference (ATC) (Santa Clara, CA, July 2017).

[58] Uzelac, V., and Milenkovic, A. Experiment flows and microbenchmarks for reverse engineering of branch predictor structures. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on (2009), IEEE, pp. 207–217.

[59] Weichbrodt, N., Kurmus, A., Pietzuch, P., and Kapitza, R. AsyncShock: Exploiting synchronisation bugs in Intel SGX enclaves. In Proceedings of the 21st European Symposium on Research in Computer Security (ESORICS) (Heraklion, Greece, Sept. 2016).

[60] Xu, Y., Cui, W., and Peinado, M. Controlled-channel attacks: Deterministic side channels for untrusted operating systems. In Proceedings of the 36th IEEE Symposium on Security and Privacy (Oakland) (San Jose, CA, May 2015).

[61] Yang, K., Hicks, M., Dong, Q., Austin, T., and Sylvester, D. A2: Analog malicious hardware. In Proceedings of the 37th IEEE Symposium on Security and Privacy (Oakland) (San Jose, CA, May 2016).

[62] Zhang, F. mbedtls-SGX: A SGX-friendly TLS stack (ported from mbedtls). https://github.com/bl4ck5un/mbedtls-SGX.

[63] Zhang, F., Cecchetti, E., Croman, K., Juels, A., and Shi, E. Town Crier: An authenticated data feed for smart contracts. In Proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS) (Vienna, Austria, Oct. 2016).

[64] Zhang, Y., Juels, A., Oprea, A., and Reiter, M. K. HomeAlone: Co-residency detection in the cloud via side-channel analysis. In Proceedings of the 32nd IEEE Symposium on Security and Privacy (Oakland) (Oakland, CA, May 2011).

[65] Zhang, Y., Juels, A., Reiter, M. K., and Ristenpart, T. Cross-VM side channels and their use to extract private keys. In Proceedings of the 19th ACM Conference on Computer and Communications Security (CCS) (Raleigh, NC, Oct. 2012).

A Manipulating Local APIC Timer

The local APIC is a component of Intel CPUs that configures and handles CPU-specific interrupts [26, §10]. An OS can program it through memory-mapped registers (e.g., the device configuration register) or model-specific registers (MSRs) to adjust the frequency of the local APIC timer, which generates high-resolution timer interrupts, and to deliver an interrupt to a CPU core (e.g., an inter-processor interrupt (IPI) or an I/O interrupt from the I/O APIC).

Intel CPUs support three local APIC timer modes: periodic, one-shot, and timestamp counter (TSC)-deadline modes. The periodic mode lets an OS configure the initial-count register, whose value is copied into the current-count register the local APIC timer uses. The current-count register's value decreases at the rate of the bus frequency, and when it becomes zero, a timer interrupt is generated and the register is re-initialized from the initial-count register. The one-shot mode lets an OS configure the initial-count value whenever a timer interrupt is generated. The TSC-deadline mode is the most advanced and precise timer mode, allowing an OS to specify when the next timer interrupt will occur in terms of a TSC value. Our target Linux system (kernel version 4.4) uses the TSC-deadline mode, so we focus on this mode.

Figure 11 shows how we modified the lapic_next_deadline() function specifying the next TSC deadline and the local_apic_timer_interrupt() function called whenever a timer interrupt is fired. We made and exported two global variables and a function pointer to manipulate the behaviors of lapic_next_deadline() and local_apic_timer_interrupt() with a kernel module: lapic_next_deadline_delta to change the delta; lapic_target_cpu to specify a virtual CPU running a victim enclave process (via a CPU affinity); and timer_interrupt_hook to specify a function to be called whenever a timer interrupt is generated. In our evaluation environment with an Intel Core i7-6700K CPU (4GHz), 1,000 was the minimum workable delta value; i.e., it fires a timer interrupt about every 1,000 cycles. Note that, in our environment, a delta value lower than 1,000 made the entire system freeze because a new timer interrupt was generated before the old timer interrupt was handled by the interrupt handler.

B Modifying SGX Driver

Figure 12 shows how we modified the Intel SGX driver for Linux to manipulate the base address of an enclave.
cols.

These protocols make use of certificates issued to each client that indicate that a particular client controls a specific phone number. In prior work we proposed a full public key infrastructure for telephony [56] called a "TPKI" that would have as its root the North American Numbering Plan Administration, with licensed carriers acting as certificate authorities. This PKI would issue an authoritative certificate that a phone number is owned by a particular entity, and AuthentiCall could enforce that calls take place between the entities specified in those certificates. While AuthentiCall can leverage the proposed TPKI, a fully deployed TPKI is not necessary, as AuthentiCall can act as its own certificate authority (this is discussed further in the enrollment protocol).

Figure 3: Our enrollment protocol confirms phone number ownership and issues a certificate.
All of these protocols make use of a client-server architecture, where an AuthentiCall server acts as either an endpoint or intermediary between user clients. There are several reasons for this design choice. First, having a centralized relay simplifies the development of AuthentiCall. Although there are risks of adding a centralized point on a distributed infrastructure, our design minimizes them by distributing identity verification to a certificate authority and only trusting a central server to act as a meeting point for two callers. Second, it allows the server to prevent abuses of AuthentiCall like robodialing [71] by a single party by implementing rate limiting. The server can authenticate callers before allowing the messages to be transmitted, providing a mechanism for banning misbehaving users. Finally, all protocols (including handshake and enrollment) implement end-to-end cryptography. Assuming the integrity of the AuthentiCall certificate authority infrastructure and the integrity of the client, no other entity of the AuthentiCall network can read or fabricate protocol messages. We also assume that all communications between clients and servers use a secure TLS configuration with server authentication.

Our protocols have another goal: no human interaction except for choosing to accept a call. There are two primary reasons for this. First, it is well established that ordinary users (and even experts) have difficulty executing secure protocols correctly [76]. Second, in other protocols that rely on human interaction, the human element has been shown to be the most vulnerable [63].

The following subsections detail the three protocols in AuthentiCall. First, the enrollment protocol ensures that a given AuthentiCall user actually controls the phone number they claim to own (G1). The enrollment protocol also issues a certificate to the user. Second, the handshake protocol mutually authenticates two calling parties at call time (G2 and G3). Finally, the call integrity protocol ensures the security of the voice channel and the content it carries (G4 and G5).

4.1 Enrollment Protocol

The enrollment protocol ensures that a client controls a claimed number and establishes a certificate that binds the identity of the client to a phone number. For our purposes, "identity" may be a user's name, organization, or any other pertinent information. Binding the identity to a phone number is essential because phone numbers are used as the principal basis of identity and routing in phone networks, and they are also used as such with AuthentiCall. The enrollment protocol is similar to other certificate issuing protocols but with the addition of a confirmation of control of the phone number.

Figure 3 shows the details of the enrollment protocol. The enrollment protocol has two participants: a client C and an AuthentiCall enrollment server SCA. In message 1, C sends an enrollment request with SCA's identity, C's identity info, C's phone number, and C's public key. In message 2, the server sends a nonce NNet, the identities of C and SCA and the phone numbers of C and SCA with a timestamp to ensure freshness, liveness, and to provide a "token" for this particular authentication session.

In message 3, the server begins to confirm that C controls the phone number it claims. The number is confirmed when SCA places a call to C's claimed phone number. When the call is answered, SCA transmits a nonce over the voice channel. Having SCA call C is a critical detail because intercepting calls is far more difficult than spoofing a source number.2 Using a voice call is important because it will work for any phone, including VoIP devices that may not have SMS access. In message 4, C sends both NNet and NAudio along with the IDs of server, clients, a timestamp, and a signature covering all other fields. This final message concludes the proof of three things: possession of NNet, the ability

2 We will revisit the threat of call interception later in this subsection.
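The two-channel nonce exchange at the heart of enrollment can be sketched as a small simulation. This is illustrative only: the class and field names, the use of hex token nonces, and the dictionary "certificate" are assumptions for the sketch, not the paper's wire format or signature scheme, and real message 4 is additionally signed by the client.

```python
import secrets
import time

class EnrollmentServer:
    """Toy S_CA: issues one nonce over the data channel and one over
    a voice call to the claimed number, then certifies a client that
    can return both (proving control of both channels)."""
    def __init__(self):
        self.sessions = {}

    def begin(self, client_id, phone_number, client_pubkey):
        n_net = secrets.token_hex(16)    # message 2: data-channel nonce
        n_audio = secrets.token_hex(16)  # message 3: spoken over the voice call
        self.sessions[client_id] = (n_net, n_audio, phone_number,
                                    client_pubkey, time.time())
        return n_net, n_audio

    def finish(self, client_id, n_net, n_audio):
        # message 4 check: returning both nonces proves control of the
        # data account and of the claimed phone line
        sess = self.sessions.pop(client_id, None)
        if sess is None or (n_net, n_audio) != sess[:2]:
            return None
        # message 5: a toy "certificate" binding identity, number, and key
        return {"id": client_id, "phone": sess[2], "pubkey": sess[3]}
```

In the real protocol the two nonces travel over different channels (TLS versus call audio), which is what makes the proof of phone-number control meaningful.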
message is signed with the public key in the certificate issued by AuthentiCall.

After message 4, both sides combine their Diffie-Hellman secret with the share they received to generate the derived secret. Each client then generates keys using the Diffie-Hellman result, the timestamps of both parties, and the nonces of both parties. These keys are used to continue the handshake and to provide keys for the integrity protocol.

Messages 5A and 5B contain an HMAC of messages 4A and 4B along with a string to differentiate message 5A from message 5B. The purpose of this message is to provide key confirmation that both sides of the exchange have access to the keys generated after messages 4A and 4B. This message concludes the handshake protocol.

4.3 Call Integrity Protocol

The call integrity protocol binds the handshake conducted over the data network to the voice channel established over the telephone network. Part of this protocol confirms that the voice call has been established and confirms when the call ends. The remainder of the messages in this protocol exchange content authentication information for the duration of the call. This content integrity takes the form of short "digests" of call audio (we discuss these digests in detail in the following section). These digests are effectively heavily compressed representations of the call content; they allow for detection of tampered audio at a low bit rate. Additionally, the digests are exchanged by both parties and authenticated with HMACs.

Figure 5 shows the details of the call integrity protocol. The protocol begins after the voice call is established. Both caller R and callee E send a message indicating that the voice call is complete. This message includes a timestamp, the IDs of the communicating parties, and the HMAC of all of these values. The timestamp is generated using the phone clock, which is often synchronized with the carrier.4 These messages are designed to prevent attacks where a call is redirected to another phone. One possible attack is an adversary maliciously configuring call forwarding on a target; the handshake would be conducted with the target, but the voice call would be delivered to the adversary. In such a case, the target would not send a "call established" message and the attack would fail.

Figure 5: Our call integrity protocol protects all speech content.

Once the voice call begins, each side will encrypt and send the other audio digests at a regular interval. It is important to note that we use unique keys generated during the handshake for encryption, message authentication codes, and digest calculation. The messages also guarantee freshness because the index is effectively a timestamp, and the message authentication codes are computed under a key unique to this call. Timestamps in messages 1-N are indexed against the beginning of the call, negating the need for a synchronized clock. In order to prevent redirection attacks, the messages are bound to the identities of the communicating parties by including the IDs in the HMACs and by using keys for the HMACs that are unique to the call.

When the voice call ends, each side sends a "call concluded" message containing the client IDs, a timestamp, and their HMAC. This alerts the end point to expect no more digests. It also prevents a man-in-the-middle from continuing a call that the victim has started and authenticated.

4.4 Evaluation

Our protocols use standard constructions for certificate establishment, certificate-based authentication, authenticated key establishment, and message authentication. We therefore believe our protocols are secure based on inspection. Nevertheless, we used ProVerif [20] to

4 In this setting, loose clock synchronization (approximately one minute) is sufficient; if necessary, S can also provide a time update at login.
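The binding described above, with digest messages carrying an index and both parties' IDs authenticated under a per-call key, can be sketched with the standard library. The field layout and encoding below are illustrative assumptions, not AuthentiCall's actual message format.

```python
import hashlib
import hmac

def digest_message(call_key, caller_id, callee_id, index, audio_digest):
    """Build one authenticated audio-digest message. The HMAC covers
    the call index and both identities, so a message can be neither
    replayed in another call (different key) nor redirected between
    parties (different IDs)."""
    payload = b"|".join([caller_id, callee_id,
                         index.to_bytes(4, "big"), audio_digest])
    tag = hmac.new(call_key, payload, hashlib.sha256).digest()
    return payload, tag

def verify(call_key, payload, tag):
    """Constant-time check of a received digest message."""
    expected = hmac.new(call_key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)
```

Because the per-call key is derived during the handshake, a man-in-the-middle who never completed that handshake cannot produce valid tags for either the digest or the "call concluded" messages.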
5.2.1 Robustness

Robustness is one of the most critical aspects of our speech digests, and it is important to show that these digests will not significantly change after audio undergoes any of the normal processes that occur during a phone call. These include the effects of various audio encodings, synchronization errors in audio, and noise.

To test robustness, we generate modified audio from the TIMIT corpus and compare the BER of digests of standard TIMIT audio to digests of degraded audio. We first downsample the TIMIT audio to a sample rate of 8kHz, which is standard for most telephone systems. We used the sox [5] audio utility for downsampling and for adding delay to audio to model synchronization error. We also used sox to convert the audio to two common phone codecs, AMR-NB (Adaptive Multi-Rate Narrow Band) and GSM-FR (Groupe Spécial Mobile Full-Rate). We used GNU Parallel [67] to quickly compute these audio files. To model frame loss behavior, we use a Matlab simulation that implements a Gilbert-Elliot loss model [32]. Gilbert-Elliot models bursty losses using a two-state Markov model parameterized by probabilities of individual and continued losses. We use the standard practice of setting the probability of an individual loss (p) and the probability of continuing the burst (1 − r) to the desired loss rate of 5% for our experiments. We also use Matlab's awgn function to add Gaussian white noise at a 30 decibel signal-to-noise ratio.

Figure 6 shows boxplots representing the distribution of BER rates for each type of degradation tested. All degradations show a fairly tight BER distribution near the median with a long tail. We see that of the effects tested, 10ms delay has the least effect; this is a result of the fact that the digest windows the audio with a high overlap. For most digests, addition of white noise also has little effect; this is because LSF analysis discards all frequency information except for the most important frequencies. We see higher error rates caused by the use of audio codecs like GSM-FR and AMR-NB; these codecs significantly alter the frequency content of the audio. We can also see that a 5% loss rate has negligible effect on the audio digests. Finally, we see that combining transcoding, loss, delay, and noise has an additive effect on the resulting digest error; in other words, the more degradation that takes place, the higher the bit error. These experiments show that RSH is robust to common audio modifications.

Figure 7: This graph shows the histogram and kernel density estimate of digests of adversarial audio on over 250 million pairs of 1-second speech samples. While the majority of legitimately modified audio has digest errors less than 35%, adversarial audio has digest BERs averaging 47.8%.
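The Gilbert-Elliot loss simulation described above can be sketched in a few lines (a minimal two-state Markov chain; the paper's Matlab implementation may differ in details):

```python
import random

def gilbert_elliott(n_frames, p, r, seed=1):
    """Two-state bursty loss model: from the good state, a loss burst
    starts with probability p; an ongoing burst continues with
    probability 1 - r. The stationary loss rate is p / (p + r)."""
    rng = random.Random(seed)
    lost, in_burst = [], False
    for _ in range(n_frames):
        if in_burst:
            in_burst = rng.random() >= r   # continue burst with prob 1 - r
        else:
            in_burst = rng.random() < p    # enter burst with prob p
        lost.append(in_burst)
    return lost

# Setting p = 0.05 and 1 - r = 0.05 (r = 0.95), as in the experiments,
# gives a stationary loss rate of 0.05 / (0.05 + 0.95) = 5%.
```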
7.4 Speech Digest Performance

Our final experiments evaluate our speech digest accuracy over real call audio. In these 10 calls, we play 10 sentences from 10 randomly selected speakers in the TIMIT corpus through the call, and our AuthentiCall implementation computed the sent and received digests. In total this represented 360 seconds of audio. For simplicity, a caller sends audio and digests, and a callee receives the audio and compares the received and locally computed digests. We also compared these 10 legitimate call digests with an "adversary call" containing different audio from the hashes sent by the legitimate caller. To compare our live call performance to simulated audio from Section 5, we first discuss our individual-hash accuracy.

Figure 11 shows the cumulative distribution of BER for digests of legitimate audio calls and audio sent by an adversary. The dotted line represents our previously established BER threshold of 0.348.

Figure 11: This figure shows that 93.4% of individual digests of adversarial audio are correctly detected while 95.5% of individual digests of legitimate audio are detected as authentic. Using a 3-out-of-5 detection scheme, 96.7% of adversarial audio is detected.

First, in testing with adversarial audio, we see that 93.4% of the individual fraudulent digests were detected as fraudulent. Our simulation results saw an individual digest detection rate of 90%, so our real calls see even better performance. Using our 3-out-of-5 standard for detection, we detected 96.7%. This test shows that AuthentiCall can effectively detect tampering in real calls. Next, for legitimate calls, 95.5% of the digests were properly marked as authentic audio. Using our 3-out-of-5 standard, we saw no five-second frames that were marked as tampered.

While our individual-hash false positive rate of 4.5% was low, we were surprised that the performance differed from our earlier evaluation on simulated degradations. Upon further investigation, we learned that our audio was being transmitted using the AMR-NB codec set to the lowest possible quality setting (4.75kbps); this configuration is typically only used when reception is exceptionally poor, and we anticipate this case will be rare in deployment. Nevertheless, there are several mechanisms that can correct for this. One option would be to digest audio after compression for transmission (our prototype uses the raw audio from the microphone); such a scheme would reduce false positives partially caused by known-good transformation of audio. Another option is to simply accept these individual false positives. Doing so would result in a false alert on average every 58 minutes, which is still acceptable as most phone calls last only 1.8 minutes [3].

8 Discussion

We now discuss additional issues related to AuthentiCall.

Applications and Use Cases: AuthentiCall provides a mechanism to mitigate many open security problems in telephony. The most obvious problems are attacks that rely on Caller ID fraud, like the perennial "IRS scams" in the United States. Another problem is that many institutions, including banks and utilities, use extensive and error-prone challenge questions to authenticate their users. These challenges are cumbersome yet still fail to stop targeted social engineering attacks. AuthentiCall offers a strong method to authenticate users over the phone, increasing security while reducing authentication time and effort.

Another valuable use case is emergency services, which have faced "swatting" calls that endanger the lives of first responders [73] as well as denial of service attacks that have made it impossible for legitimate callers to receive help [8]. AuthentiCall provides a mechanism to allow essential services to prioritize authenticated calls in
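The detection rule used throughout this evaluation, comparing each one-second digest's bit error rate against the 0.348 threshold and flagging a five-second frame when at least 3 of its 5 digests exceed it, can be sketched directly:

```python
def bit_error_rate(sent_bits, received_bits):
    """Fraction of differing bits between two equal-length digests."""
    assert len(sent_bits) == len(received_bits)
    diffs = sum(a != b for a, b in zip(sent_bits, received_bits))
    return diffs / len(sent_bits)

def frame_tampered(bers, threshold=0.348):
    """3-out-of-5 rule: a five-second frame is flagged as tampered
    when at least 3 of its 5 one-second digest BERs exceed the
    threshold established in Section 5."""
    assert len(bers) == 5
    return sum(ber > threshold for ber in bers) >= 3
```

Requiring 3 of 5 digests to agree is what suppresses the occasional individual false positive: a single degraded one-second digest cannot by itself mark a frame as tampered.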
2016.

[71] H. Tu, A. Doupé, Z. Zhao, and G.-J. Ahn. SoK: Everyone Hates Robocalls: A Survey of Techniques against Telephone Spam. 2016 IEEE Symposium on Security and Privacy (S&P), 2016.

[72] H. Tu, A. Doupé, Z. Zhao, and G.-J. Ahn. Toward Authenticated Caller ID Transmission: The Need for a Standardized Authentication Scheme in Q.731.3 Calling Line Identification Presentation. In Proceedings of the ITU Kaleidoscope (ITU), Nov. 2016.

[73] D. Tynan. The terror of swatting: how the law is tracking down high-tech prank callers. https://www.theguardian.com/technology/2016/apr/15/swatting-law-teens-anonymous-prank-call-police, Apr 2016.

[74] V. Prevelakis and D. Spinellis. The Athens Affair. IEEE Spectrum, June 2007.

[75] X. Wang and R. Zhang. VoIP Security: Vulnerabilities, Exploits and Defenses. Elsevier Advances in Computers, March 2011.

[76] A. Whitten and J. D. Tygar. Why Johnny Can't Encrypt: A Usability Evaluation of PGP 5.0. In Proceedings of the 8th USENIX Security Symposium, 1999.

[77] P. Zimmermann, A. Johnston, and J. Callas. RFC 6189 ZRTP: Media Path Key Agreement for Unicast Secure RTP. Internet Engineering Task Force, 2011.

A RSH Digest Construction

In this appendix, we describe the construction of the RSH digest used by AuthentiCall for channel binding and content integrity. There are a number of constructions of speech digests, and they all use the following basic process. First, they compute derived features of speech. Second, they define a compression function to turn the real-valued features into a bit string. In this paper, we use the construction of Jiao et al. [36], which they call RSH. We chose this technique over others because it provides good performance on speech at a low bitrate, among other properties. We note that the original work did not evaluate the critical case where an adversary can control the audio being hashed. Our evaluation shows that RSH maintains audio integrity in this crucial case. The construction also selects audio probabilistically; we show in Appendix B that the digest indeed protects all of the semantic content in the input audio. Finally, to our knowledge we are the first to use any robust speech digest for an authentication and integrity scheme.

Figure 13 illustrates how RSH computes a 512-bit digest for one second of audio. In the first step of calculating a digest, RSH computes the Line Spectral Frequencies (LSFs) of the input audio. LSFs are used in speech compression algorithms to represent the major frequency components of human voice, which contain the majority of semantic information in speech. That is, LSFs represent phonemes, the individual sound units present in speech. While pitch is useful for speaker recognition, LSFs are not a perfect representation of all of the nuances of human voice. This is one reason why it is sometimes difficult for humans to confidently recognize voices over the phone. This means that the digest more accurately represents semantic content rather than the speaker's voice characteristics. This is important because a number of techniques are able to synthesize new speech that evades speaker recognition from existing voice samples [38, 66]. Finally, LSFs are numerically stable and robust to quantization, meaning that modest changes in input yield small changes in output. In RSH, the input audio is grouped into 30ms frames with 25ms audio overlap between frames, and 10 line spectral frequencies are computed for each frame to create a matrix L.

Figure 13: This figure illustrates the digest construction described in Section 5.1. Audio digests summarize call content by taking one second of speech data, deriving audio features from the data, and compressing blocks of those features into a bit string.

The second phase of digest computation involves compressing the large amount of information about the audio into a digest. Because audio rarely changes on millisecond time scales, the representation L is highly redundant. To compress this redundant data, RSH uses the two-dimensional discrete cosine transform (DCT). The DCT is related to the Fourier transform, is computationally efficient, and is commonly used in compression algorithms (e.g., JPEG, MP3). RSH computes the DCT over different sections of the matrix L to produce the final digest. RSH uses only the first eight DCT coefficients (corresponding to the highest-energy components and discarding high-frequency information).

The compression function uses the DCT algorithm in the computation of the bitwise representation of the audio sample. The following process generates 8 bits of a digest; it is repeated 64 times to generate a 512-bit digest.

1. Obtain a window size w and two window start indexes l1 and l2 from the output of a keyed pseudorandom function.

2. Select from L two blocks of rows. These blocks B1 and B2 contain all columns from l1 : l1 + w and l2 : l2 + w respectively.

3. Compress these individual blocks into eight coefficients each using the DCT.

4. Set eight digest bits by whether the corresponding coefficients of the first block (B1) are greater than the coefficients
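The four-step compression loop above can be sketched as follows. The SHA-256-based keyed function, the window-size derivation, and the flattening of each block before the DCT are illustrative assumptions for this sketch; RSH's actual keyed function and block selection follow Jiao et al. [36].

```python
import hashlib
import math

def dct8(xs):
    """First eight DCT-II coefficients of a sequence (the
    highest-energy components; higher frequencies are discarded)."""
    n = len(xs)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(xs))
            for k in range(8)]

def digest_8bits(L, key, round_no):
    """One compression round: a keyed PRF picks a window size w and
    two column windows of L; each block is compressed to eight DCT
    coefficients and compared coefficient-by-coefficient (step 4)."""
    ncols = len(L[0])
    # toy keyed PRF over (key, round number); an assumption, not RSH's
    h = hashlib.sha256(key + round_no.to_bytes(4, "big")).digest()
    w = 1 + h[0] % (ncols // 2)
    l1 = h[1] % (ncols - w)
    l2 = h[2] % (ncols - w)
    b1 = [x for row in L for x in row[l1:l1 + w]]   # block B1, flattened
    b2 = [x for row in L for x in row[l2:l2 + w]]   # block B2, flattened
    c1, c2 = dct8(b1), dct8(b2)
    return [int(a > b) for a, b in zip(c1, c2)]

def rsh_like_digest(L, key):
    """512-bit digest: 64 rounds of 8 bits over the LSF matrix L
    (10 rows of per-frame line spectral frequencies)."""
    bits = []
    for r in range(64):
        bits.extend(digest_8bits(L, key, r))
    return bits
```

Because the windows are chosen by a keyed pseudorandom function, an adversary who does not know the call key cannot predict which parts of the audio each byte of the digest will cover.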
another place.

In particular, the malicious app on the payer's phone brings itself to the foreground and takes a picture when it discovers that the payment app on the same phone is in the
Mark O’Neill Scott Heidbrink Scott Ruoti Jordan Whitehead Dan Bunker
Luke Dickinson Travis Hendershot Joshua Reynolds
Kent Seamons Daniel Zappala
Brigham Young University
mto@byu.edu, sheidbri@byu.edu, ruoti@isrl.byu.edu, jaw@byu.edu, dbunked@gmail.com
luke@isrl.byu.edu, tmanhendy@isrl.byu.edu, joshua@isrl.byu.edu
seamons@cs.byu.edu, zappala@cs.byu.edu
3 TrustBase

TrustBase is motivated by the need to fix broken authentication in applications and strengthen the CA system, using the two motivating principles that authentication should be centralized in the operating system and system administrators should be in control of authentication policies on their machines. In this section, we discuss the threat model, design goals, and architecture of the system.

Figure 1: TrustBase architecture overview

3.1 Threat Model

In our threat model, an active attacker attempts to impersonate a remote host by providing a fake certificate. Our attacker includes remote hosts as well as MITM attackers located anywhere along the path to a remote host. The goal of the attacker is to establish a secure connection with the client.

The application under attack may accept the fake certificate for the following reasons:

• The application employs incorrect certificate validation procedures (e.g., limited or no validation) and the attacker exploits his knowledge of this to trick the application into accepting his fake certificate.

• The attacker or malware managed to place a rogue certificate authority into the user's root store (or another trust store used by the application) so that he has become a trusted certificate authority. The fake certificate authority's private key was then used to generate the fake certificate used in the attack.

• Non-privileged malware has altered or hooked security libraries the application uses to force acceptance of fake certificates (e.g., via malicious OpenSSL hooks and LD_PRELOAD).

• A legitimate certificate authority was compromised or coerced into issuing the fake certificate to the attacker.

Local attackers (malware) with root privilege are outside the scope of our threat model. In addition, we consider only certificates from TLS connections directly made from the local system to a designated host, and not those that may be present in streams higher up in the OSI stack or indirectly from other hosts or proxies via protocols like onion routing.

3.2 Design Goals

The design goals for TrustBase are: (1) Secure existing applications. TrustBase should override incorrect or absent validation of certificates received via TLS in current applications. (2) Strengthen the CA system. TrustBase should provide simple deployment of authentication services that strengthen the validation provided by the CA system. (3) Full application coverage. All incoming certificates should be validated by TrustBase, including those provided to both existing applications and future applications. However, this does not include certificates from connections not made directly by the system, such as certificates delivered through onion routing. (4) Universal deployment. The TrustBase architecture should be designed to work on any major operating system, including both desktop and mobile platforms. (5) Negligible overhead. TrustBase should have negligible performance overhead. This includes ensuring that the user experience for applications is not affected in any way, except when TrustBase prevents an application from establishing an insecure connection.

3.3 Architecture

The architecture for TrustBase is given in Figure 1. These components are described below:

3.3.1 Traffic Interceptor

The traffic interceptor intercepts all network traffic and delivers it to registered handlers for further processing. The interceptor is generic, lightweight, and can provide traffic to any type of handler. Traffic for any specific stream is intercepted only as long as a handler is interested in it. Otherwise, traffic is routed normally.

The traffic interceptor is needed to secure existing applications. If a developer is willing to modify her application to call TrustBase directly for certificate validation, then she can use the validation API. Administrators can con-
3.5 TLS 1.3 and Overriding the CA System 3.6 Operating System Support
There are two situations where TrustBase cannot use de- We designed the TrustBase architecture so that it could
fault traffic interception to accomplish its primary goals. be implemented on additional operating systems. The
First, when an application uses TLS 1.3, the certificates main component that may need to be customized for each
that are exchanged are encrypted, preventing TrustBase operating system is the traffic interception module. We
from using passive traffic interception to independently are optimistic that this is possible because the TCP/IP
validate certificates. Second, in some cases a system ad- networking stack and sockets are used by most operating
ministrator may want to distrust the CA system entirely systems.
and rely solely on alternative authentication services. For Our Linux implementation is described in the following
example, the administrator may want to force applications section. We also have a working prototype of TrustBase
currently using CA validation to accept self-signed certifi- for Windows, which uses the Windows Filtering Platform
cates that have been validated using a notary system such API.
as Convergence[31], or she may want to use DANE[21] Mac OSX provides a native interface for traffic inter-
with trust anchors that differ from those shipped with the ception between the TCP and socket levels of the oper-
system. When this occurs, TrustBase will use the new au- ating system. Apple’s Network Kernel Extensions suite
thentication service and determine the certificate is valid provides a “Socket Filter” API that could be used as the
and allow a connection as configured by the administra- traffic interceptor.
tor, but applications using the CA system may reject the For iOS, Apple provides a Network Extension frame-
certificate and terminate the connection. We stress that work that includes a variety of APIs for different kinds of
such a policy would not be intended to override strong traffic interception. The App Proxy Provider API allows
certificate checks done by a browser (e.g., when commu- developers to create a custom transparent proxy that cap-
nicating with a bank), but to provide a path for migrating tures application network data. Also available is the Filter
away from the CA system as stronger alternatives emerge. Data Provider API that allows examination of network
To handle both TLS 1.3 and overriding the CA system, data with built-in “pass/block” functionality.
TrustBase provides two options. The preferred option is Because Android uses a variant of the Linux kernel,
to modify applications to rely on TrustBase for certificate we believe our Linux implementation could be ported
validation, rather than performing their own checks. This to Android with relative ease. We have a prototype of
is facilitated by the validation API described above. This TrustBase on Android that instead uses the VPNService
enables new or modified applications to use the full set of to intercept traffic.
authentication services provided by TrustBase in a natural
manner.
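As an illustration of the validation-API route from Section 3.5, a modified application could delegate the whole validation decision to the system service. The Python sketch below invents all names (it is not TrustBase's actual API); the stub policy accepts a pinned self-signed certificate for one host, standing in for a notary- or DANE-backed plugin.

```python
# Hypothetical sketch: an application defers certificate validation to a
# system authentication service instead of running its own CA checks.

def validate_with_service(service, hostname, chain):
    """Hand (hostname, certificate chain) to the OS service; honor its verdict."""
    return service(hostname, chain)

# Stand-in policy engine: trust one pinned self-signed certificate, defer
# everything else to (stubbed-out) authentication plugins.
PINNED = {"internal.example.com": b"self-signed-cert"}

def demo_policy(hostname, chain):
    if hostname in PINNED:
        return bool(chain) and chain[0] == PINNED[hostname]
    return False  # a real engine would consult its configured plugins here

ok = validate_with_service(demo_policy, "internal.example.com",
                           [b"self-signed-cert"])
```

The point of the shape is that the application never inspects the chain itself, so the administrator's configured services decide uniformly for all such applications.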
A second option is to employ a local TLS proxy that can coerce existing applications that rely on the CA system to use new authentication services instead. The use of a proxy also allows TrustBase plaintext access to the server's certificate under TLS 1.3. TrustBase gives the administrator the option of running such a proxy, but it is activated only in those cases where it is needed, namely when the policy engine determines a certificate is valid but the CA system would reject it. The proxy employed is a modified fork of sslsplit [38] and has shown itself

4 Linux Implementation

We have designed and built a research prototype of TrustBase for Linux. The source code is available at owntrust.org.
We have developed a loadable kernel module (LKM) to intercept traffic at the socket layer, as data transits between the application and TCP handling kernel code. No modification of native kernel code is required, and the LKM can be loaded and unloaded at runtime. Similarly to
how Netfilter operates at the IP layer, TrustBase can intercept traffic at the socket layer, before data is delivered for TCP handling, and pass it to application-level programs, where it can be (optionally) modified and then passed back to the native kernel code for delivery to the original TCP functionality. Likewise, interception can occur after data finishes TCP processing and before it is delivered to the application. This enables TrustBase to efficiently intercept TLS connections in the operating system and validate certificates in the application layer.
The following discussion highlights the salient features of our implementation.

4.1 Traffic Interceptor

TrustBase provides generic traffic interception by capturing traffic between sockets and the TCP protocol. This is done by hooking several kernel functions and wrapping them to add traffic interception as needed. An overview of which functions are hooked and how they are modified is given in Figure 2. Items in white boxes on the left side of the figure are system calls. Items in white boxes on the right side of the figure are the wrapped kernel functions. The additional logic added to the native flow of the kernel is shown by the arrows and gray boxes in Figure 2.
When the TrustBase LKM is loaded, it hooks into the native TCP kernel functions whose pointers are stored in the global kernel structures tcp_prot (for IPv4) and tcpv6_prot (for IPv6). When a user program invokes a system call to create a socket, the function pointers within the corresponding protocol structure are copied into the newly-created kernel socket structure, allowing different protocols (TCP, UDP, TCP over IPv6, etc.) to be invoked by the same common socket API. The function pointers in the protocol structures correspond to basic socket operations such as sending and receiving data, and creating and closing connections. Application calls to read, write, sendmsg, and other system calls on that socket then use those protocol functions to carry out their operations within the kernel. Note that within the kernel, all socket-reading system calls (read, recv, recvmsg, and recvmmsg) eventually call the recvmsg function provided by the protocol structure. The same is true for the corresponding socket write system calls, as each results in calling the kernel sendmsg function. When the LKM is unloaded, the original TCP functionality is restored in a safe manner.
From top to bottom in Figure 2, the functionality of the traffic interceptor is as follows. First, a call to connect informs the handler that a new connection has been created, and the handler can choose to intercept its traffic.
Second, when an application makes a system call to send data on the socket, the interceptor checks with the handler to determine if it is tracking that connection. If so, it forwards the data to the handler for analysis, and the handler chooses what data (potentially modified by the handler), if any, to relay to native TCP-sending code. After attempting to send data, the interceptor informs the handler how much of that data was successfully placed into the kernel's send buffer and provides notification of any errors that occurred. At this point the interceptor allows the handler to send additional data, if desired. This process continues until the handler indicates it no longer wishes to send data. The interceptor then queries the handler for the return values it wishes to report to the
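The hooking just described happens on kernel function pointers, but the wrapping pattern itself is easy to show in userspace. Below is a minimal Python sketch of that pattern (all names are invented for illustration; this is not TrustBase code): a saved native send function plays the role of the original tcp_prot entry, and the wrapper consults a handler before relaying data.

```python
# Userspace sketch of the function-pointer wrapping pattern described above.
# The real interceptor swaps pointers inside the kernel's tcp_prot structure;
# here a saved callable stands in for the native kernel sendmsg path.

class Handler:
    """Decides whether a connection is tracked and may inspect outgoing data."""
    def __init__(self, tracked=True):
        self.tracked = tracked
        self.seen = []          # data observed so far (e.g., TLS handshake bytes)

    def on_send(self, data):
        self.seen.append(data)  # analyze the outgoing stream
        return data             # a handler may also return modified data

class InterceptedSocket:
    def __init__(self, native_sendmsg, handler):
        self._native_sendmsg = native_sendmsg  # saved "protocol function pointer"
        self._handler = handler

    def sendmsg(self, data):
        if not self._handler.tracked:          # untracked: route normally
            return self._native_sendmsg(data)
        data = self._handler.on_send(data)     # tracked: the handler sees it first
        return self._native_sendmsg(data)      # then relay to the native path

sent = []
def native_sendmsg(data):        # stand-in for the kernel's TCP send path
    sent.append(data)
    return len(data)

handler = Handler()
sock = InterceptedSocket(native_sendmsg, handler)
n = sock.sendmsg(b"\x16\x03\x01hello")  # e.g., the start of a TLS record
```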
Figure 4: Handshake Timings for TCP (left) and TLS (right) handshakes with and without TrustBase running.

chain that also employed an intermediate authority, representing a realistic circumstance for web browsing and forcing plugins to execute all of their validity checks. Our testing used a modern PC running Fedora 21 and averaged across 1,000 trials.
Figure 4 shows boxplots that characterize the timing of TCP and TLS handshakes, with and without TrustBase active. There is no discernible difference for TCP handshake timings; the average difference is less than 10 microseconds, with neither configuration consistently beating the other in subsequent experiments. This is expected behavior because the traffic interceptor is extremely lightweight for TCP connections. Average TLS handshake times with and without TrustBase also show no discernible difference, with average handshake times for this experiment of 5.9 ms and 6.0 ms, respectively. Successive experiments showed again that neither average consistently beat the other. This means that the inherent fluctuations in system and network conditions account for more time than the additional control paths TrustBase introduces. This is also expected, as the brevity of the TLS handling code, its place in the kernel, the use of efficient Netlink transport, and other design choices were made with performance in mind.
Our experimentation with varying file sizes also exhibited no difference between the native and TrustBase cases. Note that the TrustBase timings for the TLS handshake may increase if a particular plugin is installed that requires more processing time or relies on Internet queries to function; this overhead is inherent to that service and not to the TrustBase core.
The memory footprint of our Linux prototype is also negligible. For each network connection, TrustBase temporarily stores less than 300 bytes of data, plus the length of any TLS handshake messages encountered. Connections not using TLS use even less memory than this and

6.2 Compatibility

One goal of TrustBase is to strengthen certificate authentication for existing, unmodified applications and to provide additional authentication services that strengthen the CA system. To meet this goal, TrustBase must be able to enforce proper authentication behavior by applications, as defined by the system administrator's configuration.
There are three possible cases for the policy engine to consider. (1) If a certificate has been deemed valid by both TrustBase and the application, the policy engine allows the original certificate data to be forwarded on to the application, where it is accepted naturally. (2) In the case where the application wishes to block a connection, regardless of the decision by TrustBase, the policy engine allows this to occur, since the application may have a valid reason to do so. We discuss in Section 3.5 the special case when a new authentication service is deployed that wishes to accept a certificate that the CA system normally would not. (3) In the case where validation with TrustBase fails, but the application would have allowed the connection to proceed, the policy engine blocks the connection by forwarding an intentionally invalid certificate to the application, which triggers any SSL validation errors the application supports, and then subsequently terminates the connection.
We tested TrustBase with 34 popular applications, libraries, and tools, shown in Table 1.⁷ TrustBase successfully intercepted and validated certificates for all of them. For each library tested, and where applicable, we created sample applications that performed no validation and improper validation (bad checking of signatures, hostnames, and validity dates). We then verified that TrustBase correctly forced these applications to reject false certificates despite those vulnerabilities in each case. In addition, we observed that TrustBase caused no adverse behavior, such as timeouts, crashes, or unexpected errors.

6.3 Android Prototype

To verify that the TrustBase approach works on mobile platforms and is compatible with mobile applications, we built a prototype for Android. Source code can be found at owntrust.org.
Our Android implementation uses the VPNService so that it can be installed on an unaltered OS and without root permissions. The drawback of this choice is that only one VPN service can be active on Android at a time. In the long term, adding socket-level interception to the Android kernel would be the right architectural choice,

⁷ These are a superset of the tools and libraries tested with CertShim
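The three policy-engine cases of Section 6.2 reduce to a small decision rule. The following Python sketch (the names and the sentinel certificate are invented; this is not the actual TrustBase implementation) shows the core idea: the engine never forces a connection open, and it vetoes a connection by handing the application a certificate that cannot validate.

```python
# Sketch of the three-case policy described in Section 6.2 (illustrative only).
INVALID_CERT = b"intentionally-invalid-certificate"

def certificate_to_forward(original_cert, trustbase_valid):
    """Case 1: TrustBase accepts -> forward the original certificate, which
    the application then validates naturally.
    Case 2 is implicit: whatever is forwarded, the application remains free
    to reject it, and the policy engine never overrides that decision.
    Case 3: TrustBase rejects -> forward an intentionally invalid certificate
    so the application's own checks (if any) fail and the connection dies."""
    return original_cert if trustbase_valid else INVALID_CERT

def connection_allowed(trustbase_valid, app_accepts_forwarded):
    """A connection proceeds only if both TrustBase and the application agree."""
    return trustbase_valid and app_accepts_forwarded
```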
vent connections that attempt to utilize weak cipher suites and signing algorithms, using the additional handshake information provided to all plugins.

7 Future Work

TrustBase explores the benefits and drawbacks of providing authentication as an operating system service and giving administrators control over how all authentication decisions on their machines are made. In doing so, a step has been taken toward empowering administrators to control secure connections on their machines. However, some drawbacks have been noted, such as the reliance on a local proxy to support TLS 1.3 interception and CA overriding in some cases on Linux. These issues are caused by applications dictating the security of the machine's connections, using their own (or third-party) security features and keys, reducing operating system and administrator control.
We are currently investigating further steps into this territory to provide greater administrator control of security without some of these drawbacks. One such step is providing TLS as an operating system service, meaning that the operating system provides encryption for applications, not just authentication. Current TLS libraries are a burden on application developers, who are often not security experts. In addition, developers do not necessarily share the same security goals as the vendors or administrators who configure the systems upon which applications run. By providing TLS as an operating system service, application developers are relieved of this burden and the OS can invoke the TrustBase validation API natively. This removes the need for developers to explicitly invoke the validation API, and provides the OS with visibility and control over all TLS data, including TLS 1.3 handshakes, as the OS becomes the de facto TLS client. Such a measure enables system-wide deployment of security measures, such as cipher suite customization, TLS extension deployment, and responses to CVEs. This also allows OS vendors and system administrators an easier upgrade path for TLS versions.
Since network application developers are already familiar with the POSIX socket API, we are working on providing TLS as a protocol type in the socket API, the same way the OS provides TCP and UDP protocols as a service. In contrast to using a userspace library, this approach allows network application developers unfamiliar with security to operate in a well-known environment, utilizes an existing OS API that can be shared by many different platform implementations, and allows strict configuration and control by administrators. By creating a socket using a new IPPROTO_TLS parameter (as opposed to IPPROTO_TCP), developers can use the bind, connect, send, recv, and other socket API calls with which they are already familiar, focusing solely on application data and letting the OS handle all TLS functionality. The generalized setsockopt and getsockopt are available to specify remote hostnames and additional options to the OS TLS service without violating the existing socket API.
Roberto Jordaney⁺, Kumar Sharad∗§, Santanu Kumar Dash∗‡, Zhi Wang∗†, Davide Papini∗•, Ilia Nouretdinov⁺, and Lorenzo Cavallaro⁺
⁺ Royal Holloway, University of London
§ NEC Laboratories Europe
‡ University College London
† Nankai University
• Elettronica S.p.A.
(Figure 1 panels, each plotting Threshold benign against Threshold malicious: (a) elements above the threshold, discarded by p-value and by probability; (b) performance of elements above the threshold, by p-value and by probability; (c) elements below the threshold; (d) performance of elements below the threshold.)
Figure 1: Performance comparison between p-values and probabilities for the objects above and below the threshold used to accept the algorithm's decision. The p-values are given by CE with SVM as the non-conformity measure; the probabilities are given directly by SVM. As the graphs show, p-values tend to contribute to a higher performance of the classifier, identifying those (drifting) objects that would have been erroneously classified.
evaluation is agnostic to the algorithm, making it versatile and compatible with multiple ML algorithms; it can be applied on top of any classification or clustering algorithm that uses a score for prediction.
We note that some algorithms already have built-in quality measures (e.g., the distance of a sample from the hyperplane in SVM). However, these are algorithm-specific and cannot be directly compared with other algorithms. Transcend, on the other hand, unifies such quality measures through uniform treatment of non-conformity in an algorithm-agnostic manner.

2.2 P-values as a Similarity Metric

Conformal evaluation computes a notion of similarity through p-values. For a set of objects K, the p-value p_{z∗}^C for an object z∗ is the proportion of objects in class K that are at least as dissimilar to other objects in C as z∗. There are two standard techniques to compute the p-values from K: Non-Label-Conditional (employed by the decision and alpha assessments outlined in § 3.1 and § 3.2), where K is equal to D, and Label-Conditional (employed by the concept drift detection described in § 3.3), where K is the set of objects C with the same label. The calculations of the non-conformity measures for the test object and for the set of objects in K are shown in Equations 1 and 2, respectively; the computation of the p-value for the test object is shown in Equation 3.
Table 1: Binary classification case study datasets [2].
Table 2: Multiclass classification case study datasets [1].
thresholds should be marked as untrustworthy, as they drift away from the trained model (see § 3.3). We elaborate on this further in § 4.1 and § 4.2 for binary and multiclass classification tasks, respectively.

4.1 Binary Classification Case Study

This section assesses the quality of the predictions of Drebin⁴, the learning algorithm presented in [2]. We reimplemented Drebin and achieved results in line with those reported by Arp et al. in the absence of concept drift (0.95 precision and 0.92 recall, and 0.99 precision and 0.99 recall, for the malicious and benign classes respectively, on hold-out validation with a 66-33% training-testing split of the Drebin dataset, averaged over ten runs).
Figure 2a shows how CE's decision assessment supports these results. In particular, the average algorithm credibility and confidence for the correct choices are 0.5 and 0.9, respectively. This reflects a high prediction quality: correctly classified objects are very different (from a statistical perspective) from the other class (an average p-value of 0.5 as algorithm credibility is expected due to the mathematical properties of the conformal evaluator). Similar reasoning applies to incorrect predictions, which are affected by poor statistical support (average algorithm credibility of 0.2).
Figure 2b shows CE's alpha assessment of Drebin. We plot this assessment as a boxplot to show details of the p-value distribution. The plot shows that the p-value distribution for the wrong predictions (i.e., the second and third columns) is concentrated in the lower part of the scale (less than 0.1), with few outliers; this means that, on average, the p-value of the class which is not the correct one is much lower than the p-value of the correct predictions. Benign samples (third and fourth columns) seem more stable to data variation, as the p-values for the benign and malicious classes are well separated. Conversely, the p-value distribution of malicious samples (first and second columns) is skewed towards the bottom of the plot; this implies that the decision boundary is loosely defined, which may affect the classifier's results in the presence of concept drift. A direct evaluation of the confusion matrix and associated metrics provides the ability to see neither the decision boundaries nor the predictions' (statistical) quality.

4.1.1 Detecting Concept Drift

This section presents a number of experiments to show how Transcend identifies concept drift and correctly marks as untrustworthy the decisions the NCM-based classifier predicts erroneously.
We first show how the performance of the learning model introduced in [2] decays in the presence of concept drift. To this end, we train a model on the Drebin dataset [2] and test it against 9,000 randomly selected malicious and benign Android apps (with an equal split) drawn from the Marvin dataset [14]. The confusion matrix in Table 3a clearly shows how the model is affected by concept drift, as it reports low precision and recall for the positive class representing malicious objects⁵. This is further outlined in Figure 3a, which shows how the p-value distribution of malicious objects is pushed towards low values (poor prediction quality).
Table 3b shows how enforcing cut-off quality thresholds improves the performance of the same learning algorithm. For this experiment, we divided the Drebin dataset into training and calibration sets with a 90-10% split, averaged over 10 rounds. This ensures that each object in the dataset has a p-value. We then asked Transcend to identify suitable quality thresholds (cf. § 3.3) with the aim of maximizing the F1-score as derived from the calibration dataset, subject to a minimum F1-score of 0.99 and a minimum percentage of kept elements of 0.76⁶. It is worth noting that such thresholds are derived from the calibration dataset but are enforced to detect concept drift on a testing dataset. Results show how flagging as unreliable the predictions of testing objects with p-values below the cut-off thresholds improves precision and recall for the positive (malicious) class, from 0.61 to 0.89 and from 0.36 to 0.76, respectively.

⁵ Drebin spans the years 2010–2012, while Marvin covers 2010 to 2014. Most of Drebin's features capture information (e.g., strings and IP addresses) that is likely to change over time, affecting the ability of the classifier to identify non-stationary data.
⁶ In [2], Arp et al. report a TPR of 94% at an FPR of 1%. Such
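The per-class cut-off mechanism used in this experiment can be sketched in a few lines of Python (the thresholds and predictions below are illustrative values, not the paper's):

```python
# Keep a prediction only if the p-value of the predicted class reaches that
# class's threshold; otherwise flag it as untrustworthy (likely drifting).

def filter_predictions(preds, t_malicious, t_benign):
    """preds: list of (predicted_label, p_value) pairs."""
    kept, flagged = [], []
    for label, p in preds:
        threshold = t_malicious if label == "malicious" else t_benign
        (kept if p >= threshold else flagged).append((label, p))
    return kept, flagged

preds = [("malicious", 0.62), ("malicious", 0.04),
         ("benign", 0.55), ("benign", 0.07)]
kept, flagged = filter_predictions(preds, t_malicious=0.3, t_benign=0.2)
# Flagged objects would go to an analyst (or a relabeling/retraining queue)
# instead of being trusted as classifier output.
```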
(Figure 2 boxplot panels: Correct choices / Incorrect choices, and Given label: malicious / Given label: benign; y-axis 0.0–1.0.)
Figure 2: Binary Classification Case Study (Drebin [2]): Decision assessment and Alpha assessment.
Table 3: Binary classification case study ([2]). Table 3a: confusion matrix when the model is trained on Drebin and tested on Marvin. Table 3b: confusion matrix when the model is trained on Drebin and tested on Marvin with p-value-driven threshold filtering. Table 3c: retraining simulation with training samples from Drebin as well as the filtered-out elements of Marvin from Table 3b (2,386 malicious samples and 241 benign) and testing samples coming from another batch of Marvin samples (4,500 malicious and 4,500 benign samples). The fate of the drifting objects is out of the scope of this paper, as that would require solving a number of challenges that arise once concept drift is identified (e.g., randomly sampling untrustworthy samples according to their p-values, relabeling effort depending on available resources, model retraining). We nonetheless report the result of a realistic scenario in which objects drifting from a given model, correctly identified by Transcend, represent important information to retrain the model and increase its performance (assuming proper labeling, as briefly sketched above).
Table 4: Binary classification case study ([2]): examples of thresholds. The results show that increasing the threshold keeps only the samples about which the algorithm is sure. The number of discarded samples depends strongly on the severity of the shift in the dataset; together with the performance of those samples, it clearly shows the advantage of the p-value metric over the probability metric.
Figure 3: Binary Classification Case Study: p-value and probability distribution for true malicious and benign samples
when the model is trained on Drebin dataset and tested on Marvin. Graph (a): p-value distribution for true malicious
samples. Graph (b): p-value distribution of true benign samples. Graph (c): probability distribution of true malicious
samples. Graph (d): probability distribution of true benign samples.
We would like to remark that drifting objects are still given a label as the output of a classifier prediction; Transcend flags such predictions as untrustworthy, de facto limiting the mistakes the classifier would likely make in the presence of concept drift. It is clear that one needs to deal with such objects, eventually. Ideally, they would represent an additional dataset that, once labeled properly, would help retrain the classifier to predict similar objects. This opens a number of challenges that are out of the scope of this work; however, one could still rely on CE's metrics to prioritize the objects that should be labeled (e.g., those with low p-values, as they are the ones that drift the most from the model). This might require randomly sampling drifting objects once enough data is available, as well as understanding how many resources one can rely on for data labeling. It is important to note that Transcend plays a fundamental role in this pipeline: it identifies concept drift (and, thus, untrustworthy predictions), which makes it possible to start reasoning about the open problems outlined above.
The previous paragraphs show the flexibility of the parametric framework we outlined in § 3.3 on an arbitrary yet meaningful example, where statistical cut-off thresholds are identified based on an objective function to optimize, subject to specific constraints. Such goals are, however, driven by business requirements (e.g., TPR vs FPR) and resource availability (e.g., malware analysts available vs the number of likely drifting samples, either benign or malicious, for which we should not trust a classifier decision), so providing a numerical example might be challenging. To better outline the suitability of CE's statistical metrics (p-values) in detecting concept drift, we provide a full comparison between p-values and probabilities as produced by Platt's scaling applied to SVM. We summarize a similar argument (with probabilities derived from decision trees) for multiclass classification tasks in § 4.2.

Comparison with Probability. In the following, we compare the distributions of p-values, as derived from CE, and probabilities, as derived from Platt's scaling for SVM, in the context of [2] under the presence of concept drift (i.e., training on Drebin, testing on Marvin, as outlined). The goal of this comparison is to understand which metric is better suited to identify concept drift.
Figure 3a shows the alpha assessment of the classifications shown in Table 3a. The figure shows the distribution of p-values when the true label of the samples is malicious. Correct predictions (first and second columns) report p-values (first column) that are slightly higher than those corresponding to incorrect ones (second column), with a marginal yet well-marked separation compared to the values they have for the incorrect class (third and fourth columns). Thus, when wrong predictions refer to the benign class, the p-values are low and show a poor fit to both classes. Regardless of the classifier outcome, the p-value for each sample is very low, a likely indication of concept drift.
Figure 3b depicts the distribution of p-values when the true label of the samples is benign. Wrong predictions (first and second columns) report low p-values for both the benign (second column) and malicious (first column) classes. Conversely, correct predictions (third and fourth columns) represent correct decisions (fourth column) and have high p-values, much higher than the p-values of the incorrect class (third column). This is unsurprising, as benign samples have data distributions that do not drift with respect to malicious ones.
A similar reasoning can be followed for Figures 3c and 3d. Contrary to the distribution of p-values, probabilities are constrained to sum to 1.0 across all the classes; what we observe is that probabilities tend to be skewed towards high values even when predictions are wrong. Intuitively, we expect to have poor quality on all the classes of predictions in the presence of a drifting
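The structural difference the comparison exploits can be seen with toy numbers (illustrative only, not measurements from the paper): probabilities must sum to 1 across classes, so even a drifting sample gets one confident-looking score, whereas p-values may be low for every class at once, and that joint lowness is the drift signal.

```python
# A drifting sample: it fits neither class well.
p_values = {"malicious": 0.03, "benign": 0.05}       # low for every class
probabilities = {"malicious": 0.78, "benign": 0.22}  # forced to sum to 1.0

def drifting_by_p_value(pvals, cutoff=0.1):
    # no class reaches the cutoff -> the sample fits nothing we trained on
    return max(pvals.values()) < cutoff

def drifting_by_probability(probs, cutoff=0.5):
    # the winning probability still looks confident, hiding the drift
    return max(probs.values()) < cutoff
```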
(Figure 4 boxplot: p-values, y-axis 0.0–1.0, per malware family.)
Figure 4: Multiclass Classification Case Study: Alpha assessment for the Microsoft classification challenge showing the quality of the decisions taken by the algorithm. Although perfect results are observed in the confusion matrix, the quality of those results varies considerably across different families.
(Heatmap panels, by p-value and by probability, plotted over Threshold malicious and Threshold benign, with both axes divided into quartiles, 1st QT–3rd QT.)
Figure 6: Binary Classification Case Study [2]: complete comparison between the p-value and probability metrics. Across the whole threshold range, p-value-based thresholding provides better performance than probability-based thresholding, discarding the samples that would have been incorrectly classified had they been kept.
Figure 7: Multiclass Classification Case Study [1]: P-value distribution for samples of Tracur family omitted from the
training dataset; as expected, the values are all close to zero.
Figure 8: Multiclass Classification Case Study [1]: probability distribution for samples of the Tracur family omitted from the training dataset. Probabilities are higher than zero and not equally distributed across all the families, making the classification difficult. It is worth noting that some probabilities are skewed towards large values (i.e., greater than 0.5), further hindering a correct classification result.
Figure 9: Multiclass Classification Case Study [1]: a new family is discovered by relying on the p-value distribution
for known malware families. The figure shows the amount of conformity each sample has with its own family; for
each sample, there is only one family with high p-value.
Figure 10: Multiclass Classification Case Study [1]: probability distribution for samples of families included in the
training dataset. High probabilities support the algorithm classification choice.
Figure 6: The left-most U in U3 U2 U1 ∗ + is the top-most-right-most non-terminal in the abstract syntax tree. (The indices are provided for illustrative purposes only.)
the x86 architecture. For every data type (U8, U16, U32 and U64), we define the set of operations as O = {+, −, ∗, /s, /, %s, %, ∧, ∨, ⊕, ≪, ≫, ≫a, −1, ¬, sign_ext, zero_ext, extract, ++, 1}, where the operations are binary addition, subtraction, multiplication, signed/unsigned division, signed/unsigned remainder, bitwise and/or/xor, logical left shift, logical/arithmetic right shift as well as unary minus and complement. The unary operations sign_ext and zero_ext extend smaller data types to signed/unsigned larger data types. Conversely, the unary operator extract transforms larger data types into smaller data types by extracting the respective least significant bits. Since the x86 architecture allows register concatenation (e.g., for division), we employ the binary operator ++ to concatenate two expressions of the same data type. Finally, to synthesize expressions such as increment and decrement, we use the constant 1 as niladic operator. The input set V consists of |V| = n variables, where n represents the number of inputs.

Tree Structure. The sentential form U is the root node of the MCTS tree. Its child nodes are other expressions that are produced by applying the production rules to a single non-terminal symbol of the parent. The expression depth (referred to as layer) is equivalent to the number of derivation steps, as depicted in Figure 5.

Example 3. The root node U is an expression of layer 0. Its children are a, b, U U +, and U U ∗, where a and b are terminal expressions of layer 1. Assuming that the right-most U in an expression is replaced, the children of U U + are U b +, U a +, U U U + +, and U U U ∗ +. To obtain the layer 3 expression b a +, the following derivation steps are applied: U ⇒ U U + ⇒ U a + ⇒ b a +.

To direct the search towards outer expressions, we replace the top-most-right-most non-terminal. If we, in-

Example 4. The expression (U + (U ∗ U)) is represented as U U U ∗ +. If we successively replace the right-most U, the algorithm is unlikely to find expressions such as ((a + b) + (b ∗ (b ∗ a))), since it is directed into the subexpression with the multiplication. Instead, replacing the top-most-right-most non-terminal directs the search to the top-most addition and then explores the subexpressions.

4.3 Random Playout

One of the key concepts of MCTS are random playouts. They are used to determine the outcome of a node; this outcome is represented by a reward. In the first step, we randomly apply production rules to the current node until we obtain a terminal expression. To avoid infinite derivations, we set a maximum playout depth. This maximum playout depth defines how often a non-terminal symbol can be mapped to rules that contain non-terminal symbols; once we have reached the maximum, we map non-terminals only to terminal expressions. This happens in a top-most-right-most manner. Afterwards, we evaluate the expression for all inputs from the I/O samples.

Example 5. Assuming a maximum playout depth of 2 and the expression U U ∗, the first top-most-right-most U is randomly substituted with U U ∗, the second one with U U +. After that, the remaining non-terminal symbols are randomly replaced with variables: U U ∗ ⇒ U U U ∗ ∗ ⇒ U U + U U ∗ ∗ ⇒ · · · ⇒ a a + b a ∗ ∗. A random playout for U U + is a b b + +. For the I/O sample (2, 2) → 4, we evaluate g(2, 2) = 0 for g(a, b) = ((a + a) ∗ (b ∗ a)) mod 2^8 and h(2, 2) = 6 for h(a, b) = (a + (b + b)) mod 2^8.

We set terminal nodes to inactive after their evaluation, since they already are end states of the game; there is no possibility to improve the node's reward by random playouts. As a result, MCTS will not take these nodes
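The playout procedure of Section 4.3 can be approximated in a few lines. This sketch uses a reduced grammar U → a | b | U U + | U U ∗ and, unlike the paper, replaces the leftmost non-terminal in the flat token list rather than the top-most-right-most node of the abstract syntax tree:

```python
# Minimal random-playout sketch over a reduced postfix grammar
# U -> a | b | U U + | U U *  (a simplification of the full operator set).
import random

TERMINALS = ["a", "b"]
NONTERMINAL_RULES = [["U", "U", "+"], ["U", "U", "*"]]

def random_playout(tokens, max_depth, rng):
    depth = 0
    while "U" in tokens:
        i = tokens.index("U")  # leftmost non-terminal (approximation)
        if depth < max_depth:
            rule = rng.choice(NONTERMINAL_RULES + [[t] for t in TERMINALS])
        else:  # depth exhausted: only terminal replacements remain
            rule = [rng.choice(TERMINALS)]
        if rule[0] == "U":
            depth += 1
        tokens = tokens[:i] + rule + tokens[i + 1:]
    return tokens

def evaluate(tokens, a, b, width=8):
    """Evaluate a postfix expression modulo 2**width."""
    stack = []
    for tok in tokens:
        if tok == "a":
            stack.append(a)
        elif tok == "b":
            stack.append(b)
        else:
            y, x = stack.pop(), stack.pop()
            stack.append((x + y if tok == "+" else x * y) % (1 << width))
    return stack[0]

rng = random.Random(0)
expr = random_playout(["U", "U", "+"], max_depth=2, rng=rng)
value = evaluate(expr, a=2, b=2)  # compare against the I/O sample (2, 2) -> 4
```

The reward of a playout would then be derived from how close `value` comes to the sampled output, here 4.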
Collberg et al. [9], which uses the technique to encode integer variables and expressions in which they are used [11]. Further, Tigress also supports common arithmetic encodings to increase an expression's complexity, albeit not based on MBAs [10].

For example, the rather simple expression x + y + z is transformed into the layer 23 expression (((x ⊕ y) + ((x ∧ y) ≪ 1)) ∨ z) + (((x ⊕ y) + ((x ∧ y) ≪ 1)) ∧ z) using its arithmetic encoding option. In a second transformation step, Tigress encodes it into a linear MBA expression of layer 383 (omitted due to complexity). Such expressions are hard to simplify symbolically.

Table 3: Trace window statistics and synthesis performance for Tigress (MBA), VMProtect (VMP), Themida (flavor Tiger White, TM), and ROP gadgets.

                              MBA     VMP     TM    ROP
#trace windows                500  12,577  2,448     78
#unique windows               500     449    106     78
#instructions per window      116      49    258      3
#inputs per window              5       2     15      3
#outputs per window             1       2     10      2
#synthesis tasks              500   1,123  1,092    178
I/O sampling time (s)         110     118     60     17
overall synthesis time (s)  2,020   4,160  9,946    829
synthesis time per task (s)   4.0     3.7    9.1    4.7
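The arithmetic encoding above is built from two standard MBA identities, x + y = (x ⊕ y) + ((x ∧ y) ≪ 1) and p + q = (p ∨ q) + (p ∧ q), which can be checked mechanically:

```python
# Verify that the layer-23 MBA expression of x + y + z shown above is
# semantically equivalent to the plain sum, modulo 2**WIDTH.
WIDTH = 8
MASK = (1 << WIDTH) - 1

def mba(x, y, z):
    inner = ((x ^ y) + ((x & y) << 1)) & MASK  # == x + y  (mod 2**WIDTH)
    return ((inner | z) + (inner & z)) & MASK   # == inner + z

# exhaustive-style spot check over the 8-bit domain
assert all(
    mba(x, y, z) == (x + y + z) & MASK
    for x in range(0, 256, 17)
    for y in range(0, 256, 13)
    for z in range(0, 256, 11)
)
```

This is exactly the kind of semantic (rather than syntactic) equivalence that Syntia recovers by synthesis.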
Evaluation Results. We evaluated our approach to simplify MBA expressions using Syntia. As a testbed, we built a C program which calls 500 randomly generated functions. Each of these random functions takes 5 input variables and returns an expression of layer 3 to 5. Then, we applied the arithmetic encoding provided by Tigress, followed by the linear MBA encoding. The resulting program contained expressions of up to 2,821 layers, the average layer being 156. The arithmetic encoding is applied to highlight that our approach is invariant to the code's increased symbolic complexity and is only concerned with semantical complexity.

Based on a concrete execution trace it can be observed that the 500 functions use, on average, 5 memory inputs (as parameters are passed on the stack) and one register output (the register containing the return value). Table 3 shows statistics for the analysis run using the configuration vector (1.5, 50000, 50, 0). The first two components indicate a strong focus on exploration in favor of exploitation; due to the small number of synthesis tasks, we used 50 I/O samples to obtain more precise results.

The sampling phases completed in less than two minutes. Overall, the 500 synthesis tasks were finished in about 34 minutes, i.e., in 4.0 seconds per expression. We were able to synthesize 448 out of 500 expressions (89.6%). The remaining expressions are not found due to the probabilistic nature of our algorithm; after 4 subsequent runs, we synthesized 489 expressions (97.8%) in total.

To get a better feeling for this probabilistic behavior, we compared the cumulative numbers of synthesized MBA expressions for 10 subsequent runs. Figure 7 shows the results averaged over 15 separate experiments. On average, the first run synthesizes 89.6% (448 expressions) of the 500 expressions. A second run yields 22 new expressions (94.0%), while a third run reveals 10 more expressions (96.0%). While converging to 500, the number of newly synthesized expressions decreases in subsequent runs. Comparing the fifth and the eighth run, we only found 5 new expressions (from 489 to 494). After the ninth run, Syntia synthesized 495 (99.0%) of the MBA expressions.

6.3 VM Instruction Handler

As introduced in Section 2.1.1, an instruction handler of a Virtual Machine implements the effects of an atomic instruction according to the custom VM-ISA. It operates on the VM context and can perform arbitrarily complex tasks. As handlers are heavily obfuscated, manual analysis of a handler's semantics is a time-consuming task.

Attacking VMs. When faced with virtualization-based obfuscations, an attacker typically has two options. For one, she can analyze the interpreter and, for each handler, extract all information required to re-translate the bytecode back to native instructions. Especially in face of handler duplication and bytecode blinding, this requires
Deobfuscation. Rolles provides an academic analysis of a VM-based obfuscator and outlines a possible attack on such schemes in general [44]. He proposes using static analysis to re-translate the VM's bytecode back into native instructions. This, however, requires minute analysis of each obfuscator and hence is time-consuming and prone to minor modifications of the scheme. Kinder is

9 Conclusion

With our prototype implementation of Syntia we have shown that program synthesis can aid in deobfuscation of real-world obfuscated code. In general, our approach is vastly different in nature compared to proposed deobfuscation techniques and hence may succeed in scenarios where approaches requiring precise code semantics fail.
We thank the reviewers for their valuable feedback. This work was supported by the German Research Foundation (DFG) research training group UbiCrypt (GRK 1817) and by ERC Starting Grant No. 640110 (BASTION).

References

[1] Andriesse, D., Bos, H., and Slowinska, A. Parallax: Implicit Code Integrity Verification using Return-Oriented Programming. In Conference on Dependable Systems and Networks (DSN) (2015).
[2] Banescu, S., Collberg, C., Ganesh, V., Newsham, Z., and Pretschner, A. Code Obfuscation against Symbolic Execution Attacks. In Annual Computer Security Applications Conference (ACSAC) (2016).
[3] Bansal, S., and Aiken, A. Automatic Generation of Peephole Superoptimizers. In ACM SIGPLAN Notices (2006).
[4] Bell, J. R. Threaded Code. Communications of the ACM (1973).
[5] Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games (2012).
[6] Cavallaro, L., Saxena, P., and Sekar, R. Anti-Taint-Analysis: Practical Evasion Techniques against Information Flow based Malware Defense. Secure Systems Lab at Stony Brook University, Tech. Rep. (2007).
[7] Cazenave, T. Monte Carlo Beam Search. IEEE Transactions on Computational Intelligence and AI in Games (2012).
[8] Chaslot, G. Monte-Carlo Tree Search. PhD thesis, Universiteit Maastricht, 2010.
[9] Collberg, C., Martin, S., Myers, J., and Nagra, J. Distributed Application Tamper Detection via Continuous Software Updates. In Annual Computer Security Applications Conference (ACSAC) (2012).
[10] Collberg, C., Martin, S., Myers, J., and Zimmerman, B. Documentation for Arithmetic Encodings in Tigress. http://tigress.cs.arizona.edu/transformPage/docs/encodeArithmetic.
[11] Collberg, C., Martin, S., Myers, J., and Zimmerman, B. Documentation for Data Encodings in Tigress. http://tigress.cs.arizona.edu/transformPage/docs/encodeData.
[12] Collberg, C., Thomborson, C., and Low, D. Manufacturing Cheap, Resilient, and Stealthy Opaque Constructs. In ACM Symposium on Principles of Programming Languages (POPL) (1998).
[13] Coogan, K., Lu, G., and Debray, S. Deobfuscation of Virtualization-obfuscated Software: A Semantics-Based Approach. In ACM Conference on Computer and Communications Security (CCS) (2011).
[14] Eyrolles, N. Obfuscation with Mixed Boolean-Arithmetic Expressions: Reconstruction, Analysis and Simplification Tools. PhD thesis, Université de Versailles Saint-Quentin-en-Yvelines, 2017.
[15] Eyrolles, N., Goubin, L., and Videau, M. Defeating MBA-based Obfuscation. In ACM Workshop on Software PROtection (SPRO) (2016).
[16] Finnsson, H. Generalized Monte-Carlo Tree Search Extensions for General Game Playing. In AAAI Conference on Artificial Intelligence (2012).
[17] Gelly, S., et al. The Grand Challenge of Computer Go: Monte Carlo Tree Search and Extensions. Communications of the ACM (2012).
[18] Godefroid, P., and Taly, A. Automated Synthesis of Symbolic Instruction Encodings from I/O Samples. In ACM SIGPLAN Notices (2012).
[19] Graziano, M., Balzarotti, D., and Zidouemba, A. ROPMEMU: A Framework for the Analysis of Complex Code-Reuse Attacks. In ACM Symposium on Information, Computer and Communications Security (ASIACCS) (2016).
[20] Guinet, A., Eyrolles, N., and Videau, M. Arybo: Manipulation, Canonicalization and Identification of Mixed Boolean-Arithmetic Symbolic Expressions. In GreHack Conference (2016).
[21] Gulwani, S. Dimensions in Program Synthesis. In Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (2010).
[22] Gulwani, S., Jha, S., Tiwari, A., and Venkatesan, R. Synthesis of Loop-free Programs. ACM SIGPLAN Notices (2011).
[23] Heule, S., Schkufza, E., Sharma, R., and Aiken, A. Stratified Synthesis: Automatically Learning the x86-64 Instruction Set. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) (2016).
[24] Jha, S., Gulwani, S., Seshia, S. A., and Tiwari, A. Oracle-guided Component-based Program Synthesis. In ACM/IEEE 32nd International Conference on Software Engineering (2010).
[25] Kim, D.-W., Kim, K.-H., Jang, W., and Chen, F. F. Unrelated Parallel Machine Scheduling with Setup Times using Simulated Annealing. Robotics and Computer-Integrated Manufacturing (2002).
[26] Kinder, J. Towards Static Analysis of Virtualization-Obfuscated Binaries. In IEEE Working Conference on Reverse Engineering (WCRE) (2012).
[27] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by Simulated Annealing. Science (1983).
[28] Klint, P. Interpretation Techniques. Software, Practice and Experience (1981).
[29] Kocsis, L., and Szepesvári, C. Bandit based Monte-Carlo Planning. In European Conference on Machine Learning (2006).
[30] Krahmer, S. x86-64 Buffer Overflow Exploits and the Borrowed Code Chunks Exploitation Technique, 2005.
[31] Liberatore, P. The Complexity of Checking Redundancy of CNF Propositional Formulae. In International Conference on Agents and Artificial Intelligence (2002).
[32] Lim, J., and Yoo, S. Field Report: Applying Monte Carlo Tree Search for Program Synthesis. In International Symposium on Search Based Software Engineering (2016).
[33] Lu, K., Xiong, S., and Gao, D. RopSteg: Program Steganography with Return Oriented Programming. In ACM Conference on Data and Application Security and Privacy (CODASPY) (2014).
[34] Ma, H., Lu, K., Ma, X., Zhang, H., Jia, C., and Gao, D. Software Watermarking using Return-Oriented Programming. In ACM Symposium on Information, Computer and Communications Security (ASIACCS) (2015).
[35] Marc, Sebag, M., Silver, D., Szepesvári, C., and Teytaud, O. Nested Monte-Carlo Search. Communications of the ACM (2012).
[36] Microsoft Research. The Z3 Theorem Prover. https://github.com/Z3Prover/z3.
3 Approach

Resilience is defined as a function of deobfuscator effort^1 and programmer effort (i.e. the time spent building the deobfuscator) [15]. However, in many cases we can consider the effort needed to build the deobfuscator to be negligible, because an attacker needs to invest the effort to build a deobfuscator only once and can then reuse it or share it with others.

Our general approach is illustrated as a work-flow in Figure 1, where ovals depict inputs and outputs of the software tools, which are represented by rectangles. The work-flow requires a dataset of original (un-obfuscated) programs to be able to start (step 0 in Figure 1). To generate these programs we have developed a C code generator presented in Section 3.1. Afterwards, an obfuscation tool is used to generate multiple obfuscated (protected) versions of each of the original programs (step 1 in Figure 1). Subsequently, an implementation of a deobfuscation attack (e.g. control-flow simplification [49], secret extraction [5], etc.) is executed on all of the obfuscated programs, and the time needed to successfully complete the attack for each of the obfuscated programs is recorded (step 2 in Figure 1). In parallel, feature values (e.g. source code metrics) are extracted from the obfuscated programs.

Figure 1: General attack time prediction framework.

Once the attack times are recorded and software features are extracted from all programs, one could directly use this information to build a regression model for predicting the time needed for deobfuscation. However, some features could be irrelevant to the deobfuscation attack and/or they could be expensive to compute. Moreover, for most regression algorithms the resource usage during the training phase grows linearly or even exponentially with the number of different features used as predictors. Therefore, we add an extra step to our approach, namely a Feature Selection Algorithm, which selects only the subset of features which are most relevant to the attack (step 3 in Figure 1). Feature selection can be performed in many ways. Section 3.2 briefly describes how we approached feature selection. After the relevant features are selected, the framework uses this subset of features to build a regression model via a machine learning algorithm (step 4 in Figure 1).

Note that the proposed approach is not limited to obfuscation and deobfuscation. One can substitute the obfuscation tool in Figure 1 with any kind of software protection mechanism (e.g. code layout randomization [38]) and the deobfuscation tool by any known attack implementation corresponding to that software protection mechanism (e.g. ROPeme [27]). This way the set of relevant features and the output prediction model will estimate the strength of the chosen protection mechanism against the chosen attack implementation.

3.1 C Program Generator

One important challenge of the proposed approach is obtaining a dataset of unobfuscated (original) programs for the input to the framework. This dataset should be large enough to serve as a training set for the regression model in the last step of the framework, because the quality of the model depends on the training set. Ideally, we would have access to a large corpus of open source programs that contain a security check (such as a license check) that needed to be protected against discovery and tampering, as presented in [4]. For example, we could select a collection of such programs from popular code sharing sites such as GitHub. Unfortunately, open source programs tend not to contain the sorts of security checks required by our study. To mitigate this we could manu-

^1 In this paper we quantify deobfuscator effort via the time it takes to run a successful attack on a certain hardware platform; however, note that we could easily map time to CPU cycles, to provide a hardware-independent measure of attack effort.
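Step 4 of the framework described above fits a regression model from software features to measured attack time. A minimal single-feature least-squares sketch (the data below is hypothetical, for illustration only; the paper uses richer models and real measurements):

```python
# Sketch of the framework's final step: ordinary least squares from one
# selected code metric to attack time. Data is hypothetical.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # (slope, intercept)

# hypothetical (feature value, attack time in seconds) pairs
cyclomatic = [4, 8, 15, 23, 42]
attack_time = [1.9, 4.1, 7.4, 11.6, 21.0]
slope, intercept = fit_line(cyclomatic, attack_time)

def predict(feature_value):
    """Predicted attack time for an unseen program's feature value."""
    return slope * feature_value + intercept
```

In practice the model is trained on many features at once, which is why the feature-selection step beforehand matters for cost and accuracy.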
Table 3: Overview of un-obfuscated randomly generated programs.

Table 4: Overview of un-obfuscated simple hash programs.
these 4608 programs vary in size and complexity, as was intended, in order to capture a representative range of license checking algorithms.

To increase the number of programs in this set, we generated 275 different variants for each of the non-cryptographic hashes using combinations of multiple obfuscation transformations. The point which we aim to show here is that even if we add a small heterogeneous subset to our larger homogeneous set of programs, the smaller subset is going to be predicted with the same accuracy as the programs from the larger set. Table 4 shows the minimum, median, average and maximum values of various code metrics of only the original (un-obfuscated) non-cryptographic hash functions, as computed by the UCC tool, and the total number of lines of code (LOC). Each metric was computed on the entire C file of each program, which includes the hash function and the main function, but no comment lines or empty lines.

4.1.2 Obfuscation Tool

We have used five obfuscating transformations offered by Tigress [13], in order to generate five obfuscated versions of each of the 4608 programs generated by our code generator and the 11 non-cryptographic hash functions. The obfuscating transformations we have used are:

• Opaque predicates: introduce branch conditions in the original code, which are either always true or always false for any possible program input. However, their truth value is difficult to learn statically.

• Literal encoding: replaces integer/string constants by code that generates their value dynamically.

• Arithmetic encoding: replaces integer arithmetic with more complex expressions, equivalent to the original ones.

• Flattening: replaces the entire control-flow structure by a flat structure of basic blocks, such that it is unclear which basic block follows which.

• Virtualization: replaces the entire code with bytecode that has the same functional semantics and an emulator which is able to interpret the bytecode.

We obfuscated each of the generated programs using these transformations with all the default settings (except for opaque predicates, where we set the number of inserted predicates to 16), and we obtained 5 × 4608 = 23040 obfuscated programs^4. We obfuscated each of the non-cryptographic hash functions with every possible pair of these 5 obfuscation transformations and obtained 25 × 11 = 275 obfuscated programs.

Table 5 and Table 6 show the minimum, median, average and maximum values of various code metrics of the obfuscated set of randomly generated programs, respectively the obfuscated programs involving simple hash functions, as computed by the UCC tool. Each metric was computed on the entire C file of each program, which includes the randomly generated function, the main function and other functions generated by the obfuscating transformation which is applied. For instance, the encode literals transformation generates another function which dynamically computes the values of constants in the code using a switch statement with a branch for each constant. Due to this reason we also notice that after applying the encode literals transformation to a program, its average cyclomatic complexity (CC) is slightly reduced, because this function has CC=1 and it is averaged with two other functions with higher CCs. Comparing the numbers in these two tables with those from Tables 3 and 4, it is important to note that the size and complexity of the obfuscated programs have increased by one order of magnitude on average, w.r.t. un-obfuscated programs.

4.1.3 Deobfuscation Tool

Since all of the original programs print a distinctive message (i.e. "You win!") when a particular input value is entered, we can define the deobfuscation attack goal as: finding an input value that leads the obfuscated program to output "You win!", without tampering with the

^4 Tigress transforms have multiple options that affect the generated code, and make it more or less amenable to analysis. For this study, we avoid transformations and options that would generate obfuscated code not analyzable by KLEE, which we use as our deobfuscation tool.
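As an illustration of the first transformation above: a classic always-true opaque predicate exploits the fact that x(x+1) is always even. Tigress's generated predicates are more involved; this sketch only demonstrates the concept:

```python
# A conceptual opaque predicate: the guard below is TRUE for every
# integer x, yet that is not obvious from the branch condition alone,
# so a static analyzer conservatively keeps the dead "else" path alive.
def opaquely_guarded(x, real_code):
    if (x * (x + 1)) % 2 == 0:  # opaquely true: x*(x+1) is always even
        return real_code(x)      # the only path ever taken
    return None                  # dead code, never reached

# the predicate holds across the whole tested range
assert all((x * (x + 1)) % 2 == 0 for x in range(-1000, 1000))
```

Symbolic execution, in contrast, can often prove the second path infeasible, which is one reason it is the attack of choice in Section 4.1.3.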
Table 6: Overview of obfuscated simple hash programs.

program. As presented in [4], this deobfuscation goal is equivalent to finding a hidden secret key and can be achieved by employing an automated test case generator. A state of the art approach for test case generation is called dynamic symbolic execution (often it is simply called symbolic execution). Such an approach is implemented by several free and open source software tools such as KLEE [10], angr [44], etc. The first step of symbolic execution is to mark a subset of program data (e.g. variables) as symbolic, which means that it can take any value in the range of its type. Afterwards, the program is interpreted and whenever a symbolic value is involved in an instruction, its range is constrained accordingly. Whenever a branch based on a symbolic value is encountered, symbolic execution forks the state of the program into two different states corresponding to each of the two possible truth values of the branch. The ranges of the symbolic variable in these two forked states are disjoint. This leads to different constraints on symbolic variables for different program paths. The symbolic execution engine sends these constraints to an SMT solver, which tries to find a concrete set of values (for the symbolic variables) which satisfies the constraints. Giving the output of the SMT solver as input to the program will lead the execution to the path corresponding to that constraint.

Since we have the C source code for the obfuscated programs, we chose to use KLEE as a test case generator in this study. We ran KLEE with a symbolic argument length of 5 characters, on all of the un-obfuscated and obfuscated programs generated by our code gen-

4.1.4 Software Feature Extraction Tools

Many papers [32, 43, 1, 4] suggest that the complexity of branch conditions is a program characteristic with high impact on symbolic execution. However, these papers do not clearly indicate how this complexity should be measured. One way to do this is by first converting the C program into a boolean satisfiability problem (SAT instance), and then extracting features from this SAT instance. There are several tools that can convert a C program into a SAT instance, e.g. the C bounded model checker (CBMC) [12] or the low-level bounded model checker (LLBMC) [31], etc. However, the drawback of these tools is that the generated SAT instances may be as large as 1 GB even for programs containing under 1000 lines of code, because they are not optimized. Hence, for our dataset, the generated SAT instances would require somewhere in the order of 10 TB of data and several weeks of computational power, which is prohibitively expensive.

Instead, we took a faster alternative approach for obtaining an optimized SAT instance from a C program, which we describe next. KLEE generates a satisfiability modulo theories (SMT) instance for each execution path of the C program. We selected the SMT instance corresponding to the difficult execution path that prints out the distinctive message on standard output^5. These SMT

^5 Note that it is not necessary to execute KLEE to obtain the SMT instance corresponding to the difficult execution path. The developer knows the correct license key, therefore s/he can give the correct license key and record the instruction trace of the execution. Afterwards, the developer can substitute the constant input argument in the trace with a symbolic input and then extract the path constraint by combining all expressions in the trace. This path constraint does not need to be solved by the SMT solver, but simply converted to SAT.
In total we have 64 features, out of which 49 are SAT features which characterize the complexity of the constraints on symbolic variables and 15 are program features which characterize the structure and size of the code. In the following we show that not all of these features are needed for good prediction results.

4.2.1 First approach: Pearson Correlation

After employing the algorithm described in Section 3.2.1, we were left with a set of 25 features, with their Pearson correlation coefficients ranging from

^6 https://www.r-project.org/
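A Pearson-based filter of this kind can be sketched as follows; the feature values and the threshold below are illustrative, not the paper's actual data or algorithm:

```python
# Rank features by the absolute Pearson correlation between the feature
# and the measured attack time, keeping those above a threshold.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

attack_time = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical measurements
features = {
    "loc":     [10, 21, 29, 41, 50],         # strongly correlated
    "num_ifs": [3, 3, 9, 2, 7],              # weakly correlated
}
THRESHOLD = 0.4
selected = {name: r for name, vals in features.items()
            if abs(r := pearson(vals, attack_time)) >= THRESHOLD}
```

Only the strongly correlated feature survives the filter, which is the intended effect of step 3 of the framework.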
Figure 5: Top 15 features via first approach.

Figure 6: Top 15 features via second approach.

Figure 8: Graph representation of SAT instance corresponding to a program whose symbolic execution time is under 1 second.

• RandomFunsOperators was set to Shiftlt, Shiftrt, Lt, Gt, Le, Ge, Eq, Ne, BAnd, BOr and BXor.
Figure 11: Relative prediction error of RF model.

4.3.1 Random Forests (RFs)

Figure 14: SVM models with different feature sets.

Figure 16: Relative prediction error of NN model.

4.3.4 Neural Networks (NNs)

Multi-layer neural networks (NNs) were introduced by

Table 7: The NRMSE between model prediction and ground truth (average over NRMSE of 10 models).
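Table 7 reports the normalized root-mean-square error (NRMSE). One common definition, assumed here since the excerpt does not spell it out, normalizes the RMSE by the range of the ground-truth values:

```python
# NRMSE under the range-normalization convention (an assumption; other
# conventions normalize by the mean instead).
import math

def nrmse(predicted, actual):
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))
    return rmse / (max(actual) - min(actual))

# toy attack times (seconds): ground truth vs. model predictions
truth = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 39.0]
error = nrmse(pred, truth)  # approximately 0.07
```

Range normalization makes the errors of models trained on differently scaled targets comparable, which is what averaging over 10 models in Table 7 requires.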
The authors would like to thank the anonymous reviewers, Prof. Vijay Ganesh and Zack Newsham for the feedback regarding the SATGraf tool and tips for converting SMT instances into SAT instances. This work was partially supported by NSF grants 1525820 and 1318955.

7 Availability

Our code generator is part of the Tigress C Diversifier/Obfuscator tool. Binaries are freely available at:

http://tigress.cs.arizona.edu/transformPage/docs/randomFuns

Source code is available to researchers on request. Our dataset of original (unobfuscated) programs, as well as all scripts and auxiliary software used to run our experiments, are available at:

https://github.com/tum-i22/obfuscation-benchmarks/

References

[1] Anand, S., Burke, E. K., Chen, T. Y., Clark, J., Cohen, M. B., Grieskamp, W., Harman, M., Harrold, M. J., McMinn, P., et al. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software 86, 8 (2013), 1978–2001.
[2] Anckaert, B., Madou, M., De Sutter, B., De Bus, B., De Bosschere, K., and Preneel, B. Program obfuscation: a quantitative approach. In Proc. of the ACM Workshop on Quality of Protection (2007), ACM, pp. 15–20.
[3] Apon, D., Huang, Y., Katz, J., and Malozemoff, A. J. Implementing cryptographic program obfuscation. IACR Cryptology ePrint Archive 2014 (2014), 779.
[4] Banescu, S., Collberg, C., Ganesh, V., Newsham, Z., and Pretschner, A. Code obfuscation against symbolic execution attacks. In Proc. of 2016 Annual Computer Security Applications Conference (2016), ACM.
[5] Banescu, S., Ochoa, M., and Pretschner, A. A framework for measuring software obfuscation resilience against automated attacks. In 1st International Workshop on Software Protection (2015), IEEE, pp. 45–51.
[6] Banescu, S., Ochoa, M., Pretschner, A., and Kunze, N. Benchmarking indistinguishability obfuscation - a candidate implementation. In Proc. of 7th International Symposium on ESSoS (2015), no. 8978 in LNCS.
[7] Barak, B. Hopes, fears, and software obfuscation. Communications of the ACM 59, 3 (2016), 88–96.
[8] Barlak, E. S. Feature selection using genetic algorithms.
[9] Breiman, L. Random forests. Machine Learning 45, 1 (2001), 5–32.
[10] Cadar, C., Dunbar, D., and Engler, D. R. KLEE: Unassisted and automatic generation of high-coverage tests for com-
[11] … the quality of Java code. Empirical Software Engineering 20, 6 (2015), 1486–1524.
[12] Clarke, E., Kroening, D., and Lerda, F. A tool for checking ANSI-C programs. In Tools and Algorithms for the Construction and Analysis of Systems (2004), vol. 2988 of LNCS, Springer, pp. 168–176.
[13] Collberg, C., Martin, S., Myers, J., and Nagra, J. Distributed application tamper detection via continuous software updates. In Proc. of the 28th Annual Computer Security Applications Conference (New York, NY, USA, 2012), ACM, pp. 319–328.
[14] Collberg, C., and Nagra, J. Surreptitious Software. Upper Saddle River, NJ: Addison-Wesley Professional (2010).
[15] Collberg, C., Thomborson, C., and Low, D. A taxonomy of obfuscating transformations. Tech. rep., Department of Computer Science, The University of Auckland, New Zealand, 1997.
[16] Cortes, C., and Vapnik, V. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[17] Dalla Preda, M. Code obfuscation and malware detection by abstract interpretation. PhD thesis, University of Verona, 2007.
[18] De Moura, L., and Bjørner, N. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (2008), Springer, pp. 337–340.
[19] Ganesh, V., and Dill, D. L. A decision procedure for bit-vectors and arrays. In International Conference on Computer Aided Verification (2007), Springer, pp. 519–531.
[20] Gent, I. P., MacIntyre, E., Prosser, P., Walsh, T., et al. The constrainedness of search. In AAAI/IAAI, Vol. 1 (1996), pp. 246–252.
[21] Granitto, P. M., Furlanello, C., Biasioli, F., and Gasperi, F. Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems 83, 2 (2006), 83–90.
[22] Hall, M. A. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.
[23] Heffner, K., and Collberg, C. The obfuscation executive. In International Conference on Information Security (2004), Springer, pp. 428–440.
[24] Holland, J. H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. U Michigan Press, 1975.
[25] Kanzaki, Y., Monden, A., and Collberg, C. Code artificiality: a metric for the code stealth based on an n-gram model. In Proc. of the 1st International Workshop on Software Protection (2015), IEEE Press, pp. 31–37.
[26] Karnick, M., MacBride, J., McGinnis, S., Tang, Y., and Ramachandran, R. A qualitative analysis of Java obfuscation. In Proceedings of 10th IASTED International Conference on Software Engineering and Applications, Dallas TX, USA (2006).
[27] Le, L. Payload already inside: data re-use for ROP exploits. Black Hat USA (2010).
[28] Li, C.-M., and Ye, B. SAT-encoding of step-reduced MD5. SAT COMPETITION 2014, 94–95.
[29] Lin, S.-W., Lee, Z.-J., Chen, S.-C., and Tseng, T.-Y. Parameter determination of support vector machine and feature selection using simulated annealing approach. Applied Soft Comput-
plex systems programs. In OSDI (2008). ing 8, 4 (2008), 1505–1512.
Chrome extension developers will require minimal effort.

Besides the general web APIs, a special extension API will provide deeper integration with the browser, making it possible to access features such as tab and window manipulation. The manifest will be named manifest.json and will use the same JSON-formatted structure and general properties of the Chromium implementation. The URL to access the extension resources follows the ms-browser-extension://[extID]/[path] schema.

As the design is in its preliminary stages and is not yet fully working, we do not include it in our analysis.

2.2 URI Randomization

As Safari was one of the last major browsers to adopt extensions, its developers implemented resource access control from the beginning to avoid enumeration or exploitation of vulnerabilities of installed extensions. Instead of relying on settings included in a manifest file like all the other major browsers, Apple developers adopted a URI randomization approach. In this solution there is no distinction between private and public resources; instead, the base URI of the extension is randomly re-generated in each session.

Safari extensions are coded using a combination of HTML, CSS, and JavaScript. To interact with the web browser and the page content, a JavaScript API is provided, and each extension runs within its own "sandbox" [1]. To develop an extension, a developer has to provide: (i) the global HTML page code, (ii) the content (HTML, CSS, JavaScript, media), (iii) the menu items (labels and images), (iv) the code of the injected scripts, (v) the stylesheets, and (vi) the required icons.

These components are grouped into two categories: the first including the global page and the menu items, and the second including the content, the injected scripts, and the stylesheets. This second group cannot access any resource within the extension folder using relative URLs as the first group does. Instead, these extension components are required to use JavaScript to access the randomized URI that changes each time Safari is launched. Absolute URIs are stored in the safari.extension.baseURI field, as shown in Figure 3.

Figure 4: Resource accessibility control schema. (Workflow: the browser first checks whether the extension is installed; if not, the request fails after x time (case A). It then checks whether the requested path is accessible; a failure here takes x+y time (case B).)

3 Security Analysis

In the previous section we presented the two complementary approaches adopted by all major browser families to protect the access to extension resources. The first solution relies on a public resource URI, whose access is protected by centralized code in the browser according to settings specified by the extension developers in a manifest file. The second solution replaces the centralized check by randomizing the base URI at every execution; in this case, the extension needs to access its own resources by using a dedicated JavaScript API.

While their designs are completely different, both solutions provide the same security guarantees, preventing an attacker from enumerating the installed extensions and accessing their resources. We now examine those two approaches in more detail and discuss two severe limitations that often undermine their security. It is important to note that these attacks can also be used on any type of device with a browser with extension capability, such as smartphones or smart TVs.

3.1 Timing Side-Channel on Access Control Settings Validation

As already mentioned, the vast majority of browsers adopt a centralized mechanism to prevent third parties from accessing any resource of the extensions that has not been explicitly marked as public. Therefore, when a website tries to load a resource not present in the list of accessible resources, the browser will block the request. Despite the fact that, from a design point of view, this solution may seem secure, we discovered that all its implementations suffer from a serious problem deriving from the fact that these browsers are required to perform two different checks: (i) verify whether a certain extension is installed, and (ii) access its control settings to identify whether the requested resource is publicly available (see Figure 4 for a simple logic workflow of the enforcement process). When this two-step validation is not properly implemented, it is prone to a timing side-channel attack that an adversary can use to identify the actual reason behind a request denial: the extension is not present, or its resources are kept private. To this end, we used the User Timing API,1 implemented in every major browser, in order to measure the performance of web applications.

As an example, an attacker can write a few lines of JavaScript to measure the response time when invoking a fake extension (refer to case A in Figure 4). For instance, in Chromium the requested URI could look like this:

chrome-extension://[fakeExtID]/[fakePath]

Then, the attacker can generate a second request to measure the response time when requesting an extension that actually exists, but using a non-existent resource path (case B in Figure 4):

chrome-extension://[realExtID]/[fakePath]

By comparing the two timestamps, the attacker can easily determine whether an extension is installed in the browser. Similar response times mean that the central validation code followed the same execution path on the two requests and, therefore, the extension is not installed in the browser. Otherwise, significantly different execution times mean that only the second test failed and, therefore, that the requested extension is present in the browser.

We performed an experiment in order to empirically tune the time difference threshold and the number of requests required to ensure the correctness of our attack. In particular, the following configuration was used:

• We configured 5 different CPU usages: 0%, 25%, 50%, 75%, and 100%. The experiment was executed on a commodity computer with a 2.4 GHz Intel Core 2 Duo and 4 GB RAM.

• The attack was configured to be repeated from 1 to 10 iterations. Note that each iteration performs two calls to the browser: one that asks for the fake extension and one that asks for the actual extension with a fake path.

• We repeated each attack test 500 times to avoid any bias. In this way, we performed 2 calls × 10 iteration configurations × 500 repetitions × 5 CPU usages, resulting in a total number of 275,000 calls.

We observed that, when the execution paths were different, the response times differed by more than 5%. It is important to remark that our method exploits the proportional timing difference between two different calls rather than using a pre-computed time for a specific device. Figure 5 shows the precision across different CPU loads and different numbers of iterations. Five iterations were sufficient to achieve a 100% success rate even under 100% CPU usage.

Affected Browsers

We tested our timing side-channel attack on the two browser families (Chromium-based and Firefox-based) that use extension access control settings.

1 https://www.w3.org/TR/user-timing/
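The measurement described above can be written in a few lines of page JavaScript. The snippet below is an illustrative sketch, not the authors' code: the extension IDs are hypothetical placeholders, and the 5% threshold mirrors the proportional difference reported in the experiment.

```javascript
// Measure how long the browser takes to reject a resource request.
async function probe(url) {
  const t0 = performance.now();              // User Timing / HR Time APIs
  try { await fetch(url); } catch (e) { /* the request is expected to fail */ }
  return performance.now() - t0;
}

// Similar times: same rejection path, so the extension is absent (case A twice).
// A proportional gap above the threshold: the path check also ran, so present.
function extensionInstalled(tFakeExt, tRealExt, threshold = 0.05) {
  const diff = Math.abs(tRealExt - tFakeExt) / Math.max(tFakeExt, tRealExt);
  return diff > threshold;
}

// Probe one (hypothetical) extension ID using a non-existent resource path.
async function isInstalled(extId) {
  const tFake = await probe('chrome-extension://aaaabbbbccccddddeeeeffffgggghhhh/x');
  const tReal = await probe(`chrome-extension://${extId}/fake.png`);
  return extensionInstalled(tFake, tReal);
}
```

In practice the probes would be repeated (five iterations sufficed in the experiment above) and the proportional difference averaged before deciding.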
sented the highest entropy of the analyzed fingerprinting attributes, making them more precise than using the list of fonts or canvas-based techniques.

5 Vulnerability Disclosure and Countermeasures

5.1 Attack Coverage & Effects

In this paper we presented two different classes of attacks against the resource control policies adopted by all families of browsers on the market. Table 5 summarizes the overall impact of our methods.

As already mentioned, the coverage of our enumeration attack is complete in the case of the timing side-channel attack against access-control-based browser families (i.e., the Chromium and Firefox families), while it is approximately 40% in URI randomization browsers (Safari).

Effects of Private Mode

"Incognito" or private mode is present in most modern browsers and it protects and restricts several accesses to browser resources such as cookies or browsing history. Therefore, we decided to analyze whether our attacks can enumerate extensions even when this mode is activated.

We discovered that all of our attacks accurately identified the list of installed extensions also within private mode. This is due to several reasons. In the case of Chromium-family browsers, the browser checks for extensions in incognito mode, even though extensions are not allowed to access the websites [9]. Firefox and Safari did not include any check for extensions and, therefore, both websites and extensions are able to access each other.

    const Extension* extension =
        RendererExtensionRegistry::Get()->GetExtensionOrAppByURL(resource_url);
    if (!extension) {
      return true;
    }

Figure 10: Snippet of the Resource Request Policy function of Chromium that causes the difference between existing and non-existing extensions (see Appendix for the full code).

5.2 Timing Side-Channel Attack

The first class of attacks is the consequence of a poor implementation of the browser access control settings: the Firefox family's handling of extension requests can be exploited to recognize the reason behind a failed resolution, and the Chromium family's timing side channel allows an attacker to precisely tell apart the two individual checks performed by the browser engine.

The consequence, in both cases, is a perfect technique to enumerate all the extensions installed by the user. Given the open-source nature of these two browsers, we manually identified the functions responsible for the problem and indicated how to fix each of them.

Chromium family

We contacted the Chromium team to report the timing problem. The developers were quite surprised by the attack, because they believed that the time differences in
Content Security Policy (CSP) is a W3C standard introduced to prevent and mitigate the impact of content injection vulnerabilities on websites [11]. It is currently supported by all modern commercial web browsers and deployed on a number of popular websites, which justifies the recently growing interest of the research community [20, 5, 13, 1, 18, 10].

A content security policy is a list of directives supplied in the HTTP headers of a web page, specifying browser-enforced restrictions on content inclusion. Roughly, the directives associate different content types with lists of sources (web origins) from which the CSP-protected web page can load contents of that specific type.

2. white-lists are hard to get right. On the one hand, if a white-list is too liberal, it can open the way to security breaches by allowing communication with JSONP endpoints or the inclusion of libraries for symbolic execution, which would enable the injection of arbitrary malicious scripts [18]. On the other hand, if a white-list is too restrictive, it can break the intended functionality of the protected web page [1]. The right equilibrium is difficult to achieve, most notably because it is hard for policy writers to predict what needs to be included by the active contents (scripts, stylesheets, etc.) loaded by their web pages;
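To make the directive/source association concrete, a minimal policy of this shape could look as follows (the CDN origin is a hypothetical placeholder):

```http
Content-Security-Policy: default-src 'none'; script-src 'self' https://cdn.example.com; img-src *
```

Such a policy lets the page run scripts only from its own origin and from cdn.example.com, load images from anywhere, and blocks every other content type.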
6 Implementation and Testing

6.1 Prototype Implementation

We developed a proof-of-concept implementation of CCSP as a Chromium extension. The extension essentially works as a proxy based on the webRequest API.6 It inspects all the incoming HTTP(S) responses looking for CCSP headers: if they are present, the extension parses

6 https://developer.chrome.com/extensions/webRequest

    CSP-Intersect:
    scope *.twimg.com;
    script-src https://*;
    default-src 'none';

This gives contents hosted on twimg.com the ability to relax the content security policy of Twitter to load arbitrary scripts over HTTPS. This is a very liberal behaviour, but it may be a realistic possibility if the team working at abs.twimg.com develops products independently from their final users at Twitter.

We then injected the following CSP-Union header in the script provided by abs.twimg.com:
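A webRequest-based prototype of this kind can be sketched as a header-rewriting listener. This is a minimal illustration under simplifying assumptions: the CSP-Union handling shown here is a plain per-directive union, standing in for the paper's actual composition operators, and the header names are taken from the examples above.

```javascript
// Parse "script-src a b; img-src c" into a Map of directive -> Set of sources.
function parseCsp(policy) {
  const map = new Map();
  for (const part of policy.split(';')) {
    const tokens = part.trim().split(/\s+/).filter(Boolean);
    if (tokens.length === 0) continue;
    map.set(tokens[0], new Set(tokens.slice(1)));
  }
  return map;
}

// Union composition: the result allows a source if either policy allows it.
function unionCsp(a, b) {
  const out = parseCsp(a);
  for (const [dir, srcs] of parseCsp(b)) {
    const cur = out.get(dir) || new Set();
    srcs.forEach(src => cur.add(src));
    out.set(dir, cur);
  }
  return [...out].map(([dir, srcs]) => [dir, ...srcs].join(' ')).join('; ');
}

// Wiring into the extension (requires webRequest permissions in the manifest).
function install() {
  chrome.webRequest.onHeadersReceived.addListener(details => {
    const headers = details.responseHeaders;
    const csp = headers.find(h => h.name.toLowerCase() === 'content-security-policy');
    const union = headers.find(h => h.name.toLowerCase() === 'csp-union');
    if (csp && union) csp.value = unionCsp(csp.value, union.value);
    return { responseHeaders: headers };
  }, { urls: ['<all_urls>'] }, ['blocking', 'responseHeaders']);
}
```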
to dedicate some efforts to foster the integration between their contents and the content security policies of the embedding resources.

The idea of dynamically changing the enforced CSP policy advocated in CCSP is also present in the design of COWL, a confinement system for JavaScript code [12]. COWL assigns information flow labels to contexts (e.g., pages and iframes) and restricts their communication based on runtime label checks. Labels are allowed to change dynamically using meet and join operators, and are implemented on top of CSP, which makes runtime policy composition part of COWL. However, COWL targets more ambitious security goals than (C)CSP by enforcing non-interference on labeled contexts and, as such, it is less flexible and harder to retrofit on existing websites. For these reasons, we believe that COWL and (C)CSP are complementary: one system may be better than the other one, depending on the desired security properties.

CSP Embedded Enforcement is a draft specification by the W3C which allows a protected resource to embed an iframe only if the latter accepts to enforce upon itself an embedder-specified set of restrictions expressed in terms of CSP [17]. The embedder advertises the restrictions using a new Embedding-CSP header including a content security policy, while the embedded content must attach a Content-Security-Policy header including a policy with at least the same restrictions to declare its compliance. It is worth noticing that CSP Embedded Enforcement is a first step towards making the CSP enforcement depend upon an interaction between the protected resource and the content providers, though the problems it addresses are orthogonal to CCSP. Similarly to CSP, CSP Embedded Enforcement asks web developers to get a thorough understanding of the contents they include to write a content security policy for them.

Other papers on CSP studied additional shortcomings of the standard, touching on a number of different issues: ineffectiveness against data exfiltration [13], difficult integration with browser extensions [5], unexpected bad interactions with the Same Origin Policy [10], and sub-optimal protection against code injection [6].

8 Conclusion

We proposed CCSP, an extension of CSP based on runtime policy composition. By shifting part of the policy specification process to content providers and by adding a dynamic dimension to CSP, CCSP reconciles the protection offered by fine-grained white-listing with a reasonable policy specification effort for web developers and a robust support for the highly dynamic nature of common web interactions. We analysed CCSP from different perspectives: security, backward compatibility and deployment cost. Moreover, we assessed its potential impact on the current web and we implemented a working prototype, which we tested on major websites. Our experiments show that popular content providers can deploy CCSP with limited effort, leading to significant benefits for the large majority of the web.

As future work, we plan to implement CCSP directly in Chromium and carry out a large-scale analysis of its effectiveness, including a performance evaluation. We would also like to investigate automated ways to generate CCSP policies for both websites and content providers: since CCSP splits policy specification concerns between these two parties, we hope there is room for simplifying the automated policy generation process and making it more effective than for CSP. Finally, we would like to investigate the problem of supporting robust debugging facilities for CCSP in web browsers.

Acknowledgements

We thank Daniel Hausknecht, Artur Janc, Sebastian Lekies, Andrei Sabelfeld, Michele Spagnuolo and Lukas Weichselbaum for the lively discussions about the cur-
[6] Johns, M. Script-templates for the Content Security Policy. J. Inf. Sec. Appl. 19, 3 (2014), 209–223.

[7] Kerschbaumer, C., Stamm, S., and Brunthaler, S. Injecting CSP for fun and security. In ICISSP (2016), pp. 15–25.

[8] Pan, X., Cao, Y., Liu, S., Zhou, Y., Chen, Y., and Zhou, T. CSPAutoGen: Black-box enforcement of content security policy upon real-world websites. In CCS (2016), pp. 653–665.

[9] Patil, K., and Frederik, B. A measurement study of the Content Security Policy on real-world applications. I. J. Network Security 18, 2 (2016), 383–392.

[10] Some, D. F., Bielova, N., and Rezk, T. On the Content Security Policy violations due to the Same-Origin Policy. In WWW (2017), pp. 877–886.

[11] Stamm, S., Sterne, B., and Markham, G. Reining in the web with Content Security Policy. In WWW (2010), pp. 921–930.

[20] Weissbacher, M., Lauinger, T., and Robertson, W. K. Why is CSP failing? Trends and challenges in CSP adoption. In RAID (2014), pp. 212–233.

A Theory

Our theory of joins and meets is based on a core fragment of CSP called CoreCSP. This fragment captures the essential ingredients of the standard and defines their (denotational) semantics, while removing uninspiring low-level details.

A.1 CoreCSP

We presuppose a denumerable set of strings, ranged over by str. The syntax of policies is shown in Table 5, where we use dots (...) to denote additional omitted elements of a syntactic category (we assume the existence of an arbitrary but finite number of these elements).

This is a rather direct counterpart of the syntax of CSP. The most notable points to mention are the following:
1. we assume the existence of a distinguished scheme il, used to identify inline scripts and stylesheets. This scheme cannot occur inside policies, but it is convenient to define their formal semantics;

2. we do not discriminate between hashes and nonces in source expressions, since this is unimportant at our level of abstraction. Rather, we uniformly represent them using the source expression inline(str), where str is a string which uniquely identifies the white-listed inline script or stylesheet;

3. we define directive values as sets of source expressions, rather than lists of source expressions. This difference is uninteresting in practice, since source expression lists are always parsed as sets;

4. for simplicity, we do not model ports and paths in the syntax of source expressions.

To simplify the formalization, we only consider well-formed policies, according to the following definition.

Assumption 2 (Typing of Objects). We assume that objects are typed. Formally, this means that O is partitioned into the subsets O_t1, ..., O_tn, where t1, ..., tn are the available content types. We also assume that, for all objects o = ((il, str′), str), we have o ∈ O_script ∪ O_style.

The judgement se ⊢s L defines the semantics of source expressions. It reads as: the source expression se allows the subject s to include contents from the locations L. The formal definition is given in Table 6, where we let ▷ be the smallest reflexive relation on schemes such that http ▷ https. The judgement generalizes to values by having: v ⊢s {l | ∃se ∈ v, ∃L ⊆ ℒ : se ⊢s L ∧ l ∈ L}.

Table 6 (semantics of source expressions):

self ⊢s {π1(s)}        sc ⊢s {l | π1(l) = sc}

∗ ⊢s {l | π1(l) ∉ {data, blob, filesys, il}}

str ⊢s {l | π1(s) ▷ π1(l) ∧ π2(l) = str}

∗.str ⊢s {l | π1(s) ▷ π1(l) ∧ ∃str′ : π2(l) = str′.str}
We then define the lookup operator d⃗ ↓ t as follows:

d⃗ ↓ t =  d⃗.t   if d⃗.t ≠ ⊥
         v     if d⃗.t = ⊥ ∧ d⃗ = d⃗1, default-src v, d⃗2 ∧ ∀d ∈ {d⃗1} : d ≠ default-src v′
         {∗}   otherwise

The judgement p ⊢s,t O defines the semantics of policies. It reads as: the policy p allows the subject s to include as contents of type t the objects O. The formal definition is given in Table 7.

(D-VAL) if d⃗ ↓ t = v and v ⊢s L, then d⃗ ⊢s,t {o ∈ Ot | π1(o) ∈ L}

(D-CONJ) if p1 ⊢s,t O1 and p2 ⊢s,t O2, then p1 + p2 ⊢s,t O1 ∩ O2

We define normal source expressions and normal directive values accordingly.

Definition 7 (Normalization). Given a source expression se and a subject s, we define the normalization of se under s.

Lemma 1 (Properties of Normalization). The following properties hold true:

1. for all policies p and subjects s, ⟨p⟩s is normal;

2. for all policies p, subjects s and content types t, we have p ⊢s,t O if and only if ⟨p⟩s ⊢s,t O;

3. for all normal policies p, subjects s1, s2 and content types t, we have that p ⊢s1,t O1 and p ⊢s2,t O2 imply O1 = O2.

A.2 Technical Preliminaries

We now introduce the technical ingredients needed to implement our proposal. We start by defining a binary relation ⊑src on normal source expressions. Intuitively, we have se1 ⊑src se2 if and only if se1 denotes no more locations than se2 (for all subjects).

Definition 8 (⊑src Relation). We let ⊑src be the least reflexive relation on normal source expressions defined by the following rules:

1. a policy p may enforce multiple restrictions on the same content type t, specifically when p = p1 + p2 for some p1, p2. In this case, multiple directive values must be taken into account when reasoning about the inclusion of contents of type t;
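Read as code, the lookup operator d⃗ ↓ t is a first-match search over the directive list. The encoding below (arrays of {type, value} records) is a hypothetical convenience, not the paper's syntax:

```javascript
// Lookup d⃗ ↓ t: the set of sources governing content type t.
function lookup(directives, t) {
  const exact = directives.find(d => d.type === t);
  if (exact) return exact.value;                       // d⃗.t, when defined
  const def = directives.find(d => d.type === 'default-src');
  if (def) return def.value;                           // first default-src, if any
  return new Set(['*']);                               // {*} otherwise
}
```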
Abstract

The term Same-Origin Policy (SOP) is used to denote a complex set of rules which governs the interaction of different Web Origins within a web application. A subset of these SOP rules controls the interaction between the host document and an embedded document, and this subset is the target of our research (SOP-DOM). In contrast to other important concepts like Web Origins (RFC 6454) or the Document Object Model (DOM), there is no formal specification of the SOP-DOM.

In an empirical study, we ran 544 different test cases on each of the 10 major web browsers. We show that in addition to Web Origins, access rights granted by SOP-DOM depend on at least three attributes: the type of the embedding element (EE), the sandbox attribute, and CORS attributes. We also show that due to the lack of a formal specification, different browser behaviors could be detected in approximately 23% of our test cases. The issues discovered in Internet Explorer and Edge have also been acknowledged by Microsoft (MSRC Case 32703). We discuss our findings in terms of read, write, and execute rights in different access control models.

1 Introduction

The Same-Origin Policy (SOP) is perhaps the most important security mechanism for protecting web applications, and receives high attention from developers and browser vendors.

Complex Set of SOP Rules. Today there is no formal definition of the SOP itself. Web Origins as described in RFC 6454 are the basis for the SOP, but they do not formally define the SOP. Documentation provided by standardization bodies [1] or browser vendors [2] is still incomplete. Our evaluation of related work has shown that the SOP does not have a consistent description – both in the academic and non-academic world (e.g., [15, 16, 5]). Therefore, recurrent browser bugs enabling SOP bypasses are not surprising.

SOP rules can roughly be classified according to the problem areas which they were designed to solve (cf. Table 1). It is impossible to cover all these subsets in a single research paper, and it may even be impossible to find a "unifying formula" which covers all subsets.1 However, it is possible to cover single subsets, as previous work on HTTP cookies has shown [12]. Thus, we restricted our attention to the following research questions:

▶ How is SOP for DOM access (SOP-DOM) implemented in modern browsers?

▶ Which parts of the HTML markup influence SOP-DOM?

▶ How does the detected behavior match known access control policies?

More precisely, we concentrate on a subset of SOP rules according to the following criteria:

▶ Web Origins. We use RFC 6454 as a foundation.

▶ Browser Interactions. We concentrate on the interaction of web objects once they have been loaded.

It is a difficult task to select a test set for SOP-DOM, which has constantly evolved over nearly two decades. The SOP-DOM has been adapted several times to include new features (e.g., CORS) and to prevent new attacks. 15 out of 142 HTML elements have a URL attribute and may thus have a different Web Origin [17]. Additionally, sandbox and CORS attributes also modify SOP-DOM.

The Need for Testing. Amongst web security researchers, SOP-DOM is partially common knowledge, but not thoroughly documented. Although this means that most researchers are familiar with many edge cases in SOP-DOM, especially those relating to attacks and countermeasures, it is likely that some of those edge cases will not be covered in this paper. Additionally, each individual researcher will be unaware of other edge cases, which may include novel vulnerabilities. For example, it is well known that JavaScript code from a different web origin has full read and write access to the host document; nevertheless, Lekies et al. [5] recently pointed out that there is also read access from the host document to JavaScript code, which may constitute a privacy problem.

1 For example, the SOP rules for DOM access and HTTP cookies are inconsistent, because their concept of "origin" differs.

Additionally, HTML5 has brought greater diversity to seemingly well-known HTML elements.

Contributions. We make the following contributions:

▶ We systematically test edge cases of the SOP that have not been previously documented, like the influence of the embedding element, and the CORS and

Figure: a host document (HD) with its own Web Origin embeds, via an embedding element (EE) such as <iframe src="URL2" id="ID1"> or <script src="URL1">, an embedded document (ED) with Web Origin ED; the SOP mediates read and write access between the JavaScript subjects and web objects of the two documents, and whether script execution is allowed, depending on {ee, sandbox, cors}.
4 Evaluation
We implemented a testbed as a web application which
automatically evaluates the SOP implementation of the
currently used browsers. Additionally, it displays the re-
sults of 10 tested browsers from six different vendors and
highlights the differences between them. Our testbed is
publicly available at www.your-sop.com.
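One building block of such a testbed, deciding whether the host document can read into an embedded document, can be sketched as follows (an illustrative sketch, not the actual your-sop.com test code):

```javascript
// Probe read access from the host document (HD) into an embedded
// document (ED): cross-origin access to contentDocument throws a
// SecurityError, while same-origin access returns the document object.
function canRead(iframe) {
  try {
    return iframe.contentDocument !== null;
  } catch (e) {
    return false; // blocked by the SOP
  }
}
```

A full test harness would create one iframe per (embedding element, sandbox, CORS) combination and record the read, write, and execute outcome for each.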
[24, 25, 26]. However, a complete separation between two HTML documents is often not possible; for example, to allow UI redressing countermeasures with JavaScript frame-busters [22].

To allow a better separation between the iframe ED and the HD, sandboxed iframes were introduced [27]. We limited our evaluation to the attribute values that directly affect our read, write, and execute results: allow-scripts, allow-same-origin, and allow-top-navigation. The sandbox attribute is a special case that is discussed in Section 7.

Recommendations for Browser Vendors. From the perspective of a browser vendor, it is interesting to know how the results of our tool can be used to identify bugs and therefore potential vulnerabilities. In our analysis, we automatically compared each SOP-DOM difference with the behavior of all other browsers. In case at least one browser grants SOP-DOM access that the other browsers restrict, a browser vendor should have a closer look at this test case. We recommend adjusting the SOP-DOM behavior to the majority of other browser behaviors for reasons of clarity. For each test, our website recommends a result, which is based on the majority of all ten tested browsers (see Figure 4). Because our testbed includes browsers of different vendors (e.g., Apple, Google, Mozilla, Microsoft), we believe that this might be a representative SOP-DOM result.

cross-origin, FF, IE, and Edge do not allow read access in the following CORS cases of <canvas> with SVG and PNG: Access-Control-Allow-Origin: your-sop.com (ED sets the domain of HD) and Use-Credentials: true. Irrespective of CORS, <canvas> and SVG have 44 differences that are based on a denied access in IE 11.9

Second, over 12% of the test cases show differences between Safari 9 and the other browsers when looking at <object> and <embed> elements that load SVG files. Safari 9 does not show an SVG if it is loaded by code like <object data="image.svg"></object>; therefore, JavaScript code contained in the SVG file cannot be executed. It needs an additional type attribute with the value image/svg+xml such that JavaScript execution is allowed. Since Safari 10.1, Apple has changed the implementation and both elements behave similarly to the other browsers; the attribute type="image/svg+xml" is no longer required.

Third, over 51% of the test cases show different behaviors because of <link>. Nearly all of these cases have different CORS implementations. CORS thus shows that a relatively new and complex technology leads to different interpretations of "well-known" web concepts like SOP.

Similarly to Chromium's testbed, which has been applied to other browsers to find bugs, our testbed could be used and extended by browser vendors and security researchers to identify browser differences leading to exploits.10

tial read access with the help of CORS from HD to ED
8 http://www.your-sop.com/stats.php
that they have fixed them in the newest browser versions.
10 https://github.com/thomaspatzke/BrowserCrasher
We verified our login oracle with the startpage service SVG. We only covered <svg> as an ED which directly
start.me (ED); an attacker is clearly able to decide embeds the JavaScript code for testing read/write access.
whether a user is logged in or not. This attack is sim- It is also possible, to use <svg> as a HD; for example,
ilar to [5]. We have informed the website administra- an external JavaScript can be loaded by using <svg><
tors about this vulnerability. Microsoft (Research Center, script xlink:href=".."></svg>. Our testbed
MSRC) acknowledged this bug (Case 32703) and the fix always uses an HTML document as HD.
will be incorporated into a future version of IE/Edge.
5 Limitations

Even if we restrict our attention to SOP-DOM, the Same-Origin Policy has a very large scope. We have 15 HTML elements with src attributes, and several more with similar functionality (e.g., <canvas>). There are six different sandbox attributes, and they (e.g., the CORS attribute) may be influenced by HTTP-based security policies like CSP. There are many different ways to embed a document of a given MIME type into a webpage (e.g., SVG via <img> or <iframe>), and there are many different MIME types with and without a DOM structure to consider. There are pseudoprotocols like data: and about:, which have different Web Origin definitions. There is also a large number of DOM properties which could be tested for partial access.

Covering all interactions within this scope would result in an exponential number of test cases, which cannot be covered in one research paper. For example, Zalewski [28] lists four classes of common URL schemes (e.g., document-fetching and third-party) consisting of different subclasses (e.g., browser-specific schemes like vbscript, firefoxurl, and cf). Moreover, it is possible to register self-defined handlers for particular schemes via registerProtocolHandler. In this section, we therefore discuss several technologies that were not covered by our testbed.

JavaScript. We only cover a small, but hopefully representative, set of DOM properties. Our testbed only covers the location property, but sub-properties such as location.hash or location.path were not analyzed. The same holds for the window.name property, which is well known to be writable across origins.

A design decision for our testbed was to be able to easily execute all tests simultaneously. Therefore, a single index.html runs all 544 tests with only one click by the user. For this reason, we excluded popups and the corresponding window.opener property.

Other MIME Types. Our testbed is limited to HTML, JavaScript, CSS, and SVG. For example, it would be interesting to investigate PDF, which can also include JavaScript code. There are many more active MIME types, such as Flash or ActiveX, which should be addressed in further research.

Pseudoprotocols. We excluded pseudoprotocols (e.g., about:, chrome:) and data: and javascript: URIs from our tests because, in a (possibly outdated) overview, Zalewski [28] already pointed out that there are different Web Origin assignments in different browser implementations. However, extending the testbed to selected pseudoprotocols is future work.
Attribute-Based Access Control (ABAC) [49] is a flexible access control mechanism used in, for example, XACML [50]. It may also be used to implement RBAC: roles can be modeled as role attributes assigned to both subject and object. The policy decision in ABAC may depend on other subject, object, and environment attributes as well.

Definition 4 Let A_i = {NULL, value_i^1, ..., value_i^{k_i}} be the set of different values of attribute i. Let SA = A_1 × ... × A_l, OA = A_{l+1} × ... × A_m, and EA = A_{m+1} × ... × A_n be the Cartesian products of all subject, object, and environment attribute values. Let R be the set of all access rights. Then an ABAC policy P is defined as P ⊆ SA × OA × EA × R.

Now let sa be the array of subject attributes of subject s, oa the array of object attributes of object o, and ea the actual array of environment attributes. Then subject s has access r ∈ R to object o if the array a, formed by concatenating sa, oa, ea, and r, is contained in P: a ∈ P.

ABAC could be suitable for SOP-DOM because we can model any parameter that influences the access decisions as an attribute. This allows us to give a unified treatment to some well-known concepts.

Extended Web Origins. Both subject and object have attributes from which their Web Origin can be computed. In the classical definition of Web Origins in RFC 6454 these are protocol (location.protocol), domain (location.hostname), and port (location.port).

– We can, for example, extend this definition to take the legacy document.domain declaration into account (see below). We define

Additional Attributes. Similar to the ee attribute, the cors and sandbox attributes are only defined for the embedded document ED. For cors, our tests revealed that this attribute modifies access rights to a web object; therefore, it is only an object attribute.

Attributes not fixed by the HTML source code. The ABAC model also defines environment attributes, which may not depend on subject or object alone but rather on the execution environment. The only attribute we could qualify to be in EA during our tests is sandbox, since it may be set interactively by using a suitable directive of Content Security Policy.

Extended Web Origin. The ABAC model for SOP-DOM can be presented as the set P, but this does not give any insights into the structure of SOP-DOM. However, four of the seven variables can be combined into a very elegant description of an extended Web Origin. This shows that the ABAC model can also be used to simplify the description of SOP-DOM.

    Read(protocol, domain, port, dd);
    if dd = NULL or (dd is not a superdomain)
        then wo := (protocol, domain, port)
        else wo := (protocol, dd, NULL)

Listing 5: Computation of extended Web Origin.

Listing 5 shows how an extended web origin is computed from the four given ABAC variables. Please note that the else branch of this algorithm has been verified by our testbed, but different descriptions exist in the literature. In contrast to previous descriptions of the interaction of Web Origins and the document.domain
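The logic of Listing 5 can be sketched in Python; this is an illustrative reimplementation, not the paper's testbed code, and the helper is_superdomain is an assumed name.

```python
# Illustrative sketch of Listing 5: the extended Web Origin derived from
# the four ABAC variables. `dd` models a document.domain declaration
# (None if absent). `is_superdomain` is a hypothetical helper.

def is_superdomain(dd, domain):
    # dd is a superdomain of domain if domain equals dd or ends with ".dd".
    return domain == dd or domain.endswith("." + dd)

def extended_web_origin(protocol, domain, port, dd):
    if dd is None or not is_superdomain(dd, domain):
        # Classical RFC 6454 origin triple.
        return (protocol, domain, port)
    # document.domain was set to a superdomain: the port becomes NULL.
    return (protocol, dd, None)

# Two frames on sub.example.org and www.example.org that both declare
# document.domain = "example.org" end up with the same extended origin.
print(extended_web_origin("https", "sub.example.org", "443", "example.org"))
print(extended_web_origin("https", "www.example.org", "443", "example.org"))
```

Note how the else branch drops the port, which is exactly the behavior the testbed verified.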
Our analysis highlights the importance of evaluating every single possibility of browser interactions in the SOP-DOM. Different browser data sets can be used to identify inconsistencies across implementations, which can lead to security vulnerabilities. Although edge cases (CORS, sandbox attribute) are mainly responsible for the detected browser behaviors in our evaluation, commonly known cases can also have differences and even vulnerabilities. Consequently, browser vendors have to compare their own implementation with those of other vendors.

Our discussion of access control policies as a model to describe the SOP-DOM contributes to a better understanding. Browser implementers can use our insights to describe the SOP-DOM implementation more formally and thus preemptively prevent SOP bypasses. We strongly believe that a more formal SOP-DOM definition will help the scientific as well as the pentesting community to find more severe vulnerabilities. Our test results for the ten tested browsers are available on the testbed website.

Future Work. To extend the coverage, future work may address the following areas: (1) local storage/session storage or even new data types like Flash or PDF; (2) different protocols, including pseudo-protocols like about: and data:; (3) other elements with URL attributes or properties; (4) additional HTML attributes. To generate novel insights into SOP-DOM, the path taken by integrating the document.domain declaration could be extended to other attributes like ee; for

[5] S. Lekies, B. Stock, M. Wentzel, and M. Johns, "The unexpected dangers of dynamic JavaScript," in USENIX Security 2015, ser. SEC'15. Berkeley, CA, USA: USENIX Association, 2015, pp. 723–735.

[6] E. Z. Yang, D. Stefan, J. C. Mitchell, D. Mazières, P. Marchenko, and B. Karp, "Toward principled browser security," in HotOS. USENIX Association, 2013.

[7] K. Singh, A. Moshchuk, H. J. Wang, and W. Lee, "On the incoherencies in web browser access control policies," in Proceedings of the 2010 IEEE Symposium on Security and Privacy, ser. SP '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 463–478.

[8] M. Zalewski, "Browser security handbook," Google Code, 2010.

[9] A. van Kesteren, "Cross-origin resource sharing," W3C, W3C Recommendation, Jan. 2014, http://www.w3.org/TR/2014/REC-cors-20140116/.

[10] W. Alcorn, C. Frichot, and M. Orrù, The Browser Hacker's Handbook. John Wiley & Sons, 2014.

[11] G. S. Kalra, "Exploiting insecure crossdomain.xml to bypass same origin policy (actionscript poc),"

[20] W3C, "HTML: The markup language (an HTML language reference)," https://www.w3.org/TR/2012/WD-html-markup-20121025/elements.html, February 2017.

[21] WHATWG, "The elements of HTML," https://html.spec.whatwg.org/multipage/semantics.html, February 2017.

[32] C. Jackson and A. Barth, "Beware of finer-grained origins," in Web 2.0 Security and Privacy (W2SP 2008), 2008. [Online]. Available: http://seclab.stanford.edu/websec/origins/fgo.pdf

[33] H. J. Wang, C. Grier, A. Moshchuk, S. T. King, P. Choudhury, and H. Venter, "The multi-principal OS construction of the Gazelle web browser," in
A pure protocol is in some sense "pure and simple". For each input v1, the set {y | v1 ∈ Support(y)} identifies all outputs y that "support" v1, and can be called the support set of v1. A pure protocol requires that the probability that any value v1 is mapped to its own support set be the same for all values. We use p* to denote this probability. In order to satisfy LDP, it must be possible for a value v2 ≠ v1 to be mapped to v1's support set. It is required that this probability, which we use q* to denote, be the same for all pairs of v1 and v2. Intuitively, we want p* to be as large as possible, and q* to be as small as possible. However, satisfying ε-LDP requires that p*/q* ≤ e^ε.

Basic RAPPOR is pure with p* = 1 − f/2 and q* = f/2. RAPPOR is not pure because there does not exist a suitable q*, due to collisions in mapping values to bit vectors. Assuming the use of two hash functions, if v1 is mapped to [1,1,0,0], v2 is mapped to [1,0,1,0], and v3 is mapped to [0,0,1,1], then because [1,1,0,0] differs from [1,0,1,0] by only two bits, and from [0,0,1,1] by four bits, the probability that v2 is mapped to v1's support set is higher than that of v3 being mapped to v1's support set.

For a pure protocol, let y_j denote the value submitted by user j; a simple aggregation technique to estimate the number of times that i occurs is as follows:

    c~(i) = ( Σ_j 1_{Support(y_j)}(i) − n·q* ) / (p* − q*)        (1)

The intuition is that each output that supports i gives a count of 1 for i. However, this needs to be normalized, because even if every input is i, we only expect to see n·p* outputs that support i, and even if input i never occurs, we expect to see n·q* supports for it. Thus the original range of 0 to n is "compressed" into an expected range of n·q* to n·p*. The linear transformation in (1) corrects this effect.

Theorem 1 For a pure LDP protocol PE and Support, (1) is unbiased, i.e., ∀i E[c~(i)] = n·f_i, where f_i is the fraction of times that the value i occurs.

Proof 1

    E[c~(i)] = E[ ( Σ_j 1_{Support(y_j)}(i) − n·q* ) / (p* − q*) ]
             = ( n·f_i·p* + n·(1 − f_i)·q* − n·q* ) / (p* − q*)
             = n·f_i

Theorem 2 For a pure LDP protocol PE and Support, the variance of the estimation c~(i) in (1) is:

    Var[c~(i)] = n·q*·(1 − q*) / (p* − q*)^2 + n·f_i·(1 − p* − q*) / (p* − q*)        (2)

Proof 2 The random variable c~(i) is the (scaled) summation of n independent random variables drawn from the Bernoulli distribution. More specifically, n·f_i (resp. (1 − f_i)·n) of these random variables are drawn from the Bernoulli distribution with parameter p* (resp. q*). Thus,

    Var[c~(i)] = Var[ ( Σ_j 1_{Support(y_j)}(i) − n·q* ) / (p* − q*) ]
               = Σ_j Var[1_{Support(y_j)}(i)] / (p* − q*)^2
               = ( n·f_i·p*·(1 − p*) + n·(1 − f_i)·q*·(1 − q*) ) / (p* − q*)^2
               = n·q*·(1 − q*) / (p* − q*)^2 + n·f_i·(1 − p* − q*) / (p* − q*)        (3)

In many application domains, the vast majority of values appear very infrequently, and one wants to identify the more frequent ones. The key to avoiding lots of false positives is to have low estimation variances for the infrequent values. When f_i is small, the variance in (2) is dominated by the first term. We use Var* to denote this approximation of the variance, that is:

    Var*[c~(i)] = n·q*·(1 − q*) / (p* − q*)^2        (4)

We also note that some protocols have the property that p* + q* = 1, in which case Var* = Var.

As the estimation c~(i) is the sum of many independent random variables, its distribution is very close to a normal distribution. Thus, the mean and variance of c~(i) fully characterize the distribution of c~(i) for all practical purposes. When comparing different methods, we observe that, fixing ε, the differences are reflected in the constants for the variance, which is where we focus our attention.
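The estimator in equation (1) can be exercised with a small simulation; this is an illustrative sketch using Basic RAPPOR's parameters p* = 1 − f/2 and q* = f/2, not the paper's implementation.

```python
# Minimal sketch of the pure-protocol aggregation in equation (1).
# Each user whose true value is i reports "support for i" with
# probability p*, and with probability q* otherwise (the defining
# property of a pure protocol). Parameters follow Basic RAPPOR.
import random

random.seed(0)

def estimate(support_count, n, p_star, q_star):
    """c~(i) = (sum_j 1_Support(y_j)(i) - n*q*) / (p* - q*)."""
    return (support_count - n * q_star) / (p_star - q_star)

n, f_i = 100_000, 0.3          # n users; true frequency of value i
f = 0.5                        # Basic RAPPOR randomization parameter
p_star, q_star = 1 - f / 2, f / 2

supports = sum(
    random.random() < (p_star if random.random() < f_i else q_star)
    for _ in range(n)
)
print(estimate(supports, n, p_star, q_star))  # close to n*f_i = 30000
```

Since p* + q* = 1 here, the approximation Var* in (4) equals the exact variance (2), as noted in the text.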
– Binary Local Hashing (BLH) uses hash functions that output a single bit. This is equivalent to the random matrix projection technique in [6].

Proof 4 For any inputs v1, v2, and output B, we have

    Pr[B|v1] / Pr[B|v2] = Π_{i∈[d]} Pr[B[i]|v1] / Π_{i∈[d]} Pr[B[i]|v2]
                        = ( Pr[B[v1]|v1]·Pr[B[v2]|v1] ) / ( Pr[B[v1]|v2]·Pr[B[v2]|v2] )
                        ≤ e^{ε/2} · e^{ε/2} = e^ε

[Figure panels: (a) Vary ε; (b) Vary ε (fixing d = 2^10). Axes: Var (10^2 to 10^4, log scale) vs. ε ∈ [0.5, 5].]
OUE and OLH, whose variances do not depend on d. Intuitively, it is easier to transmit a piece of information when it is binary, i.e., d = 2. As d increases, one needs to "pay" for this increase in source entropy by having higher variance. However, it seems that there is a cap on the "price" one must pay no matter how large d is; i.e., OLH's variance does not depend on d and is always 4 times that of DE with d = 2. There may exist a deeper reason for this rooted in information theory. Exploring these questions is beyond the scope of this paper.

6 Experimental Evaluation

We empirically evaluate these protocols on both synthetic and real-world datasets. All experiments are performed ten times, and we plot the mean and standard deviation.

6.1 Verifying Correctness of Analysis

The conclusions we drew above are based on analytical variances. We now show that our analytical results for the variances match the empirically measured squared errors. For the empirical data, we issue queries using the protocols and measure the average of the squared errors, namely (1/d)·Σ_{i∈[d]} [c~(i) − n·f_i]^2, where f_i is the fraction of users taking value i. We run queries for all i values and repeat ten times. We then plot the average and standard deviation of the squared error. We use synthetic data generated by following Zipf's distribution (with distribution parameter s = 1.1 and n = 10,000 users), similar to the experiments in [13].

Figure 2 gives the empirical and analytical results for all methods. In Figures 2(a) and 2(b), we fix ε = 4 and vary the domain size. For sufficiently large d (e.g., d ≥ 2^6), the empirical results match very well with the analytical results. When d < 2^6, the analytical variance tends to underestimate the variance, because in (4) we ignore the f_i terms. The standard deviation of the measured squared error from different runs also decreases when the domain size increases. In Figures 2(c) and 2(d), we fix the domain size to d = 2^10 and vary the privacy budget. We can see that the analytical results match the empirical results for all ε values and all methods.

In practice, since the group size g of OLH can only be
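The methodology of Section 6.1 can be reproduced in miniature for one protocol. This sketch uses Direct Encoding with the standard k-ary randomized response parameters p* = e^ε / (e^ε + d − 1) and q* = 1 / (e^ε + d − 1); these parameters are an assumption stated for self-containment, not quoted from this excerpt.

```python
# Illustrative check that the empirical average squared error is on the
# order of the analytical Var* in (4), for Direct Encoding (k-ary
# randomized response; parameter choice assumed, see lead-in).
import math, random

random.seed(1)
eps, d, n = 4.0, 2 ** 6, 10_000
p_star = math.exp(eps) / (math.exp(eps) + d - 1)
q_star = 1 / (math.exp(eps) + d - 1)

# True inputs: every user holds value 0, i.e., f_0 = 1.
counts = [0] * d
for _ in range(n):
    if random.random() < p_star:
        counts[0] += 1                       # report the true value
    else:
        counts[random.randrange(1, d)] += 1  # any other value, uniformly

est = [(counts[i] - n * q_star) / (p_star - q_star) for i in range(d)]
sq_err = sum((est[i] - (n if i == 0 else 0)) ** 2 for i in range(d)) / d
analytical = n * q_star * (1 - q_star) / (p_star - q_star) ** 2  # Var* in (4)
print(f"empirical {sq_err:.0f} vs analytical {analytical:.0f}")
```

As the text notes, Var* slightly underestimates the measured error because the f_i term of (2) is ignored; here that term is concentrated on the single frequent value.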
[Figure 2 panels: (a) Vary d (fixing ε = 4); (b) Vary d (fixing ε = 4); (c) Vary ε (fixing d = 2^10); (d) Vary ε (fixing d = 2^10). Axes: Var (log scale) vs. d ∈ {2^2, ..., 2^18} or ε ∈ [0.5, 5].]
[Figure 5 panels: (a) Number of True Positives; (b) Number of False Positives; threshold from 5000 to 20000.]

Figure 5: Results on the Kosarak dataset. The y axes are the numbers of identified hash values that are true/false positives. The x axes are the threshold. We assume ε = 4.
which we have presented in Section 2 and compared with in detail in the paper. These protocols use ideas from earlier work [20, 9]. Our proposed Optimized Unary Encoding (OUE) protocol builds upon the Basic RAPPOR protocol in [13], and our proposed Optimized Local Hashing (OLH) protocol is inspired by BLH in [6]. Wang et al. [23] use both generalized random response (Section 4.1) and Basic RAPPOR for learning weighted histograms. Some researchers use existing frequency estimation protocols as primitives to solve other problems in the LDP setting. For example, Chen et al. [8] use BLH [6] to learn location information about users. Qin et al. [22] use RAPPOR [13] and BLH [6] to estimate frequent items where each user has a set of items to report. These can benefit from the introduction of OUE and OLH in this paper.

There are other interesting problems in the LDP setting beyond frequency estimation. In this paper we do not study them. One problem is to identify frequent values when the domain of possible input values is very large or even unbounded, so that one cannot simply obtain estimations for all values to identify which ones are frequent. This problem is studied in [17, 6, 16]. Another problem is estimating frequencies of itemsets [14, 15]. Nguyên et al. [21] studied how to report numerical answers (e.g., time of usage, battery volume) under LDP. When these protocols use frequency estimation as a building block (such as in [16]), they can directly benefit from the results in this paper. Applying the insights gained in our paper to better solve these problems is interesting future work.

Kairouz et al. [18] study the problem of finding the optimal LDP protocol for two goals: (1) hypothesis testing, i.e., telling whether the users' inputs are drawn from distribution P0 or P1, and (2) maximizing mutual information between input and output. We note that these goals are different from ours. Hypothesis testing does not reflect dependency on d. Mutual information considers only a single user's encoding, and not aggregation accuracy. For example, both global and local hashing have exactly the same mutual information characteristics, but they have very different accuracy for frequency estimation, because of collisions in global hashing. Nevertheless, it is found that for very large ε's, Direct Encoding is optimal, and for very small ε's, BLH is optimal. This is consistent with our findings. However, the analysis in [18] did not lead to generalization and optimization of binary local hashing, nor does it provide concrete suggestions on which method to use for given ε and d values.

8 Discussion

On answering multiple questions. In the setting of traditional DP, the privacy budget is split when answering multiple queries. In the local setting, previous work follows this tradition and lets the users split the privacy budget evenly and report on multiple questions. Instead, we suggest partitioning the users randomly into groups, and letting each group of users answer a separate question. We now compare the utilities of these approaches.

Suppose there are Q ≥ 2 questions. We calculate variances on one question. Since there are different numbers of users in the two cases (n versus n/Q), we normalize the estimations into the range from 0 to 1. In OLH, the variance is

    σ^2 = Var*[c~_OLH(i)/n] = 4e^ε / ((e^ε − 1)^2 · n).

When partitioning the users, n/Q users answer one question, rendering

    σ1^2 = 4Q·e^ε / ((e^ε − 1)^2 · n);

when the privacy budget is split, ε/Q is used for one question, and we have

    σ2^2 = 4e^{ε/Q} / ((e^{ε/Q} − 1)^2 · n).

We want to show σ1^2 < σ2^2:
Let z = e^{ε/Q}. Then

    σ2^2 − σ1^2 ∝ (z^Q − 1)^2 − Q·z^{Q−1}·(z − 1)^2
                = (z − 1)^2 · [ (z^{Q−1} + z^{Q−2} + ... + 1)^2 − Q·z^{Q−1} ] > 0,

where the last inequality follows from the inequality of arithmetic and geometric means. Therefore, σ1^2 is always smaller than σ2^2. Thus the utility of partitioning users is better than that of splitting the privacy budget.

Limitations. The current work only considers the framework of pure LDP protocols. It is not known whether a protocol that is not pure will produce more accurate results or not. Moreover, current protocols can only handle the case where the domain is limited, or a dictionary is available. Other techniques are needed when the domain size is very big.

[6] Bassily, R., and Smith, A. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (2015), ACM, pp. 127–135.

[7] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.

[8] Chen, R., Li, H., Qin, A. K., Kasiviswanathan, S. P., and Jin, H. Private spatial data aggregation in the local setting. In 32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16–20, 2016 (2016), pp. 289–300.
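The claim above that σ1^2 < σ2^2 (partitioning users beats splitting the budget) can be checked numerically; this sketch simply evaluates the two closed-form variances from the discussion.

```python
# Numeric sanity check (illustrative) of the variance comparison:
#   sigma1^2 = 4*Q*e^eps / ((e^eps - 1)^2 * n)   (partition users)
#   sigma2^2 = 4*e^(eps/Q) / ((e^(eps/Q) - 1)^2 * n)   (split budget)
import math

def var_partition(eps, Q, n):
    return 4 * Q * math.exp(eps) / ((math.exp(eps) - 1) ** 2 * n)

def var_split(eps, Q, n):
    z = math.exp(eps / Q)
    return 4 * z / ((z - 1) ** 2 * n)

n = 10_000
for eps in (0.5, 1.0, 2.0, 4.0):
    for Q in (2, 5, 10):
        assert var_partition(eps, Q, n) < var_split(eps, Q, n)
print("partitioning wins for all tested (eps, Q)")
```

The gap widens quickly with Q, since splitting the budget forces each report to use only ε/Q.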
9 Conclusion

In this paper, we study frequency estimation in the Local Differential Privacy (LDP) setting. We have introduced a framework of pure LDP protocols together with a simple and generic aggregation and decoding technique. This framework enables us to analyze, compare, generalize, and optimize different protocols, significantly improving our understanding of LDP protocols. More concretely, we have introduced the Optimized Local Hashing (OLH) protocol, which has much better accuracy than previous frequency estimation protocols satisfying LDP. We provide a guideline as to which protocol to choose in different scenarios. Finally, we demonstrate the advantage of OLH on both synthetic and real-world datasets.

10 Acknowledgment

This paper is based on work supported by the United States National Science Foundation under Grant No. 1640374.

References

[1] Apple's differential privacy is about collecting your data, but not your data. https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/.

[9] Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy and statistical minimax rates. In FOCS (2013), pp. 429–438.

[10] Dwork, C. Differential privacy. In ICALP (2006), pp. 1–12.

[11] Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC (2006), pp. 265–284.

[12] Dwork, C., and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3–4 (2014), 211–407.

[13] Erlingsson, Ú., Pihur, V., and Korolova, A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (2014), ACM, pp. 1054–1067.

[14] Evfimievski, A., Gehrke, J., and Srikant, R. Limiting privacy breaches in privacy preserving data mining. In PODS (2003), pp. 211–222.

[15] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy preserving mining of association rules. In KDD (2002), pp. 217–228.
[Figure 6 panels: (a) Zipf's Top 100 Values; (b) Normal Top 100 Values; (c) Zipf's Top 50 Values; (d) Zipf's Top 10 Values. Axes: Var (10^5 to 10^8, log scale) vs. ε ∈ [0.5, 5].]

Figure 6: Average squared errors on estimating a distribution of 10 million points. RAPPOR is used with a 128-bit-long Bloom filter and 2 hash functions.
[Figure 8 panels: (a) Number of True Positives, ε = 4; (b) Number of False Positives, ε = 4; (c) Number of True Positives, ε = 2; (d) Number of False Positives, ε = 2; (e) Number of True Positives, ε = 1; (f) Number of False Positives, ε = 1.]

Figure 8: Results on the Rockyou dataset for ε = 4, 2, and 1. The y axes are the numbers of identified hash values that are true/false positives. The x axes are the threshold.
clients. Furthermore, the privatized results obtained from the opt-in group and from the clients are then "blended" in a way that takes into account the privatization algorithms used for each group, thereby, again, achieving an improved utility over a less-informed combination of data from the two groups.

The problem of enabling privacy-preserving local search using past search histories can be viewed as the task of identifying the most frequent search records among the population of users, and estimating their underlying probabilities (both in a differential privacy-preserving manner). In this context, we call the data collected from the users search records, where each search record is a pair of strings of the form ⟨query, URL⟩, representing a query that a user posed followed by the URL that the user subsequently clicked. We denote by p⟨q,u⟩ the true underlying probability of the search record ⟨q, u⟩ in the population. We assume that our system receives a sample of users from the population, each holding their own collection of private data drawn independently and identically from the distribution over all records p. Its goal is to output an estimate p̂ of the probabilities of the most frequent search records, while preserving differential privacy (in the trusted curator model) for the opt-in users and (in the local model) for the clients.

Informal Overview of Blender: Figure 1 presents an architectural diagram of Blender. Blender serves as the trusted curator for the opt-in group of users, and begins by aggregating data from them. Using a portion of the data, it constructs a candidate head list of records in a differentially private manner that approximates the most common search records in the population. It additionally includes a single "wildcard" record, ⟨?, ?⟩, which represents all records in the population that weren't previously included in the candidate head list. It then uses the remainder of the opt-in data to estimate the probability of each record in the candidate head list in a differentially private manner, and (optionally) trims the candidate head list down to create the final head list. The result of this component of Blender is the privatized trimmed head list of search records and their corresponding probability and variance estimates, which can be shared with each user in the client group, and with the world.

Each member of the client group receives the privatized head list obtained from the opt-in group. Each client then uses the head list to apply a differential privacy-preserving perturbation to their data, subsequently reporting their perturbed results to Blender. Blender then aggregates all the clients' reports and, using a statistical denoising procedure, estimates both the probability for each record in the head list as well as the variance of each of the estimated probabilities based on the clients' data.

For each record, Blender combines the record's probability estimates obtained from the two groups. It does so by taking a convex combination of the groups' probability estimates for each record, carefully weighted based on the record's variance estimate in each group. The combined result under this weighting scheme yields a better probability estimate than either group is able to achieve individually. Finally, Blender outputs the obtained records
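The variance-weighted convex combination described above can be illustrated with inverse-variance weighting, which minimizes the variance of a combination of independent unbiased estimates; this is a minimal sketch, not Blender's exact formula.

```python
# Illustrative sketch of variance-weighted blending for one record:
# weight each group's estimate by the *other* group's variance, so the
# lower-variance estimate dominates (inverse-variance weighting).
def blend(p_opt, var_opt, p_client, var_client):
    w = var_client / (var_opt + var_client)  # weight on the opt-in estimate
    return w * p_opt + (1 - w) * p_client

# The opt-in estimate, having far lower variance, dominates the blend.
print(blend(p_opt=0.010, var_opt=1e-8, p_client=0.014, var_client=1e-6))
```

With equal variances the blend reduces to a plain average, and as one variance shrinks toward zero the blend converges to that group's estimate.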
records collected from each user and the distribution of the privacy budget among the sub-algorithms' sub-components) can be viewed as knobs the designer of Blender is at liberty to tweak in order to improve the overall utility of Blender's results.

2.2 Overview of Blender Sub-Algorithms

We now present the specific choices we made for the sub-algorithms in Blender. Detailed technical discussions of their properties follow in Section 3.

Algorithms for Head List Creation and Probability Estimation Based on Opt-in User Data

    ...ties(ε, δ, T, mO, HLS, M) be the refined head list of records, their estimated probabilities, and estimated variances based on opt-in users' data.
    4: let ⟨p̂C, σ̂C²⟩ = EstimateClientProbabilities(ε, δ, C, mC, fC, HL) be the estimated record probabilities and estimated variances based on client reports.
    5: let p̂ = BlendProbabilities(p̂O, σ̂O², p̂C, σ̂C², HL) be the combined estimate of record probabilities.
    6: return HL, p̂.

Figure 2: Blender, the server algorithm that coordinates the privatization, collection, and aggregation of data from all users.
query list as a whole is then normalized by the analogous Ideal DCG (IDCG^Q_k) – the DCG^Q_k if the estimated query list had the exact same ordering as the actual query list.

Compared to the traditional NDCG definition, the additional discounting within DCG^Q_k makes it even harder to attain high NDCG values than in the query-only case. Contrasted with the L1 measure, this formulation takes both the ranking and the probabilities from the data set into account. Since changes to the probabilities may not result in ranking changes, L1 is an even less forgiving measure than NDCG.

Since the purpose of Blender is to estimate probabilities of the top records, we discard the artificially added ? queries and URLs and rescale rel_i prior to the L1 and NDCG computations. However, since we use the method of [39] in BlendProbabilities, the probability estimates involving ? have a minor implicit effect on the L1 and NDCG scores.

Figure 9: Top-10 most popular queries in the AOL dataset, their empirical probabilities p_q in the first numeric column, Blender's probability estimates p̂_q in the next column, and the various sub-components' estimates in the remaining columns. Parameter choices are shown in Figure 10.

...URL for a given query (columns 4 and 6). The sample variance of these aggregated probabilities, used for blending, is naively computed as in Theorem 4. In addition to estimating the record probabilities, the client algorithm estimates query probabilities directly, which are shown in column 5. Regressions, i.e., estimates that appear out of order relative to column 2, are shown in red.

Takeaways: The biggest takeaway is that the numbers in columns 2 and 3 are similar to each other, with only one regression after Blender's usage. Blender compensates for the weaknesses of both the opt-in and the client estimates. Despite the sub-components having several regressions, their combination has only one.
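The NDCG computation this evaluation builds on can be sketched generically; the relevance scores here stand in for the probability-based relevances discussed above, and the extra discounting of the paper's DCG^Q_k variant is deliberately not reproduced.

```python
# Generic NDCG@k sketch: `rel` lists relevance scores in the estimated
# ranking; the ideal DCG uses the same scores sorted descending.
import math

def dcg(rel):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rel))

def ndcg(rel, k):
    ideal = dcg(sorted(rel, reverse=True)[:k])
    return dcg(rel[:k]) / ideal if ideal > 0 else 0.0

# Mis-ranking two items with *similar* relevance barely hurts NDCG,
# which is the behavior the probability-based relevance choice rewards.
print(ndcg([0.30, 0.29, 0.10], k=3))  # correctly ordered
print(ndcg([0.29, 0.30, 0.10], k=3))  # two similar items swapped
```

This illustrates why a probability-based relevance penalizes swaps of near-equal items only slightly, whereas a rank-based relevance would penalize any mis-ranking equally.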
4.1 Experimental Setup

Data sets: For our experiments, we use the AOL search logs, first released in 2006, and the order-of-magnitude bigger Yandex search data set4, from 2013. Figure 8 compares their characteristics.

Data analysis: To familiarize the reader with the approach we used for assessing result quality, Figure 9 shows the top-10 most frequent queries in the AOL data set, with the estimates given by the different "ingredients" of Blender.

The table is sorted by column 2, which contains the non-private, empirical probabilities pq for each query q from the AOL data set with 1 random record sampled from each user. We consider this as the baseline for the true, underlying probability of that query. Column 3 contains the final query probability estimates output by Blender, p̂q, after combining the estimates from the opt-in group and clients. The remaining columns show the estimates produced by the sub-components of Blender that are eventually combined to form the estimates in column 3. As the opt-in and client sub-components compute probability estimates over the records in the head list, we obtain query probability estimates by aggregating the probabilities associated with each query.

The table also provides intuition for the usefulness of a two-stage reporting process in the client algorithm (first report a query and then the URL), thus allowing for separate estimates of query and record probabilities. Specifically, despite the high number of regressions for the client algorithm's aggregated record probability estimates (column 6), its query probability estimates (column 5) have only one.

4.2 Experimental Results

We formulate questions for our evaluation as follows: how to choose Blender's parameters (Section 4.2.1), how does Blender perform compared to alternatives (Section 4.2.2), and how robust are our findings (Section 4.2.3)?

4.2.1 Algorithmic and Parameter Choices

Blender has a handful of parameters, some of which can be viewed as given externally (by the laws of nature, so to speak), and others whose choice is purely up to the entity utilizing Blender. We now describe and, whenever possible, motivate our choices for these.

Privacy parameters ε and δ: Academic literature on differential privacy views the selection of the parameter ε as a "social question" [9] and thus
in the 30% range, at best. We believe that, given the intense focus on search optimization in the field of information retrieval, NDCG values as low as those of Qin et al. are generally unusable, especially for such a small head list size. Overall, Blender significantly outperforms what we believe to be the closest [...] and this work use slightly different scoring functions. The former's relevance score is based on the rank of queries in the original AOL data set, which results in penalizing mis-ranked queries regardless of how similar their underlying probabilities may be. Blender's relevance score relies on the underlying probabilities, so mis-ranked items with similar underlying probabilities have only a small negative impact on the overall NDCG score; we believe this choice is justified. Although it yields increased NDCG scores, Blender operates on records (rather than queries, as Qin et al. does). Because of this, the generalized NDCG score used to evaluate Blender (Section 4) is a strictly less forgiving metric than [...]

Figure 11: Comparing AOL data set results across a range of budget splits for client, opt-in, and blended results.

[Figure: NDCG of Blender vs. CCS'16 for ε from 1 to 5; caption not recovered.]
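To illustrate why probability-based relevance penalizes near-ties only lightly, here is a minimal NDCG sketch; the probabilities are hypothetical and the scoring function is a simplified stand-in for the generalized NDCG defined in Section 4:

```python
import math

def ndcg(estimated_ranking, true_probs):
    """NDCG where an item's relevance is its true underlying
    probability, so swapping items with similar probabilities
    costs little, while large displacements cost a lot."""
    def dcg(ranking):
        return sum(true_probs[item] / math.log2(i + 2)
                   for i, item in enumerate(ranking))
    ideal = sorted(true_probs, key=true_probs.get, reverse=True)
    return dcg(estimated_ranking) / dcg(ideal)

# Hypothetical true probabilities, not values from the paper.
probs = {"a": 0.50, "b": 0.30, "c": 0.29, "d": 0.01}
# Swapping the near-tied items "b" and "c" barely lowers the score;
# promoting the rare item "d" to the top costs far more.
near_miss = ndcg(["a", "c", "b", "d"], probs)
bad = ndcg(["d", "a", "b", "c"], probs)
print(near_miss, bad)
```

A rank-based relevance score, by contrast, would charge the same penalty for both kinds of swap, regardless of how close the underlying probabilities are.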
Figure 13: L1 statistics as a function of the opt-in percentage for select head list sizes.

Figure 14: NDCG statistics as a function of the opt-in percentage for select head list sizes.
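For reference, the L1 statistic plotted in these figures is the sum of absolute differences between the estimated and empirical probability vectors; a minimal sketch with hypothetical values:

```python
def l1_distance(estimated, empirical):
    """Sum of absolute differences between two probability-estimate
    vectors, taken over the union of their keys (a missing key
    counts as a zero estimate)."""
    keys = set(estimated) | set(empirical)
    return sum(abs(estimated.get(k, 0.0) - empirical.get(k, 0.0))
               for k in keys)

# Hypothetical estimates, not values from the paper.
est = {"weather": 0.07, "maps": 0.05}
emp = {"weather": 0.08, "maps": 0.04, "news": 0.02}
print(l1_distance(est, emp))
```

Lower is better; note that longer estimate vectors give the metric more terms over which error can accumulate.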
Figure 15 shows the L1 values as a function of ε, ranging from 1 to 5. For both data sets, we see a steady decline in the L1 metric, despite aggregating L1 values over longer estimate vectors. With more data in the Yandex data set, we are able to hit small values of L1 (under 0.1) with ε ≥ 1. Similar to the case with small opt-in percentages, having too small an ε makes it difficult to achieve head lists of their target size; e.g., in Figure 15a, the line for a head list of size 50 does not begin until ε = 3 because that size head list was not created with a smaller ε value.

Evaluation of local search computation: Figure 14 shows the NDCG measurements as a function of the opt-in percentage ranging between 1% and 10%. The results are quite encouraging; for the smaller AOL data set, for instance, we need an opt-in level of ≈5% to achieve an NDCG level of 95%, which we regard as acceptable. However, for the larger Yandex data set, we hit that NDCG level even sooner: the NDCG value for 1.5% is above 95% for all but the largest head list size.

Figure 16 shows how the NDCG values vary across the two data sets for a range of head list sizes and ε values. We see a clear trend toward higher NDCG values for Yandex, which is not surprising given the sheer volume of data. For the Yandex data set, we can keep ε as low as 1 and still achieve NDCG values of 95% and above for all but the two largest head list sizes. For those, we must increase ε in order to generate larger head lists from the opt-in users.

Figure 15: L1 statistics for AOL and Yandex data sets as a function of ε for select head list sizes.

Figure 16: NDCG statistics for AOL and Yandex data sets as a function of ε for select head list sizes.

5 Related Work

Algorithms for the trusted curator model: Researchers have developed numerous differentially private algorithms operating in the trusted curator model that result in useful data for a variety of applications. For example, [24, 26, 16, 31] address the problem of publishing a subset of the data contained in a search log with differential privacy guarantees; [27] and [6] propose approaches for frequent item identification; [14] proposes an approach for monitoring aggregated web browsing activities; and so on.

Algorithms for the local model: Although the demand for privacy-preserving algorithms operating in the local model has increased in recent years, particularly among practitioners [17, 35], fewer such algorithms are known [40, 19, 8, 13, 5]. Furthermore, the utility of the resulting data obtained through these algorithms is significantly limited compared to what is possible in the trusted curator model, as shown experimentally [15, 21] and theoretically [22]. The recent work of [34] also takes a two-stage approach: first, spend some part of the privacy budget to learn a candidate head list, and then use the remaining privacy budget to refine the probability estimates of the candidates. However, that is where the similarities with Blender end, as [34] focuses entirely on the local model (and thus has to use entirely different algorithms from ours for each stage) and addresses the problem of estimating probabilities of queries, rather than the more challenging problem of estimating probabilities of query-URL pairs.

Our contribution: Our work significantly improves upon the known results by developing application-specific local privatization algorithms that work in combination with trusted curator model algorithms. Specifically, our insight of providing all users with differential privacy guarantees, but achieving it differently depending on whether or not they trust the data curator, enables an efficient privacy-preserving head list construction. The subsequent usage of this head list in the algorithm operating in the local model helps overcome one of the main challenges to the utility of privacy-preserving algorithms in the local model [15]. Moreover, the weighted aggregation of probability estimates obtained from the algorithms operating in the two models (which explicitly factors in the amount of noise each contributed) enables remarkable utility gains compared to using one algorithm's estimates alone. As discussed in Section 4.2.2, we significantly outperform the most recently introduced local algorithm of [34] on metrics of utility in the search context.

6 Discussion

Operating in the hybrid model is most beneficial utility-wise if the opt-in user records and client user records come from the same distribution – i.e., the two groups have fairly similar observed search behavior. If that is not the case, the differential privacy