
Machine Learning Based Cyber Attacks Targeting

on Controlled Information

By Yuantian Miao

A thesis submitted in fulfillment of the requirements for the
degree of Doctor of Philosophy

Faculty of Science, Engineering and Technology (FSET)


Swinburne University of Technology

May 2021
Abstract

With the rapid development of machine learning (ML) techniques, cyber attacks increasingly utilize ML algorithms to achieve high success rates and cause significant damage. In particular, attacks against ML models, along with the growing number of ML-based services, have become one of the most pressing cyber security threats in recent years. We review ML-based stealing attacks in terms of the targeted controlled information, including controlled user activities, controlled ML model-related information, and controlled authentication information. An overall attack methodology is extracted and summarized from recently published research. When the ML model is the target, the attacker can steal model information or mislead the model's behaviour. Model information stealing attacks can steal the model's structure information or its training set information. Targeting Automatic Speech Recognition (ASR) systems, membership inference methods are studied to determine whether the model's training set can be inferred at the user level, especially under black-box access. Under label-only black-box access, we analyse users' statistical information to improve user-level membership inference results. When even the label is not provided, Google search results are collected instead, and fuzzy string matching techniques are utilized to improve membership inference performance. Beyond inferring training set information, understanding the model's structure information enables an effective adversarial ML attack. The Fast Adversarial Audio Generation (FAAG) method is proposed to generate targeted adversarial examples quickly. By injecting noise over the beginning part of the audio, FAAG speeds up the adversarial example generation process by around 60% compared with the baseline method. In accordance with these attack methodologies, the limitations and future directions of ML-based cyber attacks are presented. Current countermeasures are also summarized and discussed, given the urgent need for adequate protection. Related code and resources for the work presented in this thesis are organized on GitHub1.
Voice interfaces and assistants implemented by various services have become increasingly sophisticated, powered by the increased availability of data. However, users' audio data needs to be guarded while enforcing data-protection regulations such as the GDPR and COPPA. To check for unauthorized use of audio data, we propose an audio auditor for users to audit speech recognition models. Specifically, users can check whether their audio recordings were used as members of the model's training dataset. We focus our work on a DNN-HMM-based ASR model over the TIMIT audio data. As a proof-of-concept, the success rate of participant-level membership inference can reach up to 90% with eight audio samples per user, resulting in a practical audio auditor.
1 https://github.com/skyInGitHub/PhD_thesis

We further examine user-level membership inference in the problem space of voice services by designing an audio auditor to verify whether a specific user had unwillingly contributed audio used to train an ASR model under strict black-box access. With user representations of the input audio data and their corresponding translated text, our trained auditor is effective in user-level auditing. We also observe that an auditor trained on specific data generalizes well regardless of the ASR model architecture. We validate the auditor on ASR models trained with LSTM, RNN, and GRU algorithms on two state-of-the-art pipelines, the hybrid ASR system and the end-to-end ASR system. Finally, we conduct a real-world trial of our auditor on iPhone Siri, achieving an overall accuracy exceeding 80%.

To broaden the assumptions, we examine user-level membership inference targeting ASR models within voice services under no-label black-box access. Specifically, we design a user-level audio auditor to determine whether a specific user had unwillingly contributed audio used to train the ASR model when the service only reacts to the user's query audio without providing the translated text. With user representations of the input audio data and the corresponding system reactions, our auditor performs effective user-level membership inference. Our experiments show that the auditor performs better with more training samples and with more audio samples per user. We evaluate the auditor on ASR models trained with different algorithms (LSTM, RNN, and GRU) on the hybrid ASR system (PyTorch-Kaldi). We hope the methodology developed in this thesis and our findings can inform privacy advocates seeking to overhaul IoT privacy.
Apart from the membership inference attack, ASRs inherit deep neural networks' vulnerabilities, such as crafted adversarial examples. Existing methods often suffer from low efficiency because the target phrases are added over the entire audio sample, resulting in high demand for computational resources. This thesis therefore proposes a novel scheme named FAAG, an iterative optimization-based method to generate targeted adversarial examples quickly. By injecting noise over the beginning part of the audio, FAAG generates high-quality adversarial audio with a high success rate in a timely manner. Specifically, we use the audio's logits output to map each character in the transcription to an approximate position in the audio's frames. Thus, an adversarial example can be generated by FAAG in approximately two minutes using CPUs only, and in around ten seconds with one GPU, while maintaining an average success rate above 85%. The FAAG method speeds up the adversarial example generation process by around 60% compared with the baseline method. Furthermore, we found that appending benign audio to any suspicious example can effectively defend against the targeted adversarial attack. We hope this work paves the way for new adversarial attacks against speech recognition under computational constraints.

Acknowledgements

I would like to express my sincere gratitude to my supervisors, Prof. Yang Xiang, A/Prof. Jun Zhang, Dr. Lei Pan and Dr. Chao Chen, who have guided my research with their broad knowledge and patience throughout my PhD. Without their professional guidance, support and encouragement, this thesis would not have been possible.
I would like to acknowledge Prof. Qinglong Han, Prof. Dali Kaafar, Dr. Minhui Xue and Mr. Benjamine Zi Hao Zhao for their constant support, advice and wonderful collaboration. I would also like to thank my colleagues Dr. Guanjun Lin and Mr. Rory Coulter for their constant support and advice.
Finally, I would like to thank my family and friends for their unending support and encouragement. At the end of this part, I especially want to remember my beloved grandfather with all my gratitude and regret. This year he stayed seventy-seven years old forever, and this thesis is the only thing I can offer in return for all his love and care for me.

Swinburne Research

Declaration

This thesis contains no material which has been accepted for the award to the candidate of any other degree or diploma, except where due reference is made in the text of the thesis; to the best of the candidate's knowledge, this thesis contains no material previously published or written by another person except where due reference is made in the text of the thesis; and where the work is based on joint research or publications, the thesis discloses the relative contributions of the respective workers or authors.

Name: Yuantian Miao

Signature:

Date: 06 / 05 / 2021

List of Publications

• Yuantian Miao, Chao Chen, Lei Pan, Qing-Long Han, Jun Zhang, and Yang Xiang. 2021. Machine Learning-based Cyber Attacks Targeting on Controlled Information: A Survey. ACM Comput. Surv. 54, 7, Article 139 (July 2021), 36 pages. DOI: https://doi.org/10.1145/3465171

• Yuantian Miao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Benjamine Zi Hao Zhao, Dali Kaafar, Yang Xiang. “The audio auditor: user-level membership inference in Internet of Things voice services”, Proceedings on Privacy Enhancing Technologies (PoPETs). 2021;2021:209-28.

• Yuantian Miao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Benjamine Zi Hao Zhao, Dali Kaafar, Yang Xiang. “The audio auditor: participant-level membership inference in Internet of Things voice services”, Privacy Preserving Machine Learning, ACM CCS 2019 Workshop.

• Yuantian Miao, Chao Chen, Lei Pan, Jun Zhang and Yang Xiang. “FAAG: Fast Adversarial Audio
Generation through Interactive Attack Optimisation”, IEEE Transactions on Computers, accepted
on 21/04/2021, in press.

Contents

Abstract i

Acknowledgements iii

Declaration iv

Complete Work and List of Publications vi

1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Literature Review 6
2.1 ML-based Stealing Attack Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Reconnaissance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.4 Attacking the Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Stealing ML Model Related Information . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Stealing controlled ML model description . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Stealing controlled ML model’s training data . . . . . . . . . . . . . . . . . . . 16
2.2.3 ML-based Attack about Audio Adversarial Examples Generation . . . . . . . . 23
2.3 Stealing User Activities Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Stealing controlled user activities from kernel data . . . . . . . . . . . . . . . . 24
2.3.2 Stealing controlled user activities using sensor data . . . . . . . . . . . . . . . . 26
2.4 Stealing Authentication Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 Stealing controlled keystroke data for authentication . . . . . . . . . . . . . . . 27
2.4.2 Stealing controlled secret keys for authentication . . . . . . . . . . . . . . . . . 29
2.4.3 Stealing controlled password data for authentication . . . . . . . . . . . . . . . 30
2.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 The Audio Auditor: User-Level Membership Inference with Black-Box Access 34

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 The Automatic Speech Recognition Model . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Deep Learning for Acoustic Models . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Membership Inference Attack . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Auditing the ASR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2 Overview of the Audio Auditor . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Experiment and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 Target Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4 The Audio Auditor: Label-Only User-Level Membership Inference in Internet of Things Voice Services 44
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 The Automatic Speech Recognition Model . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Membership Inference Attack . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Auditing the ASR Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.2 Overview of the Proposed Audio Auditor . . . . . . . . . . . . . . . . . . . . . 49
4.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Experimental Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Effect of the ML Algorithm Choice for the Auditor . . . . . . . . . . . . . . . . 52
4.4.2 Effect of the Number of Users Used in Training Set of the Auditor . . . . . . . . 52
4.4.3 Effect of the Target Model Trained with Different Data Distributions . . . . . . . 53
4.4.4 Effect of the Number of Audio Records Per User . . . . . . . . . . . . . . . . . 54
4.4.5 Effect of Training Shadow Models across Different Architectures . . . . . . . . 56
4.4.6 Effect of Noisy Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.7 Effect of Different ASR Model Pipelines on Auditor Performance . . . . . . . . 59
4.4.8 Real-World Audit Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Threats to Auditors’ Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5 The Audio Auditor: No-Label User-Level Membership Inference in Internet of Things Voice Services 68
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.2.1 The Automatic Speech Recognition (ASR) Model . . . . . . . . . . . . . . . . 70
5.2.2 Membership Inference Attack on ASRs . . . . . . . . . . . . . . . . . . . . . . 71
5.3 No-Label Audio Auditor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.2 No-Label User-Level Membership Inference . . . . . . . . . . . . . . . . . . . 73
5.4 Experimental Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.2 User-Level Auditor with No-Label Black-box access . . . . . . . . . . . . . . . 77
5.4.3 Model Independent User-Level Auditor . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6 FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation 80


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.1 The Automatic Speech Recognition Model . . . . . . . . . . . . . . . . . . . . 82
6.2.2 Adversarial Attack on ASRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3 Generating Audio Adversarial Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.1 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.2 Fast Adversarial Audio Generation (FAAG) . . . . . . . . . . . . . . . . . . . . 85
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.2 Proper Frame Length Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.3 Effectiveness and Efficiency Analysis . . . . . . . . . . . . . . . . . . . . . . . 95
6.4.4 Summary in Speed Advantage . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.5 Discussion on Different Position of Adversarial Audio Clip . . . . . . . . . . . . . . . . 99
6.5.1 Different Position of Adversarial Audio Clip . . . . . . . . . . . . . . . . . . . 99
6.5.2 Countermeasures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5.3 Transferable FAAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7 Research Challenges and Future Work 103


7.1 Attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.1 Reconnaissance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1.4 Attacking the Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2 Defense . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2.1 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2.2 Disruption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2.3 Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 Research Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

8 Conclusion 110

Bibliography 112

Appendix A. Authorship Indication Form 131

Chapter 1

Introduction

Driven by the need to protect the enormous value within data and the reality of emerging data mining techniques, information leakage has become a growing concern for governments, organizations and individuals [1]. Compromising the confidentiality of protected information constitutes an information leakage incident, which is a prominent cyber security threat [2]. The leakage of sensitive information results in both financial and reputational damage to organizations [3]. According to [4], the average number of registered data leaks increased by 36.9% from 2016 to 2017, while both the volume of leaked data and the number of leakage incidents reached the highest levels recorded in the past seven years. As reported in [5], the global average cost of information leakage rose to $3.86 million, 6.4% above that of 2017. Due to the rapid digitization of our work and life, worldwide losses from data breaches were predicted to reach $2.1 trillion by 2019 [6]. Thus, information leakage incidents are indeed an urgent threat that deserves public attention.
This thesis first introduces the stealing attack in the cyber security area. According to [7], information leakage can be defined as the violation of the confidentiality of methods/mechanisms/frameworks which store information or have access to information. In other words, the introduced attack aims at stealing controlled information. According to the cyber attack definition in [8], the term “controlled” implicitly means “protected”. Compared with attacks that compromise a computing environment/infrastructure or data integrity, the controlled information stealing attack is more difficult to detect in advance. Cyber attacks which “disrupt, disable, destroy, or maliciously control a computing environment/infrastructure and destroy the integrity of the data” [8] are out of the scope of this thesis; for example, a DDoS attack leaking customer data [9] is not reviewed. According to the literature collected between 2014 and 2019, there are three common vulnerabilities subject to controlled information stealing attacks:

1. User activity information, especially information stored on mobile devices. For example, [10] extracted the user's foreground app running on Android in order to exploit it for a phishing attack, even though the user activity information was protected by a nonpublic system-level permission [11].

2. ML models and their training data, particularly those hosted on Machine-Learning-as-a-Service (MLaaS) systems. For instance, an ML model is confidential due to the pay-per-query deployment in a cloud-based ML service [12], as well as the security mechanisms contained in spam/fraud detection applications [13, 14, 15, 16].

3. Authentication information such as keystroke information, secret keys, and passwords.

As a fast-growing technology in recent years, ML techniques are applied widely in various cyber security areas. MLaaS [17] has been proposed to help users with limited computing resources or limited ML knowledge to use ML models. In this thesis, the ML-based stealing attack is defined as follows: an attacker utilizes an ML algorithm to build a computational model in order to disclose controlled information, while the raw dataset is collected in legitimate ways. This definition covers two attack modes. In the first attack mode, attackers build an ML model as a tool to perform an accurate and efficient stealing attack, where the output of the model is the targeted controlled information. In the second attack mode, the ML model itself is the target; building up the model means reconstructing the targeted controlled information, i.e., the model within an MLaaS platform, which is also known as a model reconstruction attack [18]. These two types of ML-based stealing attacks are summarized in this thesis. Other attacks, which leak controlled information without applying ML techniques, have been surveyed in [19] and [20]. Furthermore, [21] applied malware to leak password files, while [22] proposed an eavesdropping attack to increase the information leakage rate without using ML algorithms. This thesis investigates the ML-based stealing attack.
Beyond the stealing attack, we further investigate an ML-based attack concerning adversarial example generation. Since adversarial example generation achieves a high success rate with white-box access to the target model, a stealing attack against ML-related information can boost this attack. Herein, adversarial examples are generated by adding imperceptible noise to a benign sample, causing the target model to predict incorrect answers.

Figure 1.1: Introduced Stealing Controlled Information Attack Categories. (Info: information)

1.1 Contributions
This thesis introduces an emerging threat of stealing controlled information and surveys recent trends in this kind of stealing attack and its countermeasures. As a representative ML-based stealing attack, we further investigate membership inference methodologies against machine learning models. An adversarial machine learning attack under white-box access is conducted to further illustrate the threat of stealing controlled ML model-related information. Our contributions can be itemized as follows:

• The ML-based stealing attack, which aims at stealing controlled/protected information and can lead to huge economic losses, is introduced. Herein, ML algorithms are applied in the attack to increase the success rate in various respects. A classification of ML-based stealing attacks is built primarily according to the targeted controlled information. Based on this classification, the vulnerabilities in various systems and the corresponding attacks are sorted out and revealed.

• A general methodology for the ML-based stealing attack against controlled information is generalized into five phases: reconnaissance, data collection, feature engineering, attacking the objective, and evaluation. The methodology highlights the similarity of these attacks from strategic and technical perspectives. The public datasets used for the attack analysis are also summarized and referenced correspondingly.

• The Audio Auditor: User-Level Membership Inference with Black-Box Access. With black-box access to a target Automatic Speech Recognition (ASR) system, we propose an audio auditor to audit whether a specific user unwillingly contributed his/her audio recordings to train the ASR model. We focus our work on a DNN-HMM-based ASR model over the TIMIT audio data [23]. Herein, the DNN-HMM-based ASR model uses a Deep Neural Network to train its acoustic model and a Hidden Markov Model to map the acoustic model's output to a sequence of text. As a proof-of-concept, user-level membership inference can reach up to 90% accuracy with eight audio samples per user.

• The Audio Auditor: Label-Only User-Level Membership Inference in Internet of Things Voice Services. Different from the previous work, our auditor audits an ASR model under black-box access without being provided any confidence score. Specifically, the translated text is the only output of the target model. With user representations of the input audio data and their corresponding translated text, our trained auditor is effective in user-level auditing. We also observe that an auditor trained on specific data generalizes well regardless of the ASR model architecture. We validate the auditor on ASR models trained with Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), and Gated Recurrent Unit (GRU) algorithms on two state-of-the-art pipelines, the hybrid ASR system and the end-to-end ASR system. Finally, we conduct a real-world trial of our auditor on iPhone Siri, achieving an overall accuracy exceeding 80%.

• The Audio Auditor: No-Label User-Level Membership Inference in Internet of Things Voice Services. We broaden the assumption from label-only black-box access to no-label black-box access. In this setting, the service only reacts to the user's query audio without providing the translated text. With user representations of the input audio data and the corresponding system reactions, our auditor performs effective user-level membership inference. Our experiments show that the auditor performs better with more training samples and with more audio samples per user. The highest AUC score reaches 73%, which is better than random guessing. Herein, AUC is the Area Under the Curve, used to measure the model's separability.

• FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation. FAAG is an iterative optimization-based method to generate targeted adversarial examples quickly. By injecting noise over the beginning part of the audio, FAAG generates high-quality adversarial audio with a high success rate in a timely manner. Specifically, we use the audio's logits output to map each character in the transcription to an approximate position in the audio's frames. Thus, an adversarial example can be generated by FAAG in approximately two minutes using CPUs only, and in around ten seconds with one GPU, while maintaining an average success rate above 85%. The FAAG method speeds up the adversarial example generation process by around 60% compared with the baseline method. Furthermore, we found that appending benign audio to any suspicious example can effectively defend against the targeted adversarial attack.

By improving our knowledge of this emerging attack, the ultimate purpose of this thesis is to safeguard information thoroughly. In the information era, the leakage of information, especially information that is already controlled, results in tremendous damage to both corporations and individuals [3, 4, 5, 6]. This thesis reveals that current protections cannot fully suppress existing ML-based stealing attacks. As discussed in Chapter 7, in the near future, the protection of controlled information can be improved by detecting the access states of related data, disrupting the related data while retaining considerable utility, and isolating the related data from being accessed.

1.2 Structure
This thesis investigates ML-based cyber attacks targeting controlled information. Cyber attacks can be broadly categorised as targeting confidentiality, integrity, or availability; this thesis mainly focuses on confidentiality and integrity. For the confidentiality investigation, the ML-based stealing attack is summarized as one of the most popular attacks within this research topic. Adversarial machine learning under white-box access is selected as the representative attack on integrity. Additionally, adversarial machine learning is considered a follow-up attack to the ML-based stealing attack. Therefore, the thesis first presents a literature review of the ML-based stealing attack, in which the adversarial machine learning attack is reviewed as a follow-up attack under one category of the ML-based stealing attack. After that, three different kinds of membership inference attacks are conducted against a popular type of ML model, namely ASR models. One adversarial ML attack is conducted against the ASR model under white-box access.
The rest of this thesis is organized as follows:

• The stealing attack methodology is summarized in Chapter 2, which presents a literature review of stealing attacks using ML algorithms over the past five years; the stealing attacks are reviewed in three categories classified by the type of targeted controlled information. Accordingly, the review of audio adversarial examples is summarized under the category of stealing ML model related information.

• Chapter 3 presents our user-level membership inference method against an automatic speech recognition (ASR) system under black-box access. An assumption is made under this black-box access to simplify the attack: the model's outputs include the transcription text and its corresponding probabilities.

• Chapter 4 describes our work published in PoPETs, entitled "The audio auditor: user-level membership inference in Internet of Things voice services". This work proposes a user-level membership inference method against an ASR system under label-only black-box access, which relaxes the assumption made in the previous chapter. Specifically, the model's output under label-only black-box access only includes the transcription text.

• Chapter 5 presents the user-level membership inference method against an ASR system in voice services under no-label black-box access, which further relaxes the assumption. Specifically, under no-label black-box access, not only the probabilities but also the transcription text remain unknown to users and the attacker.

• Chapter 6 is our work accepted by IEEE Transactions on Computers. This work proposes a fast adversarial audio generation method under white-box access. Since membership inference can help attackers extract the training set of the target model and further reveal its details, this attack is considered a follow-up attack to our previous research.

• In Chapter 7, the challenges of the ML-based stealing attack are discussed, followed by corresponding future directions.

• Finally, Chapter 8 concludes the thesis.

Chapter 2

Literature Review

In this chapter, the literature related to the thesis topic, ML-based cyber attacks, is reviewed and summarized, covering the ML-based stealing attack and the adversarial machine learning attack. An overall methodology for the ML-based stealing attack is summarized first. Then three different types of stealing attacks are reviewed and discussed following the overall methodology: stealing ML model related information, stealing user activities information, and stealing authentication information.
This chapter reviews the core papers on the Machine Learning Based Stealing Attack (MLBSA), including the ML-based attack against ML models. All tables highlight the essential elements of each ML-based attack. The attack methods and the corresponding countermeasures are discussed. Detailed information about the datasets and source code for these attacks is listed on GitHub1.

2.1 ML-based Stealing Attack Methodology


This section presents the attack methodology for stealing controlled information utilizing ML techniques, as shown in Figure 2.1, named the MLBSA methodology. The cyber kill chain [24, 25], a traditional model for cyber security threat analysis, is revised and used to model this attack methodology. A typical kill chain consists of seven stages: reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives [26, 27]. Reconnaissance aims to identify the target by assessing the environment; the prior knowledge gained here can guide data collection. Regarding the ML-based stealing attack, weaponization corresponds to data collection. Extracting useful information via feature engineering is essential. Using supervised learning, the ML-based model is built as a weapon for taking actions on objectives. Moreover, the ML-based stealing attack may keep improving its performance and accumulate knowledge gained from its retrieved results. The other stages of the kill chain, including delivering the weapon to the victim, exploiting the vulnerabilities, installing the malware, and using command channels for remote control [27], are considered a preparation phase before attacking the objectives. In this thesis, the preparation phase is named feature engineering. Having consolidated a few steps of the kill chain, the MLBSA methodology consists of five phases, which are organized in a circular form implying a continuous and incremental process. The five phases of the MLBSA methodology are 1) reconnaissance, 2) data collection, 3) feature engineering, 4) attacking the objective, and 5) evaluation. The following subsections illustrate each phase in detail.
1 https://github.com/skyInGitHub/Machine-Learning-Based-Cyber-Attacks-Targeting-on-Controlled-Information-A-Survey

Figure 2.1: ML-based stealing attack methodology (abbreviated as MLBSA methodology).

2.1.1 Reconnaissance
Reconnaissance refers to a preliminary inspection of the stealing attack. The two aims of this inspection
include defining adversaries’ targets and analyzing the accessible data in order to facilitate the forthcoming
attacks.
The target of adversaries in the published literature is usually the confidential information controlled by systems and online services. According to Kissel [8] and Dukes [28], the term “information” is defined as “the facts and ideas which can be represented as various forms of data, within which the knowledge in any medium or form are communicated between system entities”. For example, an ML model (e.g. a prediction model) represents the knowledge of the whole training dataset and can act as a service returning results for any query [29, 12]. Thereby, controlled information can be interpreted as information stored, processed, and communicated in a controlled area for which the organization or individuals are confident that their protections are sufficient to secure its confidentiality. It is more difficult to detect an attack against confidentiality than one against integrity or availability; such information stealing attacks are often referred to as the “unknown unknowns”.
In this thesis, the targeted controlled information is classified into three categories: user activities information, ML related information, and authentication information. Herein, user activities on a mobile system, such as which app is running in the foreground, are considered sensitive information. Such sensitive information should be protected against security threats like phishing [10]. ML models can be provided on the Internet as a service to analyze big data and build predictive models [29, 12], such as Google Prediction API and Amazon SageMaker. In this scenario, both the model and the training data are considered confidential. However, when an ML service allows white-box access from users, only the training data is considered confidential information. Passwords and secret keys unlocking mobile devices and authenticating online services should always be stored securely [30, 31, 32, 33]. Using ML to infer the password from a user's keystrokes breaks information confidentiality [34, 35].
Since the information that adversaries aim to steal is controlled, the accessible data is the breakthrough point. In order to analyze its value, the attacker acts as a legitimate user to learn the characteristics and capabilities of the targeted systems, especially those related to the controlled information. During the reconnaissance of accessible data, the attacker needs to search all possible entry points of the targeted system, reachable data paths, and readable data [36]. When the attacker aims at the user's activities, the triggered hardware devices and their corresponding logged information are investigated [10, 36, 37].
For example, the attacker searches and explores readable system files, such as interrupt timing data [10, 36] and network resources [37]. To perform a successful stealing attack against a model [29, 12] or training samples [38, 18, 39], the functionalities of ML services (e.g. Amazon ML) are analyzed by querying specific inputs. The attacker analyzes the relationship between the inputs and the outputs, including the output labels, their corresponding probabilities (also known as confidence values), and the top-ranked information [18, 38, 12]. This relationship reveals some internal information about the target model and/or training samples. For authentication information stealing attacks, stealing keystroke information requires utilizing sensor activity information activated by the attacker, while the intermediate data can be regarded as accessible data [34, 35]. Fine-grained information about security domains, e.g. secret keys, can be inferred by analyzing accessible cache data [30, 40]. The accessible data related to the target information are defined in this phase.

2.1.2 Data Collection


Having conducted a detailed reconnaissance process, the attacker refines the scope of the targeted controlled information alongside an awareness of the related accessible data. The attacker then designs specific queries against the target system/service to collect useful accessible data. What differentiates the information stealing attack from other forms of cyber attacks is that the datasets are collected in a seemingly legitimate manner rather than through overtly malicious channels. In accordance with the intelligence gained during the reconnaissance phase, data collection can be either active or passive.
Active collection means that the attacker actively interacts with the targeted system for data collection. Specifically, an attacker designs some initial queries to interact with the system and subsequently collects the data. The goal of the attacker guides the design of these interactions, referring to the analysis results from the reconnaissance phase. All data closely related to the controlled information can be gathered as a dataset for the stealing attack. For example, if an attacker intends to identify which app is launched on a user's mobile device, system files such as procfs records of app launching activities should be collected [36]. In addition, various apps are launched several times by the attacker for app identification. By running 100 apps, [10] gathered a dataset of kernel information about the foreground User Interface (UI) refreshing process to identify different apps. Active collection for stealing keystroke information and secret keys is similar to that for stealing user activity information. By interacting with the operating system using different keystroke inputs, [34, 35] collected sensor information, such as acceleration data and video records, to infer keystroke movements. Moreover, [30] and [40] collected cache data to analyze the relationship between memory access activities and different secret keys.
Additionally, active collection for stealing ML-related information aims to design effective and efficient queries against the target model. The design of active collection varies depending on whether the target model allows black-box or white-box access (see more details in Section 2.2). With black-box access, the inputs and corresponding outputs are collected to reveal the model's internal information. The number of inputs should be sufficient to measure the model's functionality [12, 29] or to clarify its decision boundary [41, 42, 18]. [41] synthesized a set of inputs in order to train a local model to substitute for the target model. The range of inputs should also be wide enough to include samples both inside and outside the model's training set [38, 43]. With white-box access, not only the inputs and outputs but also some internal information is collected to infer training data information. For instance, [39, 44] updated the model by training a local model with data having different features, and the changes in the global model's parameters are utilized to infer the targeted feature values.
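As a concrete illustration of active collection under black-box access, the minimal sketch below queries a prediction endpoint and records each input together with the returned label and confidence value. The `query_target` function is a hypothetical stand-in, not a real service API; in practice it would be a network call to the MLaaS interface identified during reconnaissance.

```python
# Minimal sketch of active black-box data collection (assumed, illustrative).
import numpy as np

rng = np.random.default_rng(0)

def query_target(x):
    # Hypothetical stand-in for the MLaaS prediction API: here it merely
    # simulates a returned label and confidence value for the query input.
    score = 1.0 / (1.0 + np.exp(-x.sum()))
    return int(score > 0.5), float(score)

def collect_dataset(num_queries=1000, num_features=20):
    records = []
    for _ in range(num_queries):
        # Sample inputs widely so the queries cover points both inside and
        # outside the (unknown) training distribution.
        x = rng.uniform(-1.0, 1.0, size=num_features)
        label, confidence = query_target(x)
        records.append((x, label, confidence))
    return records

dataset = collect_dataset()
```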
The other kind of collection is passive collection, defined as gathering all data related to the targeted controlled information without engaging with the targeted system/service directly. In this thesis, passive collection is mainly used to steal password information, where the targeted system/service is typically a login system or one requiring granted permissions. In such a case, engaging with the target system/service directly would only tell attackers whether a guessed password is correct or not. This information can be used to validate an ML-based attack but cannot contribute to an attack model as training or testing data. [31] cracked users' passwords by generating many high-probability passwords based on people's password-creation behaviours, while [33] generated passwords based on the semantic structure of passwords. Specifically, attackers collect relevant information such as network data, personally identifiable information (PII), previously leaked passwords, the service site's information and so on [31, 33, 32]. This information can be gathered by searching online and accessing open-source data such as leaked passwords (as shown in Table 2.10). Attacks targeting password data primarily use passive collection to gather information.
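To illustrate the probabilistic guessing style referenced above, the sketch below trains a character-level Markov chain on an assumed list of leaked passwords and samples high-probability candidates. The password list and parameters are purely illustrative; the PCFG- and Markov-based attacks cited here are considerably more sophisticated.

```python
# Minimal character-level Markov sketch of probabilistic password guessing.
from collections import defaultdict
import random

def train_markov(passwords):
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        chars = ["^"] + list(pw) + ["$"]          # start/end markers
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_password(counts, max_len=16, rng=random.Random(0)):
    out, prev = [], "^"
    while len(out) < max_len:
        nxt_chars = list(counts[prev])
        weights = [counts[prev][c] for c in nxt_chars]
        nxt = rng.choices(nxt_chars, weights=weights, k=1)[0]
        if nxt == "$":                            # end-of-password marker
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

leaked = ["password1", "qwerty12", "dragon99", "letmein1"]   # illustrative only
model = train_markov(leaked)
guesses = [sample_password(model) for _ in range(5)]
```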
All of the stealing attacks investigated in this thesis utilize supervised learning algorithms, so the ground truth needs to be established in the data collection phase. Among the investigated ML-based stealing attacks, the collected data are labeled with the target information or something closely related to it. For instance, the kernel data about the foreground UI refreshing process caused by app launching activities are labeled with the corresponding apps [10]. [38] and [43] labeled data as members or non-members of the target training set. Similarly, for stealing authentication information, [35] and [34] labeled the collected sensor data with the corresponding input keystrokes. The ground truth of datasets for ML-based stealing attacks is closely related to the attackers' target information.

2.1.3 Feature Engineering


After the datasets are prepared, feature engineering is the subsequent essential phase, generating representative vectors of the data to empower the ML model. The two key points of feature engineering for ML-based attacks are dataset cleaning and feature extraction.
One obstacle of feature engineering is cleaning the noise and irrelevant information in the raw data. In general, deduplication and interpolation can be used to reduce the noise from accessible resources [10]. To reduce noise, a Fast Fourier Transform (FFT) filter and an Inverse FFT (IFFT) filter can be applied [35]. Other popular methods extract refined information and replace redundant information, such as Dynamic Time Warping (DTW) and Levenshtein-distance (LD) algorithms for similarity calculation in time series data [10, 36, 37], Symbolic Aggregate approXimation (SAX) for dimensionality reduction [37, 45], normalization and discretization for effectiveness [37], and Bag-of-Patterns (BoP) representation and Short Time Fourier Transform (STFT) for feature refinement [37, 46, 47].
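As a small illustration of the FFT/IFFT filtering step mentioned above, the sketch below removes high-frequency components from a one-dimensional sensor trace. The cutoff frequency and the example signal are assumed values for illustration, not parameters taken from the cited attacks.

```python
# Minimal sketch of FFT-based noise reduction on a 1-D sensor trace,
# assuming components above `cutoff_hz` are treated as noise.
import numpy as np

def fft_denoise(signal, sample_rate_hz, cutoff_hz):
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    spectrum[freqs > cutoff_hz] = 0               # drop high-frequency noise
    return np.fft.irfft(spectrum, n=len(signal))  # inverse FFT back to time domain

# Example: a 10 Hz tone corrupted by 200 Hz noise, sampled at 1 kHz.
t = np.arange(0, 1.0, 1e-3)
noisy = np.sin(2 * np.pi * 10 * t) + 0.3 * np.sin(2 * np.pi * 200 * t)
clean = fft_denoise(noisy, sample_rate_hz=1000, cutoff_hz=50)
```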
In order to extract features, it is necessary to analyze and clarify the relationship between the dataset and the targeted controlled information. This relationship determines what kinds of features the attacker should extract. For instance, the inputs and their corresponding confidence values reveal the behaviour of a model hosted in a cloud service (such as a Google service), so adversaries choose each query's confidence value as a key feature. Therefore, this relationship can be leveraged to steal an ML model and a customer's training samples using reverse-engineering techniques [12] and shadow training sample generation [38]. Specifically, using reverse-engineering techniques, [12] revealed the target model's parameters by finding the thresholds where the confidence value changes with various inputs. Shadow training samples are intended to be statistically similar to the target training set and are synthesized according to the inputs with high confidence values.
When targeting user activities, several feature extraction approaches are applied to kernel datasets for the stealing attack. [10] and [36] noticed that different foreground apps can be characterized by changes in the electrostatic field, which are reflected in interrupt timing log files on Android. Hereafter, statistics of the interrupt timing data are calculated as features [10]. Feature extraction techniques depend on the type of useful information. For example, several extraction techniques, including interrupt increment computation, gram segmentation, difference calculation, and histogram construction, are specialized for sequential data such as interrupt time series [10, 37, 48]. For the authentication information stealing attack, the ways of defining features are similar to the methods mentioned above [35, 31, 32]. One typical method is transforming the characteristics of the information into features, such as logical values of sensor states [49], temporal information about memory access activities [30], different kinds of PII from Internet resources [31, 33], and acceleration segments within a period of time collected from smartwatches' accelerometers [35]. In addition, manually defining features based on the attacker's domain knowledge is another popular method [36, 37, 50, 33].
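The following sketch illustrates the general flavour of such hand-crafted features, computing simple statistics, increments, and a histogram over an interrupt-timing series. The exact feature sets used in the cited attacks differ; the values here are illustrative.

```python
# Minimal sketch of statistics/increment/histogram features over an
# interrupt-timing series (assumed feature set, for illustration only).
import numpy as np

def interrupt_features(counts, n_bins=8):
    counts = np.asarray(counts, dtype=float)
    increments = np.diff(counts)                  # interrupt increments per tick
    hist, _ = np.histogram(increments, bins=n_bins)
    return np.concatenate([
        [counts.mean(), counts.std(), counts.min(), counts.max()],
        [increments.mean(), increments.std()],
        hist / max(len(increments), 1),           # normalized histogram
    ])

feature_vector = interrupt_features([120, 124, 131, 131, 140, 158, 158, 160])
```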

2.1.4 Attacking the Objective

(a) The First Attack Mode (b) The Second Attack Mode

Figure 2.2: Adapted from the cyber kill chain [24, 25], there are two ML-based attack modes: the first mode uses the ML-based model as a weapon to steal controlled information, while the model itself is the target of the second mode. Based on the results from reconnaissance, attackers design the input queries. By querying the target system/service, attackers collect the required accessible data from the inputs and their query outputs. To set up the ground truth, the data are labeled according to the target information. After feature engineering, a training dataset with labels is built to train a supervised ML model. In the first mode, unlabeled testing samples are fed to the model, whose outputs are the target information. In the second mode, the training dataset is used to reconstruct a model, which is the attacker's target.

In this thesis, we only consider the ML-based stealing attack as defined in Chapter 1, targeting user activity information, ML model related information, and authentication information. We summarize the ML-based stealing attack into two attack modes, as illustrated in Figure 2.2. Both attack modes share the same initial actions, which correspond to the first three phases of the MLBSA methodology. Specifically, the attacker first reconnoiters the environment storing the targeted controlled information. The environment provides an interface taking users' queries and responding to them. The attacker designs the input queries and queries the target system/service. As stated in the data collection phase, the inputs and their query results are collected as the required accessible dataset, which reveals the target information. Based on the target information, the ground truth of the dataset is set up in this phase. With proper feature engineering methods, the training dataset is prepared for attacking the objective. However, the subsequent actions to steal the controlled information using machine learning differ between the two attack modes.
In the first attack mode, as shown in Figure 2.2a, the training dataset is used to train an ML model to steal the controlled information. The testing dataset has the same features as the training dataset; it is collected from a victim's system/service, and the testing samples are not labeled when querying the attack model. Since the attack model is built to infer the controlled information from the accessible data, the output of the model is the targeted controlled information. This attack mode is applied in ML-based stealing attacks against user activity information, authentication information, and training set information. The literature applies ML algorithms to train classification models such as Logistic Model Tree (LMT) [49], k-Nearest Neighbors (k-NN) [10, 36, 37], Support Vector Machine (SVM) [48, 37], Naive Bayes (NB) [33, 49], Random Forest (RF) [35, 34, 43], Neural Network (NN) [50, 51], Convolutional Neural Network (CNN) [52] and logistic regression [18, 44]. Apart from these classification models, probabilistic forecasting models are popular for predicting the probability that a guessed password pattern matches the real password. A few probabilistic algorithms are applied, such as Probabilistic Context-Free Grammars (PCFG), the Markov model, and Bayesian theory [30, 33, 32].
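A minimal sketch of this first attack mode is shown below: an attack model (here a Random Forest, one of the algorithms listed above) is trained on labeled accessible data and then queried with unlabeled victim data, and its predictions are the inferred controlled information. The data are synthetic placeholders, not taken from any cited attack.

```python
# Minimal sketch of the first attack mode: the attack model is the weapon,
# and its output is the targeted controlled information (e.g. the foreground app).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 12))           # features from feature engineering
y_train = rng.integers(0, 5, size=500)         # labels = controlled info (toy data)
X_victim = rng.normal(size=(10, 12))           # unlabeled data from the victim

attack_model = RandomForestClassifier(n_estimators=100, random_state=0)
attack_model.fit(X_train, y_train)
stolen_info = attack_model.predict(X_victim)   # inferred controlled information
```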
In the second attack mode, illustrated in Figure 2.2b, the training dataset is used to train an ML model, while the model itself is the target of the attack. This attack mode is mostly applied in the ML-based stealing attack against ML model related information. In a black-box setup, the ML model stealing attack aims at recovering the detailed expression of the model's objective function; reconstructing the original model is essentially a reconstruction attack [18]. Using the equation-solving and path-finding methods [12, 29], the inputs and their query outputs used for solving the specific objective function expression are interpreted as the training set. Therefore, this attack can simply be regarded as an ML-based attack. Additionally, based on the attacker's inputs and the query outputs, a training set is synthesized and used to build a substitute model for reconstruction [41, 42]. Several ML algorithms were applied in the literature, such as decision trees [12, 41], SVM [12, 29], NN [12, 29, 42], Recurrent Neural Network (RNN) [32], ridge regression (RR), logistic regression, and linear regression [29, 12, 41].
Moreover, some popular and publicly available tools can be used to train the ML model for the attack, for example, WEKA and monkeyrunner [10, 34]. In summary, even when the model itself is the attacker's objective, the adversary can use ML techniques to predict results which reveal the controlled information in the training data.
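The sketch below illustrates the equation-solving idea behind the second attack mode for a logistic-regression target with confidence outputs: querying slightly more inputs than there are unknowns and inverting the sigmoid recovers the parameters almost exactly. The target model is simulated locally for illustration only; the attacks in [12, 29] handle far more general model classes.

```python
# Minimal equation-solving sketch against a simulated logistic-regression target.
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_true, b_true = rng.normal(size=d), 0.3       # hidden target parameters

def target_confidence(x):
    # Stand-in for the black-box MLaaS query returning a confidence value.
    return 1.0 / (1.0 + np.exp(-(x @ w_true + b_true)))

# Query d + 1 inputs and invert the sigmoid: logit(p) = w.x + b.
X = rng.normal(size=(d + 1, d))
p = np.array([target_confidence(x) for x in X])
A = np.hstack([X, np.ones((d + 1, 1))])        # unknowns are [w, b]
solution, *_ = np.linalg.lstsq(A, np.log(p / (1 - p)), rcond=None)
w_hat, b_hat = solution[:-1], solution[-1]     # recovered parameters
```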

2.1.5 Evaluation
During the evaluation phase, attackers measure how likely they are to successfully steal the controlled information. Evaluation metrics differ between the two attack modes. For the ML-based attack under the first attack mode, the attack evaluation measures the performance of the attack model: the higher the performance of the model, the more powerful the weapon the attacker has built. Under the second attack mode, the attack evaluation measures the differences between the attack model and the target model; the attack is considered more successful when the attack model is more similar to the target model. The evaluation metrics for the two attack modes are summarized separately.
For the first attack mode, the attack model is the attacker's weapon. Its performance is measured by effectiveness and efficiency. Specifically, metrics such as execution time and battery consumption are used for efficiency evaluation, while the metrics most commonly used to measure effectiveness include accuracy, precision, recall, FPR, FNR, and F-measure. Throughout this thesis, several evaluation metrics are derived from the confusion matrix shown in Table 2.1. The evaluation metrics are listed below, followed by a short sketch of how they are computed from the confusion matrix.

Table 2.1: Confusion Matrix for Evaluation.

                        Actual Class A          Actual Class B
  Predicted Class A     True Positive (TP)      False Positive (FP)
  Predicted Class B     False Negative (FN)     True Negative (TN)

• Accuracy: Also known as success rate or inference accuracy [10, 35, 34]. Accuracy is the ratio of correctly inferred samples to the total number of predicted samples, and is a generic metric for evaluating the attack model's effectiveness. Accuracy = (TP + TN) / (TP + TN + FP + FN).

• Precision: Regarded as one of the standard metrics for attack accuracy [38]. Precision is the percentage of samples correctly predicted as the controlled class A among all samples classified as A. Precision reveals the correctness of the model's performance on a specific class [36, 49, 50], especially when the features' values are binary [18]. Precision = TP / (TP + FP).

• Recall: Regarded as another standard metric for attack accuracy [38]. Recall is also called sensitivity or the True Positive Rate (TPR) [49]. It is the proportion of class A samples correctly predicted as class A. Like precision, recall reveals the model's correctness on a specific class; these two metrics are almost always applied together [36, 49, 18, 38, 43, 51, 44]. Recall = TP / (TP + FN).

• F-measure: This metric, or F1-score, is the harmonic mean of recall and precision, providing a combined view of both [49]. F-measure = (2 × Recall × Precision) / (Recall + Precision).

• False positive rate (FPR): The proportion of class B samples mistakenly categorized as class A. FPR assesses the model's misclassified samples. FPR = FP / (TN + FP).

• False negative rate (FNR): The proportion of class A samples mistakenly categorized as class B. Like FPR, FNR assesses the model's misclassified samples from another aspect; FPR and FNR are almost always applied together to measure the model's error rate [49]. FNR = FN / (TP + FN).

• Execution time: The time taken to train the model, which indicates the efficiency of the attack model [37, 10, 40].

• Battery consumption: Also known as power consumption [37]. When the target system is a mobile system [10, 36, 37], battery consumption refers to the target device's battery usage, which indicates the efficiency of the attack model.
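The short sketch below computes the effectiveness metrics above directly from the confusion-matrix counts in Table 2.1; the counts used in the example call are arbitrary.

```python
# Effectiveness metrics derived from the confusion matrix (TP, FP, FN, TN).
def effectiveness_metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)                   # a.k.a. TPR / sensitivity
    f_measure = 2 * recall * precision / (recall + precision)
    fpr       = fp / (fp + tn)
    fnr       = fn / (fn + tp)
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                f_measure=f_measure, fpr=fpr, fnr=fnr)

print(effectiveness_metrics(tp=80, fp=10, fn=20, tn=90))
```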

For the second attack mode, ML-based attacks that steal the ML model are assessed with other metrics. This kind of attack is the ML model reconstruction attack, and inherently the reconstruction attack requires a set of comparison metrics. The target of this kind of attack is an ML model f̂ which closely matches the original ML model f. Generally, the stolen model f̂ is constructed locally, and its prediction results are compared with the results of the original model on the same inputs. The applied evaluation metrics are defined and listed below, followed by a short sketch illustrating the test error and extraction accuracy:

• Test error: The average disagreement between the learned model and the target model over the same test set D [12]. A low test error means f̂ matches f well. Error_test(f, f̂) = Σ_{x∈D} diff(f(x), f̂(x)) / |D|.

• Uniform error is an estimation of the portion of full feature space that the learned model is different
from the targeted one, when the testing set (U ) are selected uniformly [12]. Errorunif orm (f, fˆ) =
dif f (f (x),fˆ(x))
P
x∈U
|U |

• Extraction accuracy indicates the performance of the model extraction attack based on the test error or the uniform error [12]. Accuracy_extraction = 1 − Error_test(f, f̂) or 1 − Error_uniform(f, f̂)

• Relative estimation error (EE) measures the effectiveness of the model extraction attack by contrasting the learned hyperparameter λ̂ with the original hyperparameter λ [29]. Error_EE = |λ̂ − λ| / λ

• Relative mean square error (MSE) measures how well the model extraction attack reconstructs regression models by comparing the mean square errors obtained after learning the hyperparameters with cross-validation techniques [29]. Error_MSE = |MSE_λ̂ − MSE_λ| / MSE_λ

• Relative accuracy error (AccE) measures how well the model extraction attack reconstructs classification models by comparing the accuracy errors obtained after learning the hyperparameters with cross-validation techniques [29]. Error_AccE = |AccE_λ̂ − AccE_λ| / AccE_λ
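As an illustration of these comparison metrics, the sketch below (with hypothetical stand-in models) estimates the extraction accuracy of a stolen model f̂ against the target f on a test distribution D and on uniformly sampled inputs U.

```python
import numpy as np

def extraction_accuracy(f, f_hat, X):
    """1 - test/uniform error: fraction of inputs on which the stolen model
    f_hat agrees with the target model f (both are callables returning labels)."""
    disagreement = np.mean([f(x) != f_hat(x) for x in X])
    return 1.0 - disagreement

# Hypothetical usage: D drawn from the test distribution, U drawn uniformly
# from the feature space, both as arrays of feature vectors.
rng = np.random.default_rng(0)
f     = lambda x: int(x.sum() > 0)        # stand-in "target" model
f_hat = lambda x: int(x.sum() > 0.1)      # stand-in "stolen" model
D = rng.normal(size=(1000, 5))
U = rng.uniform(-1, 1, size=(1000, 5))
print(extraction_accuracy(f, f_hat, D), extraction_accuracy(f, f_hat, U))
```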

The adversary applies these evaluation metrics to determine whether the attack's performance is satisfactory. If any metric does not meet expectations, the adversary can restart the stealing attack by redefining the targeted controlled information. The stealing attack can be executed incrementally until the attacker obtains satisfactory results.

2.2 Stealing ML Model Related Information


ML model related information consists of the model description, training data information, testing data information, and testing results. In this subsection, the ML model and users' uploaded training data, both stored in the cloud, are the targets. By querying the model via MLaaS APIs, the prediction/classification results are displayed. The model description and training data information are controlled; otherwise, an attacker could easily infer them from a victim's query results. As most ML services charge users per query [53, 54, 55], this kind of attack may cause huge financial losses [12]. Additionally, several ML models, including neural networks, suffer from adversarial examples. By adding small but intentionally worst-case perturbations to inputs, adversarial examples cause the model to predict incorrect answers [56]. By revealing knowledge of either the model's internal information or its training data, the stealing attack can facilitate the generation of adversarial examples [41, 42]. The generalized attack in this category is illustrated in Fig. 2.3. Leveraging the query inputs and outputs, the model description can be stolen with a model extraction attack or a hyperparameter stealing attack, and the training samples can be stolen with the model inversion attack, Generative Adversarial Network (GAN) attack, membership inference attack, and property inference attack. The countermeasures mitigating these attacks are summarized at the end of this subsection.

Figure 2.3: The ML-based stealing attack against ML model related information. In this category,
ML-based attacks aim at stealing the training samples or the ML model.

2.2.1 Stealing controlled ML model description


It is important to protect the confidentiality of ML models online. If the ML model's description were stolen, the profit of the MLaaS platform may diminish because of its pay-per-query deployment [12]. If spam or fraud detection is based on ML models [13, 14, 16, 15], understanding the model means that adversaries can evade detection [15]. A specific ML model is defined by two important elements: the ML algorithm's parameters and its hyperparameters. Parameters are learned from the training data by minimizing the corresponding loss function. Hyperparameters balance the loss function and the regularization terms within the objective function, and they cannot be learned directly from the estimators. Since the model is controlled, its parameters and hyperparameters should be deemed confidential by nature. The main approaches to stealing these model descriptions are equation-solving, path-finding, and linear least squares methods.

Table 2.2: Stealing Controlled ML Model Description

Reference | Dataset for Evaluation (Description) | Targeted ML Model | Attack Methods
[12] | Circles, Moons, Blobs (synthetic, 5,000 records with 2 features); 5-Class (synthetic, 1,000 records with 20 features); Steak Survey [57] (331 records with 40 features); GSS Survey [58] (16,127 records with 101 features); Adult (Income/race) [57] (48,842 records with 108/105 features); Iris [57] (150 records with 4 features); Digits [59] (1,797 records with 64 features); Breast Cancer [57] (683 records with 10 features); Mushrooms [57] (8,124 records with 112 features); Diabetes [57] (768 records with 8 features) | Logistic Regression; Decision Tree; SVM; Three-layer NN | Equation-solving attack; Path-finding attack
[41] | MNIST [60] (70,000 handwritten digit images); GTSRB [61] (49,000 traffic sign images) | DNN; SVM; k-NN; Decision Tree; Logistic Regression | Jacobian-based Dataset Augmentation
[29] | Diabetes [57] (442 records with 10 features); GeoOrig [57] (1,059 records with 68 features); UJIIndoor [57] (19,937 records with 529 features); Iris [57] (100 records with 4 features); Madelon [57] (4,400 records with 500 features); Bank [57] (45,210 records with 16 features) | Regression algorithms; Logistic regression algorithms; SVM; NN | Equation solving
[42] | MNIST [60] (70,000 handwritten digit images) | NNs | Metamodel methods

Stealing Parameters Attack: Model extraction attacks targeting ML models of MLaaS systems were described in [12]. The goal of the model extraction attacks was to construct the adversary's own ML model that closely mimics the original model on the MLaaS platform; that is, the constructed ML model duplicates the functionality of the original one. During the reconnaissance, MLaaS allows clients to access the predictive model in a black-box setting through API calls, so the adversary can only obtain query results. Most MLaaS platforms provide information-rich query results consisting of high-precision confidence values and the predicted class labels. Adversaries can exploit this information to perform the model extraction attack. The first step was collecting confidence values with query inputs. Feature extraction maps the query inputs into the feature space of the original training set, and feature extraction methods were applied for categorical and numerical features (Table 2.2). Equation-solving and path-finding attacks were then used to recover the objective function of the targeted model. Three popular ML models, listed in Table 2.2, were targeted, while two online services, namely BigML [62] and Amazon ML [53], were compromised as case studies. The key processes of a model extraction attack include query input design, confidence value collection, and attack with equation-solving and path-finding.
To steal the model's parameters, the equation-solving attack and the path-finding attack are illustrated in detail. Regarding the attack in [12], the equation-solving attack works on models that expose confidence values, including logistic regression and NNs, whereas the path-finding attack works on decision tree models. Equation solving treats the returned class probabilities as equations in the unknown parameters and then solves for the model. Specifically, the objective function of the targeted ML model is the equation the adversary aims to solve: with several query inputs and their predicted probabilities, the parameters of the objective function are calculated. The path-finding attack exploited specificities of the ML API, querying specific inputs in order to traverse the decision tree. A path-finding algorithm and a top-down approach revealed the paths of the tree, so the detailed structure of the targeted decision tree classifier was reconstructed. In the experiments, the attack's performance was measured by the extraction accuracy. The online model extraction attack targeted a decision tree model set up by users on BigML [62]; the accuracy was over 86% irrespective of the completeness of queries. In another case study targeting an ML model on Amazon services, the attacker reconstructed a logistic regression classification model. The results showed that the cost of this attack was acceptable in terms of time consumption (less than 149s) and the price charged ($0.0001 per prediction). The model was learned by calculating the parameters.
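A minimal sketch of the equation-solving idea for a binary logistic regression target is shown below; the oracle, feature dimension, and query points are hypothetical. Because the API returns high-precision confidence values, inverting the sigmoid turns each query into one linear equation in the unknown parameters, so d+1 independent queries recover them exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w_true, b_true = rng.normal(size=d), 0.3

def query(x):
    """Black-box oracle: returns the high-precision confidence value."""
    return 1.0 / (1.0 + np.exp(-(x @ w_true + b_true)))

# d + 1 linearly independent queries suffice to solve for (w, b) exactly.
X = rng.normal(size=(d + 1, d))
p = np.array([query(x) for x in X])
logits = np.log(p / (1.0 - p))                  # invert the sigmoid
A = np.hstack([X, np.ones((d + 1, 1))])         # unknowns are [w, b]
theta = np.linalg.solve(A, logits)
w_stolen, b_stolen = theta[:-1], theta[-1]
print(np.allclose(w_stolen, w_true), np.isclose(b_stolen, b_true))
```

A decision tree target returns no such continuous confidences, which is why it requires the path-finding attack described above instead.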
Apart from reconstructing the exact model parameters, another model extraction attack reveals the model's internal information by building a substitute model, as shown in [41]. Herein, the substitute model shares similar decision boundaries with the target model. During the reconnaissance, adversaries can only obtain labels predicted by the target model for given inputs. To train this substitute model, a substitute dataset is collected using a synthetic data generation technique named Jacobian-based Dataset Augmentation, starting from a small initial set [41]. Specifically, the ground-truth label of a synthetic sample is the label predicted by the target model, while the substitute's architecture is selected based on an understanding of the classification task. The best synthetic training set is determined by the substitute model's accuracy and the similarity of decision boundaries. To approximate the target model's boundaries, the Jacobian matrix is used to identify the directions of change in the target model's output. Hence, the model can be reconstructed as a substitute model.
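The following sketch illustrates Jacobian-based dataset augmentation under simplifying assumptions: a label-only two-dimensional oracle, a logistic-regression substitute, and finite-difference gradients standing in for the analytic Jacobian.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def oracle(X):
    """Label-only black-box target model (hypothetical decision boundary)."""
    return (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

def jacobian_sign(model, x, label, eps=1e-4):
    """Finite-difference sign of d P(label|x) / dx for the substitute model."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (model.predict_proba([x + e])[0, label] -
                   model.predict_proba([x - e])[0, label]) / (2 * eps)
    return np.sign(grad)

lam = 0.1
X = rng.normal(size=(20, 2))                     # small initial seed set
for epoch in range(4):                           # substitute-training rounds
    y = oracle(X)                                # labels come from the oracle
    substitute = LogisticRegression().fit(X, y)
    # Augment: push each point along the substitute's sensitivity direction.
    X_new = np.array([x + lam * jacobian_sign(substitute, x, yi)
                      for x, yi in zip(X, y)])
    X = np.vstack([X, X_new])
print("substitute/oracle agreement:",
      (substitute.predict(X) == oracle(X)).mean())
```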
Stealing Hyperparameters Attack: Stealing the hyperparameters in the objective function of a targeted MLaaS model can bring financial benefits [29]. The investigated MLaaS models can be regarded as black boxes providing query results only, like Amazon ML [53] and Microsoft Azure Machine Learning [54]. By analyzing the model's training process, a key observation showed that the parameters are learned when the objective function reaches its minimum value. That is, the gradient of the objective function at the model parameters should be a vector whose entries are all close to zero. According to this observation, the hyperparameters were learned covertly from a system of linear equations obtained by setting the gradient to the zero vector.
A threat model was proposed in [29], in which the attacker acts as a legitimate user of the MLaaS platform. Some popular ML algorithms used by the platform were analyzed (Table 2.2); that is, the attacker knew the ML algorithm in advance. Given the learned model's parameters, the attacker set the gradient of the objective function of the non-kernel/kernel algorithm to zero. By solving this equation with the linear least squares method, the attacker found the hyperparameters. For some black-box MLaaS models, the attacker first applied the equation-solving parameter stealing attack from [12] to learn the parameters. Thus, even when the parameters were unknown beforehand, the attacker could steal the hyperparameters, and the target model was reconstructed successfully.
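A minimal sketch of the zero-gradient idea for a ridge-regression target is given below (hypothetical data; scikit-learn's Ridge stands in for the MLaaS model). Setting the gradient of the objective to zero yields an overdetermined linear system whose only unknown is the hyperparameter λ, solved here by linear least squares.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

lam_true = 2.5
# The "target" model: its learned parameters w are assumed known (or stolen
# first with an equation-solving attack, as described in the text).
w = Ridge(alpha=lam_true, fit_intercept=False).fit(X, y).coef_

# Zero-gradient condition of the ridge objective ||Xw - y||^2 + lam * ||w||^2:
#   2 X^T (Xw - y) + 2 lam w = 0   =>   lam * w = -X^T (Xw - y)
# Solve the overdetermined linear system for the single unknown lam.
b = -X.T @ (X @ w - y)
lam_stolen = np.linalg.lstsq(w.reshape(-1, 1), b, rcond=None)[0].item()
print(lam_true, lam_stolen)        # relative estimation error is close to zero
```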
To evaluate the effectiveness of this hyperparameter stealing attack [29], several real-world datasets listed in Table 2.2 were used. Additionally, a set of hyperparameters spanning a large range was predefined. The scikit-learn package was used to implement the different ML models and to work out the value of each hyperparameter. For the experimental evaluation, the relative mean square error (MSE), relative accuracy error (AccE), and relative estimation error (EE) were applied. The results showed a high attack accuracy, with all estimation errors below 10%; this good performance indicated that the attacker successfully stole the target model.
As this attack [29] was implemented on MLaaS platforms, three methods were suggested for learning an accurate model at lower cost: uploading the entire training set with a specific learning algorithm, uploading a randomly selected part of the training set, and "Train-Steal-Retrain". The third method proved more accurate and less time-consuming: the attacker trains a model with part of the training set (as in the second method), steals its hyperparameter, and then re-learns the model with the entire training set using the specified learning algorithm and the stolen hyperparameter. "Train-Steal-Retrain" is the best of the three in practical attacks.
Regarding the target model as a black box, its hyperparameters can also be learned by building a metamodel that takes various classifiers' input and output pairs as training data [42]. First, guided by the observed outputs of the target model on given inputs, a diverse set of white-box models is trained by varying the values of various hyperparameters (e.g., the activation function, the existence of dropout or max-pooling layers, etc.). More importantly, these white-box models are expected to be similar to the target model. The training set of the metamodel is collected by querying inputs over these white-box models, while the ground-truth label is the hyperparameter value used by the corresponding white-box model. Afterwards, by querying the target model and feeding its output to the metamodel, its hyperparameter can be predicted. The hyperparameter stealing attack is complementary to the parameter stealing attack.
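The sketch below illustrates the metamodel idea under assumed, simplified conditions: a family of small scikit-learn MLPs varying only in their activation function serves as the white-box models, their outputs on a fixed probe set are the metamodel features, and a random forest acts as the metamodel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
probes = rng.normal(size=(30, 10))               # fixed query inputs

def fingerprint(model):
    """Flatten the model's outputs on the probe queries into a feature vector."""
    return model.predict_proba(probes)[:, 1]

# Train a family of white-box models that vary in one hyperparameter
# (the activation); their probe outputs form the metamodel's training set.
meta_X, meta_y = [], []
for act in ["relu", "tanh", "logistic"]:
    for seed in range(10):
        m = MLPClassifier(hidden_layer_sizes=(16,), activation=act,
                          max_iter=300, random_state=seed).fit(X, y)
        meta_X.append(fingerprint(m)); meta_y.append(act)
metamodel = RandomForestClassifier(random_state=0).fit(meta_X, meta_y)

# Black-box target: only its probe outputs are observed.
target = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                       max_iter=300, random_state=99).fit(X, y)
print("inferred activation:", metamodel.predict([fingerprint(target)])[0])
```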

2.2.2 Stealing controlled ML model’s training data


Another type of controlled information about an MLaaS product is the training data. Training data is not only useful for constructing the model with the ML algorithms provided by an MLaaS platform, but also sensitive, as the records can contain private information [18, 63]. For example, a user's health diagnostic model is trained on personal healthcare data [38]. Hence, the confidentiality of the model's training data should be protected. The ML-based stealing attacks include the model inversion attack, GAN attack, membership inference attack, and property inference attack. Moreover, several protections are demonstrated: one uses adversarial regularization against the membership inference attack, another aggregates teacher models with PATE, and a third utilizes count featurization to protect the models' training data.

Table 2.3: Stealing Controlled ML Model's Training Data.

Reference | Dataset for Experiment (Description) | Feature Engineering | ML-based Attack Method
[18] | FiveThirtyEight survey (553 records with 332 features); GSS marital happiness survey (16,127 records with 101 features) | N/A | Decision Tree, Regression model
[39] | MNIST [60] (70,000 handwritten digit images); AT&T [64] (400 personal face images) | Features learned with DNN | Convolutional Neural Network (CNN) with GAN
[38] | CIFAR10 [65] (6,000 images in 10 classes); CIFAR100 [65] (60,000 images in 100 classes); Purchases [66] (10,000 records with 600 features); Foursquare [67] (1,600 records with 446 features); Texas hospital stays [68] (10,000 records with 6,170 features); MNIST [69] (10,000 handwritten digit images); Adult (income) [57] (10,000 records with 14 attributes) | Regarded shadow model results as features and label records as in/out | NN
[43] | The 6 datasets in [38] (as above); News [70] (20,000 newsgroup documents in 20 classes); Face [71] (13,000 faces from 1,680 individuals) | Regarded shadow model results as features and label records as in/out | Random Forest, Logistic Regression, Multilayer perceptron
[51] | Adult (income) [57] (299,285 records with 41 features); MNIST [60] (70,000 handwritten digit images); CelebFaces Attributes [72] (more than 200K celebrity images); Hardware Performance Counters (36,000 records with 22 features) | Neuron sorting, Set-based representation | NN
[44] | Face [71] (13,233 faces from 5,749 individuals); FaceScrub [73] (76,541 faces from 530 individuals); PIPA [74] (60,000 photos of 2,000 individuals); Yelp-health, Yelp-author [75] (17,938 reviews, 16,207 reviews); FourSquare [67] (15,548 users in 10 locations); CSI corpus [76] (1,412 reviews) | N/A | Logistic regression, gradient boosting, Random Forests
[77] | CIFAR100 [65] (60,000 images in 100 classes); Purchase100 [66] (197,324 records with 600 features); Texas100 [68] (67,330 records with 6,170 features) | Regarded shadow model results as features and label records as in/out | NN

Model Inversion Attack & Defense: The model inversion attack was developed in [18] by querying commercial MLaaS APIs and leveraging the confidence information returned with predictions. Although an earlier model inversion attack proposed in [63] leaked sensitive information from the ML model's training set, that attack could not work well under other settings, e.g., when the training set has a large number of unknown features. In contrast, the attack proposed in [18] aimed to be applicable in both the white-box and black-box settings. In the white-box setting, an adversarial client had prior knowledge of the model description, as far as the APIs allowed. In the black-box setting, the adversary was only allowed to make prediction queries to the ML APIs with some feature vectors. Considered the useful data for the attack, the confidence values were extracted from the ML APIs by making prediction queries. The attacks were implemented in two case studies — inferring features of the training dataset and recovering training images. The model inversion attack targets the ML model's training data under both settings.
The first attack inferred sensitive input features from a decision tree classifier. BigML [62] was used to reveal the decision tree's training and querying routines. With query inputs carrying different features and the corresponding confidence values, the attacker in [18] accessed marginal priors for each feature of the training dataset. In the black-box setting, the attacker utilized the inversion algorithm of [63] to recover the target's sensitive feature with weighted probability estimation; a confusion matrix was used to assess the attack. In the white-box setting, the white-box with counts (WBWC) estimator was used to guess the feature values. Evaluated on the General Social Survey (GSS) dataset [78], the white-box inversion attack on the decision tree classifier achieved 100% precision, while the black-box one achieved 38.8% precision. Additionally, the attack in the white-box setting obtained 32% less recall than that in the black-box setting. Compared to black-box attacks, white-box inversion attacks show a significant advantage in feature leakage, especially in precision.
The second attack recovered images from an NN model — a facial recognition service — accessed through APIs; to learn the training samples, the attacker first needed to query the recognition model. Two specific model inversion attacks were proposed in [18]: reconstructing the victim's image given a label, and determining whether a blurred image existed in the training set. Specifically, the model inversion attack for facial recognition (MI-Face) method and the Process-Denoising Autoencoder (Process-DAE) algorithm were used to perform the attacks [18]. Herein, the query inputs and confidence values were used to refine the image. The best reconstruction performance in the evaluation was 75% overall accuracy and an 87% identification rate. Moreover, the attacker employed a maximum a posteriori estimator to assess the effectiveness. The evaluation results showed that the proposed attacks significantly enhanced the inversion attack efficacy compared to the previous attack [63], and the training images were recovered accurately.
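The sketch below captures the spirit of MI-Face-style inversion on a small digits classifier: gradient ascent on the input to maximize the target class confidence. It uses finite-difference gradients rather than backpropagation, and the model, dataset, and hyperparameters are illustrative assumptions rather than the setup of [18].

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

digits = load_digits()
X = digits.data / 16.0                           # pixels scaled to [0, 1]
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=400,
                      random_state=0).fit(X, digits.target)

def confidence(x, label):
    return model.predict_proba(x.reshape(1, -1))[0, label]

def invert(label, steps=60, lr=0.5, eps=1e-3):
    """Gradient-ascent inversion: start from a blank image and push it towards
    high confidence for `label`, using finite-difference gradients."""
    x = np.full(X.shape[1], 0.5)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x); e[i] = eps
            grad[i] = (confidence(x + e, label) - confidence(x - e, label)) / (2 * eps)
        x = np.clip(x + lr * grad, 0.0, 1.0)     # stay in the valid pixel range
    return x

recovered = invert(label=3)
print("confidence for class 3:", confidence(recovered, 3))
```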
Stealing the Training Data of a Deep Model with a GAN: An attack against privacy-preserving collaborative deep learning was designed to leak participants' training data, which might be confidential [39]. A distributed, federated, or decentralized deep learning algorithm can process each user's training set by sharing a subset of parameters obfuscated with differential privacy [79, 52]. However, the training dataset leakage problem had not yet been solved by the collaborative deep learning model [79]. An adversary can deceive the model with incorrect training samples to induce other participants to leak more local data. Then, by leveraging the nature of the learning process, the adversary can train a GAN to steal others' training samples. The GAN attack targets collaborative deep learning.
Specifically, the GAN simulated the original model in the collaborative learning process to leak the targeted training records [39]. During the reconnaissance phase, the adversary pretended to be one of the honest participants in collaborative deep learning, so that the adversary could influence the learning process and induce the victims to release more information about the targeted class. To collect the valuable data, the adversary did not need to compromise the central parameter server; instead, meaningful information about that class was inferred from the victim's changed parameters. In addition, as a participant helping to build the targeted model, the adversary knew part of the training samples. A fake training dataset could be sampled randomly from other datasets. The true training dataset and the fake training dataset were used to train the discriminator of the GAN with a CNN, and the outputs of this discriminator together with another fake training dataset were used to train the generator of the GAN with a CNN. Since the features of the training data were known by default, the adversary sampled a fake target record with the targeted label and random feature values and fed it into the generator model. The adversary then modified the feature values of this fake sample until the predicted label was the targeted label; the final modification of the fake sample was regarded as the target training sample. In the experiments, the GAN attack against collaborative learning was evaluated with the MNIST [60] and AT&T [64] datasets as inputs. Compared with the model inversion attack, the discriminator within the GAN attack reached 97% accuracy and clearly recovered the MNIST images trained in the collaborative CNN. In a word, the GAN attack trained the discriminator and generator to steal the training data.
Membership Inference Attack: Learning whether a specific data record was a member of the targeted MLaaS model's training set was the goal of [38]. Since the commercial ML models provided by Google and Amazon only allowed black-box access, not only the training data but also the training data's underlying distribution were controlled. Though the training set and the corresponding model were unknown, the output for a given input revealed the model's behavior. By analyzing such behaviors, adversaries found that the ML model behaved differently on inputs it was trained on compared to inputs that were new to it. According to this observation, an attack model was trained to recognize such differences and determine whether an input was a member of the targeted training set. The attack is intended to recognize the model's behavioral difference when tested with a target training sample.
The attack model was constructed by leveraging a shadow training technique [38]. Specifically, multiple "shadow models" were built to simulate the targeted model's behavior, with known ground-truth membership for their inputs. All "shadow models" used the same service (i.e., Amazon ML) as the targeted model. In addition, the training data used by the adversary can be generated by model-based synthesis and statistics-based synthesis methods. The generated dataset shared a similar distribution to the target model's training set, while the testing set was disjoint from the training set. By querying these "shadow models" with their training and testing sets, the prediction results were labelled as in or out, and these records were collected as the attack model's training set. The adversary then used the resulting binary classifier to determine whether a specific data record was in or out of the training set of the MLaaS model. Such an offline attack was difficult to detect, as the MLaaS system would consider the adversary a legitimate user who was simply querying online. Shadow models were trained to produce the inputs for the membership inference attack.
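A compact sketch of shadow training is given below; synthetic data, a single combined attack model instead of one per class, and scikit-learn MLPs as both target and shadow models are all simplifying assumptions for illustration. Shadow models with known membership produce the (prediction vector, in/out) records that train the attack classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=6000, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
# Disjoint pools: the target model's data vs. data the attacker uses for shadows.
X_tgt, y_tgt, X_shadow, y_shadow = X[:2000], y[:2000], X[2000:], y[2000:]

target = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200,
                       random_state=0).fit(X_tgt[:1000], y_tgt[:1000])

# Build the attack training set from shadow models whose membership is known.
attack_X, attack_y = [], []
for s in range(4):
    idx = np.random.default_rng(s).permutation(len(X_shadow))[:2000]
    Xs, ys = X_shadow[idx], y_shadow[idx]
    shadow = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200,
                           random_state=s).fit(Xs[:1000], ys[:1000])
    for Xpart, member in [(Xs[:1000], 1), (Xs[1000:], 0)]:   # in / out
        attack_X.append(shadow.predict_proba(Xpart))
        attack_y.append(np.full(len(Xpart), member))
attack_model = RandomForestClassifier(random_state=0).fit(
    np.vstack(attack_X), np.concatenate(attack_y))

# Infer membership of the target model's actual training/holdout records.
members = attack_model.predict(target.predict_proba(X_tgt[:1000]))
nonmembers = attack_model.predict(target.predict_proba(X_tgt[1000:2000]))
print("recall:", members.mean(), "FPR:", nonmembers.mean())
```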
For the evaluation of this membership inference attack [38], several public datasets were used, as listed in Table 2.3. Three targeted models were constructed with the Google Prediction API, Amazon ML, and a CNN respectively. The evaluation metrics used by the adversaries were accuracy, precision, and recall. According to the evaluation results, the Google Prediction API suffered the biggest training data leakage under this attack. The accuracy of the attack model was above the 50% baseline (random guessing) in all experiments, while precision was over 60% in all cases and recall was close to 100%. The membership inference attack learned the training samples effectively.
For mitigation, since overfitting is the most important reason that makes an ML model vulnerable to the membership inference attack [38], regularization techniques can be applied to the ML model to resolve the overfitting problem [80, 81, 82]. Another three mitigation strategies were described: restricting the prediction vector to the top k classes, coarsening the precision of the prediction results, and increasing the entropy of the prediction vector for NN models [83]. The first method, unfortunately, cannot fully prevent the membership inference attack; the last two obfuscate prediction vectors to mitigate the leakage. These restriction and obfuscation methods protect the training set only to a limited extent.
In 2019, [43] further studied the membership inference attack to make it broadly applicable at low cost. Specifically, three assumptions made in [38] are relaxed: using multiple shadow models, synthesizing a dataset from a distribution similar to that of the target model's training set, and knowing the target model's learning algorithm. The results show that the attack's performance is not affected when only one shadow model is trained with a dataset from another distribution. The results of using a different classification algorithm for a single shadow model are not promising; however, by combining a set of ML models trained with various algorithms into one shadow model, the performance of the membership inference attack becomes tolerable (above 85% in precision and recall). Herein, the attack relies on the assumption that one model of the model set is trained with the learning algorithm used by the target model. Furthermore, by selecting a threshold on the posterior results to determine the input data's membership, not even a shadow model is needed for the membership inference attack. Therefore, the scope of the membership inference attack is enlarged.
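A sketch of the shadow-free, threshold-based variant is shown below; the threshold value and the model interface (a scikit-learn-style predict_proba) are assumptions.

```python
import numpy as np

def threshold_membership(model, X, tau=0.9):
    """Declare a record a training-set member when the model's top posterior
    (maximum confidence) meets or exceeds a chosen threshold tau."""
    return (model.predict_proba(X).max(axis=1) >= tau).astype(int)

# Hypothetical usage: `target` is any classifier exposing predict_proba;
# tau can be tuned on a shadow model or chosen from domain knowledge.
# members = threshold_membership(target, candidate_records, tau=0.95)
```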
Property Inference Attack: Different from learning a specific training record, the property inference attack targets properties of the training data that the model producer did not intend to share. [51] defines the target model as a white-box Fully Connected Neural Network (FCNN), and aims to infer global properties such as a higher proportion of women in the training data. To launch this attack, a meta-classifier is built that takes a model as input and predicts whether the global property exists in this model's training set. First, several shadow models are trained on a similar dataset using similar training algorithms to mimic the target FCNN. During the feature engineering phase, instead of using a flattened vector of all parameters [84], [51] applies a set-based representation to form the meta-training set. Specifically, the set-based representation is learned using the DeepSets architecture [85]: 1) flatten each node's parameters in every hidden layer, 2) obtain a node representation with a node-processing function based on the target property, 3) sum the node representations of a layer into a layer representation, and 4) concatenate these layer representations into a classifier representation. The accuracy of this attack reached 85% or more on binary income prediction, smile prediction, and gender classification tasks. This property inference attack against white-box FCNNs is effective at stealing training set information.
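A compact sketch of the set-based representation is shown below; the layer sizes, the node-processing networks, and the random stand-in parameters are all hypothetical, and the resulting vector would feed a meta-classifier trained on shadow models.

```python
import torch
import torch.nn as nn

class SetBasedRepresentation(nn.Module):
    """DeepSets-style summary of a white-box FCNN, loosely following the four
    steps in the text (hypothetical layer sizes and node-processing nets)."""
    def __init__(self, node_dims, repr_dim=16):
        super().__init__()
        # One node-processing function phi per hidden layer (nodes form a set).
        self.phis = nn.ModuleList(nn.Sequential(nn.Linear(d, repr_dim), nn.ReLU())
                                  for d in node_dims)

    def forward(self, layers):
        # layers[i]: (num_nodes_i, node_dim_i) flattened parameters per node
        layer_reprs = [phi(nodes).sum(dim=0)          # 2) node repr, 3) layer sum
                       for phi, nodes in zip(self.phis, layers)]
        return torch.cat(layer_reprs)                 # 4) classifier representation

# Hypothetical target FCNN with hidden layers of 32 and 16 nodes whose incoming
# weights (plus bias) have been flattened per node in step 1).
layers = [torch.randn(32, 11), torch.randn(16, 33)]
rep = SetBasedRepresentation(node_dims=[11, 33])(layers)
print(rep.shape)      # feeds a meta-classifier predicting the global property
```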
In collaborative learning, leaking unintended features about participants' training data is another kind of property inference attack [44]. Instead of a global property, the targeted unintended feature holds for a certain subset of the training set or is even independent of the model's task; for example, the attacker infers whether the training data contains faces of a particular race while a gender classifier is being learned in a federated manner. In the reconnaissance process, the adversary, as a participant, can download the current joint model at each iteration of the collaborative learning. The aggregated gradient updates from all participants are computed; thereafter, the adversary can learn the aggregated updates excluding his own updates [86]. Since the gradients of one layer are calculated from this layer's features and the previous layer's error, such aggregated updates reflect feature values of other participants' private training sets. After several iterations, these updates are labeled with the targeted property and used to build a batch property classifier. Given model updates as inputs, this classifier can predict the corresponding unintended features effectively (most precisions larger than 80%). Therefore, collaborative learning is vulnerable to the property inference attack as well.
Protection using Adversarial Regularization: A protection for black-box MLaaS models against the membership inference attack was introduced in [77]. As described in [38], the membership inference attack can learn whether a data sample is a member of the targeted model's training set, even if the adversary only knows the queried output of the cloud service. Regularizing the ML model with L2-norm regularizers was one of the major mitigation methods [38, 18], but it is not considered to offer a rigorous defense. On the other hand, researchers concluded that differential privacy mechanisms prevent this information leakage at the cost of the model's usability. To guarantee the confidentiality and privacy of the training set rigorously, a privacy mechanism more powerful than regularization and differential privacy was proposed.
A defender's objective was analyzed first by formalizing the membership inference attack in [77]. Precisely, the input of the inference model consists of a testing record for the targeted classifier, its prediction vector, and a label about membership. The adversary aims to maximize his inference gain, which is affected by the targeted training dataset and a disjoint reference dataset used to train the inference attack. Therefore, the defender intends to minimize the adversary's inference gain while also minimizing the loss of the targeted classifier's performance. That is, the defender enhances the security of the ML model by training it in an adversarial process: the inference gain, as an intermediate result, is used as the classifier's regularizer to revise the ML model over several training epochs. An adversarial regularization is thus used in training the classifier.
To evaluate the defense mechanism, three datasets commonly used in membership inference attacks [38] were adopted. The classifier's loss was calculated when the attacker's inference gain reached its highest score [77]. The results showed that, for the Texas model, the classification loss changed from 29.7% without the defense to 7.5% with the defense, a difference that could be considered insignificant. For the membership inference attack, the attack accuracy against the protected ML model is around 50%, which is close to random guessing. In a word, protecting the model with adversarial regularization guaranteed the confidentiality and privacy of its training data.
The protection proposed in [77] was powerful against the membership inference attack. However, its
effectiveness in protecting training data leaked by other attacks remains unknown. Additionally, it did not
discuss whether adversarial regularization can protect the white-box MLaaS models from membership
inference attack or not. Moreover, this defense method could not deal with an online attack like stealing
the training data of deep model with GAN [39].
Protection using PATE: To protect the training set of an ML model in a general way, Private Aggregation of Teacher Ensembles (PATE) was proposed in [87]. Specifically, PATE prevents training set information leakage from the model inversion attack, GAN attack, membership inference attack, and property inference attack. Two kinds of models are trained in this general ML strategy: "teacher" and "student" models. Teacher models are trained directly on the sensitive data and are never published. The sensitive dataset is split into several partitions, and a teacher model is trained independently on each partition; these teacher models are deployed as an ensemble making predictions in a black-box manner. Given an input, the teachers' predictions are aggregated into a single prediction based on each teacher's vote. To keep the aggregation from leaking information when the teachers have no clear consensus, Laplacian noise is added to the vote counts. The student obtains a set of public data without ground truth and labels it by querying the teacher ensemble; the student model can then be built in a privacy-preserving manner by transferring knowledge from the teachers. Moreover, its variant PATE-G uses the GAN framework to train the student model with a limited number of labels from the teachers. In conclusion, the PATE framework provides a strong privacy guarantee for the model's training set.
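The noisy vote aggregation at the heart of PATE can be sketched as follows; the teacher predictions and the noise scale gamma are hypothetical values.

```python
import numpy as np

def noisy_teacher_vote(teacher_predictions, num_classes, gamma=0.05, rng=None):
    """PATE-style aggregation: count the teachers' votes for one query and add
    Laplacian noise (scale 1/gamma) before taking the argmax."""
    rng = rng or np.random.default_rng()
    counts = np.bincount(teacher_predictions, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(counts))

# Hypothetical usage: 50 teachers voting on one unlabeled public sample.
votes = np.random.default_rng(0).integers(0, 10, size=50)
student_label = noisy_teacher_vote(votes, num_classes=10)
print(student_label)
```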
Protection using Count Featurization: A limited-exposure data management system named Pyramid enhanced the protection of organizations' training data storage [88]. It mitigated the data breach problem by limiting the amount of widely accessible training data and constructing a selective data protection architecture. For emerging ML workloads, the selective data protection problem was formalized as a training set minimization problem: minimizing the training set limits the data that can be stolen.
In prior data management [89], only in-use data were retained in accessible storage for periodic ML training, whereas unused data were reserved in a protected area. However, with the rapid adoption of ML mechanisms, the whole dataset would be continuously exposed in accessible storage [88]. For this concern, distinguishing and extracting the data necessary for effective training was the key process. The workflow of Pyramid keeps accessible raw data within a small rolling window. The core method, named "count featurization", was used to minimize the training set. Specifically, the counts summarize historical aggregated information from the collected data, and Pyramid trains the ML model with the raw data featurized with counts in a rolling window. The counts are rolled over and infused with differential privacy noise to preserve the training set [90]. In addition, the balance between training set minimization and model performance (accuracy and scalability) should also be considered. Three specific techniques were applied to retrofit count featurization for data protection: the infusion with weighted noise adds less noise to noise-sensitive features of the training set; the unbiased private count-median sketch solves the negative bias problem arising from the noise infusion; and automatic count selection finds useful features automatically and counts them together. For training data protection, count featurization was used to retain only the necessary data within the data storage. Pyramid prevents the attacker from learning the extracted information from the training set.
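A toy sketch of count featurization with Laplace noise is given below; the column names and noise scale are hypothetical, and Pyramid's full design (rolling windows, count-median sketches, automatic selection) is not reproduced.

```python
import numpy as np
import pandas as pd

def count_featurize(df, feature, label, noise_scale=1.0, rng=None):
    """Replace a high-cardinality categorical feature with (noisy) per-label
    counts, in the spirit of Pyramid's count featurization."""
    rng = rng or np.random.default_rng()
    counts = pd.crosstab(df[feature], df[label]).astype(float)
    counts += rng.laplace(scale=noise_scale, size=counts.shape)  # DP-style noise
    return df[feature].map(counts.to_dict("index")).apply(pd.Series)

# Hypothetical usage on a toy log of (domain, clicked) records.
df = pd.DataFrame({"domain": ["a.com", "b.com", "a.com", "c.com", "b.com"],
                   "clicked": [1, 0, 1, 0, 0]})
print(count_featurize(df, "domain", "clicked"))
```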

Table 2.4: Categories of Stealing ML related information attacks from three perspectives (info: informa-
tion).
Attack Type | Attack Targets: Model Info, Training Set Info | Attack Surfaces: Training Phase, Inference Phase | Attacker's Capabilities: Black-box Access, White-box Access
Model extraction attack [12] YES no no YES YES no
Model extraction attack [41] YES no no YES YES no
Hyperparameter stealing attack [29] YES no no YES YES no
Hyperparameter stealing attack [42] YES no no YES YES no
Black-box inversion attack [18] no YES no YES YES no
White-box inversion attack [18] no YES no YES no YES
GAN attack [39] no YES YES no no YES
Membership inference attack [38] no YES no YES YES no
Membership inference attack [43] no YES no YES YES no
Property inference attack [51] no YES no YES no YES
Property inference attack [44] no YES YES no no YES

Table 2.5: Attack’s prior knowledge under black-box access and white-box access.
Attack Type | Black-box Access: Predicted Label, Predicted Confidence | White-box Access: Parameters, Hyper-parameters
Model extraction attack [12] YES YES no no
Model extraction attack [41] YES no no no
Hyperparameter stealing attack [29] YES YES no no
Hyperparameter stealing attack [42] YES YES no no
Black-box inversion attack [18] YES YES no no
White-box inversion attack [18] YES YES YES YES
GAN attack [39] YES YES YES YES
Membership inference attack [38] YES YES no no
Membership inference attack [43] YES YES no no
Property inference attack [51] YES YES YES YES
Property inference attack [44] YES YES YES YES

Summary: In Section 2.2, ML-based stealing attacks against model related information target either model descriptions or the model's training data. In addition to this categorization, as shown in Table 2.4, attacks can also be classified by the phase they exploit (training vs. inference) and by the required access (black-box vs. white-box) [91]. Model extraction attacks [12, 41] and hyperparameter stealing attacks [29, 42] leak the model's internal information at the inference phase. Attackers steal the model's training data mostly at the inference phase, except for the GAN attack [39] and the property inference attack [44], which happen at the training phase of collaborative learning. When attacking during the training phase, attackers with white-box access to the model can exploit its internal information. As shown in Table 2.5, white-box access gives attackers more prior knowledge than black-box access, which results in higher performance of the stealing attack [18]. On the other hand, black-box attacks are more applicable in the real world. Additionally, except for [43], most of the attackers in this category under black-box access know the learning algorithm of the target model [12, 29, 41, 42, 18, 50].
Countermeasures: Concerning the ML pipeline, protection methods can be applied in the data preprocessing phase, training phase, and inference phase respectively. Differential privacy noise used in the first phase can build a privacy-preserving training set [88]. Differential privacy is the most common countermeasure against the stealing attack; however, it alone cannot prevent the GAN attack [39]. Differential privacy, regularization, dropout, and rounding techniques are popular protections at the training and inference phases. At the training phase, differential privacy on parameters cannot resist the GAN attack [39], while rounding parameters is not effective against the hyperparameter stealing attack [29]. Regularization's effectiveness against the hyperparameter stealing attack depends on the targeted algorithm [29].

2.2.3 ML-based Attack about Audio Adversarial Examples Generation


Stealing controlled ML model information can boost another ML-based attack, namely adversarial example generation. Generally, audio adversarial examples are generated by adding intentionally small or imperceptible perturbations to benign examples. The aim of such adversarial examples is to mislead the model into predicting an incorrect answer. When this incorrect answer is designed by the adversary, we call such an attack a targeted adversarial attack. In this thesis, we only consider audio adversarial example generation against an automatic speech recognition (ASR) model.
Hidden Voice Commands: Carlini et al. [92] generated audio adversarial examples to hide voice commands under two threat models: a black-box model and a white-box model. In the black-box model, the attacker first extracts the benign audio's acoustic information through a transform function such as Mel-Frequency Cepstral Coefficients (MFCC). The MFCC is then inverted and some noise is added to generate adversarial audio. The attacker queries the target ASR model with this adversarial audio; if the machine cannot recognize the target phrase, the attacker refines the added noise and generates another adversarial audio until the target phrase is recognized. Meanwhile, the adversarial audio should not be intelligible to human listeners. In the white-box model, the attacker has full knowledge of the ASR model's internal information. Targeting the open-source CMU Sphinx speech recognition system [93], the attacker can utilize the coefficients of each Gaussian in the Gaussian Mixture Model (GMM) and some dictionary files containing the mapping from words to phonemes. With this information, the attacker can identify which MFCC coefficients need to be modified with noise to generate the adversarial audio. Herein, gradient descent is used for each frame to find the locally optimal noise to add to the benign audio.
Inaudible Attack: Carlini et al. [94] and Zhang et al. [95] improve the quality of the adversarial audio during the generation process. Specifically, the former work quantifies the distortion of the perturbation with white-box access: measuring the relative loudness on a logarithmic scale, the distortion of the perturbation is incorporated into the optimization objective. Targeting the open-source DeepSpeech end-to-end ASR model, the Connectionist Temporal Classification loss (CTC-loss) and gradient descent are used to modify the perturbation added to the benign audio. The latter work used the same method as [92]; however, instead of adding noise to the benign audio directly, it hides the voice commands on ultrasonic carriers to ensure the adversarial audio's inaudibility. Another work by Qin et al. [96] designed audio adversarial examples using a method similar to [94] with effectively imperceptible noise. Specifically, they utilized the psychoacoustic principle of auditory masking so that only the machine can recognize the commands.
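The sketch below shows the CTC-loss-based optimization in miniature: a stand-in linear "ASR model" over random features replaces DeepSpeech, and the target transcription indices, penalty weight, and step count are arbitrary assumptions. Only the optimization pattern — minimize the CTC loss towards the target phrase plus a distortion penalty on the perturbation — follows the attack described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, n_mels, n_chars = 50, 16, 28                  # frames, features, alphabet size

# Stand-in white-box ASR model: frame features -> per-frame character logits.
asr = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(), nn.Linear(64, n_chars))
ctc = nn.CTCLoss(blank=0)

benign = torch.randn(T, n_mels)                  # stand-in benign audio features
target = torch.tensor([8, 5, 12, 12, 15])        # target transcription (indices)
delta = torch.zeros_like(benign, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.01)
c = 0.05                                         # weight of the distortion penalty

for step in range(300):
    logits = asr(benign + delta)                             # (T, n_chars)
    log_probs = logits.log_softmax(-1).unsqueeze(1)          # (T, 1, n_chars)
    loss = ctc(log_probs, target.unsqueeze(0),
               torch.tensor([T]), torch.tensor([len(target)]))
    loss = loss + c * delta.norm()                           # keep the noise small
    opt.zero_grad(); loss.backward(); opt.step()

print("CTC loss:", loss.item(), "max perturbation:", delta.abs().max().item())
```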
Over-the-air Attack: Different from white-box and black-box attacks over-the-line, the over-the-air attack must account for the environmental noise affecting the generated adversarial audio. Both Yakura and Sakuma [97] and Schonherr et al. [98] analyzed room impulse responses during the adversarial example generation phase. In particular, Schonherr et al. [98] applied psychoacoustic methods to the noise signal and ensured the robust adversarial audio's inaudibility.
Summary: For the audio adversarial attack, the attacker usually utilizes gradient descent to modify the noise added to the benign audio under white-box access. With black-box access, the attacker regards the ASR model as an opaque oracle [92] and typically inverts the MFCC information of the benign audio. Psychoacoustic masking and ultrasonic carriers are used to produce high-quality adversarial audio whose hidden commands or noise are inaudible to human beings. One defense against audio adversarial example generation is to alert users with the machine's recognized commands. Another common countermeasure is to learn human perception of speech and build an ASR model robust to adversarial examples [98].

2.3 Stealing User Activities Information


It is essential for security specialists to protect user activities information, not only because private activities are valuable to adversaries, but also because an adversary can exploit specific activities (e.g., the foreground app) to perform malicious attacks such as phishing [10]. In general, attackers pursue two types of data — kernel data and sensor data, as shown in Fig. 2.4. We organize the reviewed papers according to the MLBSA methodology. The countermeasures against this kind of ML-based stealing attack are discussed at the end of Section 2.3. Using the kernel data and sensor data, controlled user activities information was stolen through timing analysis and frequency analysis.

Figure 2.4: The ML-based stealing attack against user activities information.

2.3.1 Stealing controlled user activities from kernel data


The dataset collected from the kernel about system process information is too noisy and coarse-grained to
disclose any intelligible and valuable information. However, through analyzing plenty of such data, the
adversary could deduce some confidential information about the victim’s activities with the help of ML
algorithms.
Stealing User Activities with Timing Analysis: The security implications of kernel information were evaluated in [10, 36] by analysing data associated with specific hardware components integrated into Android smartphones. During the reconnaissance phase, user activities information records the user's interactions with hardware devices before they are responded to by the kernel layer. The targeted user activities in [10] were unlock patterns and foreground apps, and users' browsing behavior was targeted by the attacker in [36]. One kind of kernel data, accessible to legitimate users, logged the time series of hardware interrupt information (Table 2.6).
Table 2.6: Stealing Controlled User Activities using Kernel Data

Reference | Dataset for Experiment (Description) | Feature Engineering | ML-based Attack Method
[10] | Collected from procfs (interrupt data for the unlock pattern and for apps) | Deduplication; Interpolation; Interrupt Increment Computation; Gram Segmentation; DTW | HMM with Viterbi algorithm; k-NN classifier with DTW
[36] | Collected from procfs (time series for apps, websites, keyboard guests) | Automatically extracted with tsfresh; DTW | Viterbi algorithm with DTW; SVM classifier with DTW
[37] | Data about apps; 1,000 website traces (1,200 x 6 time series of 120 apps (App Store + iOS) + 10 traces x 6 time series; 10 traces for each website) | Manually defined; SAX, BoP representation | SVM classifier; k-NN classifier with DTW
[48] | Collected from procfs (consecutively read data; resident size field) | N/A; Construct a histogram binning data into seven equal-weight bins | SVM classifiers

Such interrupt information can reveal previous activities: the reported interrupts imply the real-time running status of specific hardware (e.g., the touchscreen controller). However, access to similar process-specific information has been progressively restricted since Android 6, and the interrupt statistics became unavailable in Android 8 [36]. Different Android versions contain different kinds of process information that is accessible to legitimate users without permissions under the proc filesystem (procfs). Thus, an app was developed in [36] to search all accessible process information under procfs. The time series of these accessible data could distinguish the events of interest, including the unlocked screen, the foreground app, and the visited website. Reconnaissance showed the value of time-series data in procfs.
During the data collection, the interrupt time logs were collected by pressing and releasing the
touchscreens in [10]. Specifically, for the versions prior to Android 8, a variety of interrupt time series
recording the changes of electrostatic field from touchscreen were gathered as one dataset for stealing the
user’s unlock pattern. Another dataset was built for stealing foreground apps’ information by recording the
time series of starting the app from accessible sources like interrupts from the Display Sub-System [10]
and the virtual memory statistics [36]. Moreover, the time series of some network process information
fingerprinted online users. These fingerprints were gathered as the dataset for stealing user’s web browsing
information. Different sets of time series were prepared with respect to the information of different user
activities.
In terms of feature engineering, the attacker can analyze the process information in procfs to study the characteristics of the user's unlock pattern, foreground app status, and browsing activity. The datasets were first processed by deduplication, interpolation, and increment computation. The distinct features of the three datasets were constructed via several methods such as segmentation, similarity calculation, and dynamic time warping (DTW). An automatic method named tsfresh [99] was utilized for feature extraction. Subsequently, for the stealing attack targeting unlock patterns, a Hidden Markov Model (HMM) was used to model the attack and infer the unlock patterns through the Viterbi algorithm [100]; the evaluation results showed that its success rate significantly outperformed random guessing. Targeting foreground apps, the processed data was used to train a k-NN classifier, which achieved high accuracy — 87% on average in [10] and 96% in [36]. To reveal the user's browsing activities, an SVM classifier was used to mount the attack, with both precision and recall above 80% in [36]. Across these three attack scenarios, the battery and time consumption were acceptable (less than 1% and under 6 minutes). The ML-based stealing attack showed its effectiveness with low time and battery consumption.
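A minimal sketch of DTW-based k-NN classification of procfs-style time series is shown below; the traces, labels, and window length are synthetic stand-ins for the collected interrupt or memory counters.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D time series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_dtw_predict(train_series, train_labels, query, k=3):
    """k-NN over DTW distances, as used to fingerprint foreground apps."""
    dists = np.array([dtw_distance(query, s) for s in train_series])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.array(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical usage: short procfs counter traces labelled by foreground app.
rng = np.random.default_rng(0)
train = [np.cumsum(rng.normal(loc=mu, size=40)) for mu in (0.0, 0.0, 1.0, 1.0)]
labels = ["app_A", "app_A", "app_B", "app_B"]
print(knn_dtw_predict(train, labels, np.cumsum(rng.normal(loc=1.0, size=40))))
```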
Stealing User Activities with iOS Side-channel Attack: In iOS systems, one popular Linux side-channel attack vector for process information — procfs — is inaccessible, which prevents the aforementioned attacks from leaking the sensitive information. Attackers have therefore actively looked for new resources to exploit at the operating system level.
In the reconnaissance phase, several attack vectors that are feasible on Apple devices were applied to perform cross-app information leakage [37]. Specifically, three iOS vectors enabled apps to access global usage statistics through a timing channel without requiring any special permissions: the memory information, the network-related information, and the file system information. The attacker aimed to steal user activities information (such as foreground apps, visited websites, and map searches) and in-app activities (such as online transactions). To collect data for an ML-based attack, attackers manually collected several data traces for interesting events like foreground apps, website footprints, and map searches. To improve the performance of such an inference attack, the information collected from multiple attack vectors was combined and fed into the ML models; in particular, time-series data from the targeted vectors were exploited frequently. As for feature engineering and the stealing attacks, ML frameworks were utilized to exfiltrate the user's information from the accessible vectors [37]. The changes in the time series were reflected in the difference between two consecutive data traces. Feature processing methods were applied to transform the sequences into Symbolic Aggregate approXimation (SAX) strings [45] and to construct the Bag-of-Patterns (BoP) of the sequences. In [37], two ML-based attacks with a large amount of data were presented — classifying the user activities and detecting sensitive in-app activities. An SVM classifier was trained and tested for the former attack, and the Viterbi algorithm [100] with DTW was utilized for the latter. In the evaluation of the first attack, stealing three users' activities, the foreground app classification accuracy achieved 85.5%, the Safari website classification accuracy reached 84.5%, and map search inference accuracy reached 79%. The proposed attacks could be trained on the attacker's device and tested on other devices such as the victim's. Meanwhile, the power consumption was acceptable, with only 5% extra power used in an hour, and the attacks' execution time was tolerable as well (within 19 minutes). In the context of stealing user activities information, ML-based attacks exploited the OS-level data with time-series analysis.
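The SAX transformation used for these time series can be sketched as follows (the segment count and alphabet size are arbitrary choices); the resulting symbolic words can then be counted into a Bag-of-Patterns representation for the classifier.

```python
import numpy as np
from scipy.stats import norm

def sax(series, n_segments=8, alphabet_size=4):
    """Convert a time series into a SAX string: z-normalise, piecewise-aggregate
    (PAA), then map each segment mean to a letter via Gaussian breakpoints."""
    x = (series - series.mean()) / (series.std() + 1e-12)
    paa = np.array([seg.mean() for seg in np.array_split(x, n_segments)])
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(chr(ord("a") + s) for s in symbols)

# Hypothetical usage on one memory-usage time series sampled by the attack app.
trace = np.cumsum(np.random.default_rng(0).normal(size=64))
print(sax(trace))            # a Bag-of-Patterns then counts such SAX words
```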

2.3.2 Stealing controlled user activities using sensor data


The stealing attack using sensor data should be studied seriously by defenders, not only because of the availability of effective ML mechanisms, but also because of the popularity of sensing-enabled applications [101, 102, 103, 104, 105]. Sensor information, such as acoustic and magnetic data, can reveal controlled information indirectly, as demonstrated by the attacks below.

Table 2.7: Stealing Controlled User Activities using Sensor Data

Reference | Dataset for Experiment (Description) | Feature Engineering | ML-based Attack Method
[47] | Audio signature dataset (recorded with a phone put within 4 inches of the printer) | STFT, noise normalization | A regression model
[49] | Sensor dataset (sensor data collected from benign and malicious activities) | N/A | Markov Chain, NB, LMT (alternative algorithms e.g. PART)

Stealing Machine's Activities with Sensor-based Attack: A side-channel attack on manufacturing equipment was proposed in [47], exploiting sensor data collected by mobile phones to reveal the equipment's design and manufacturing process. The attacker managed to reconstruct the design of the manufactured product. During the reconnaissance, the adversary placed an attack-enabled phone near the targeted equipment, such as a 3D printer. The accessible acoustic and magnetic information reflected the product's manufacturing activities indirectly. During the data collection phase, the acoustic and magnetic sensors embedded in the phone recorded audio and magnetometer data from the manufacturing equipment. The magnetometer data was transformed into a type of acoustic information, and this acoustic signal information was combined into the training dataset. Hence, acoustic and magnetic data can be leveraged by the attack.
After the dataset was gathered, the ML-based attack in [47] was completed through feature engineering, model training, and evaluation. The features were extracted from the audio signal's frequency spectrum with the help of the STFT and noise normalization [47]. With the features constructed, the product's manufacturing process could be inferred by an ML model, especially for 3D printers; a regression algorithm was used to train the model for this attack. In the experiments, the adversaries tested the reconstruction of a star, a gun, and an airplane printed by a 3D printer. All products were reconstructed except the airplane, which looked more like a "fish mouth" [47]. As for the angle differences between the original product and the reconstructed one, the differences were within one degree on average, which was acceptable. A defense against this kind of attack was also proposed in [47]: the protection method obfuscates the acoustic leakage by adding noise (e.g., playing recordings) during production. Sensor-based attacks build up a model by analyzing the frequency signature of the manufacturing equipment, but noise injection can mitigate this attack to some extent.
Summary: ML-based attacks in Section 2.3 steal user activities information from operating systems. According to the data sources, there are two kinds of attacks — those using kernel data and those using sensor data. Kernel data reveals system-level behaviors of the target system, while sensor data reflects the system's reactions to the specific functionality used by users [10]. The kernel data is analyzed by the adversary in the time dimension, while the sensor data is exploited with frequency analysis.
Countermeasures: Regarding protection mechanisms, differential privacy is an important defense against attacks stealing user activities information; in [48, 106], for example, noise was applied to an accessible data source (such as Android kernel log files). Another kind of solution is to restrict access to the accessible data [37]. It is also effective to build a model to detect potential stealing threats, as in [49]. Further research on protecting user activities information can explore applying differential privacy or designing a management system for kernel files and sensor data. Noise injection and access restriction are two effective protections, and detection can alert users to the stealing attack.

2.4 Stealing Authentication Information


Authentication information is one of the most important security factors when accessing information from services or mobile applications. In Section 2.4, the controlled authentication information mainly comprises keystroke data, secret keys, and password data. As shown in Fig. 2.5 and Fig. 2.6, classification models or probabilistic models are trained to steal the controlled authentication information. Protections against attacks stealing controlled authentication information are also summarized in this subsection.

2.4.1 Stealing controlled keystroke data for authentication


The dataset collected from a device’s sensor information can be used to infer the controlled keystroke
information as depicted in Fig. 2.5. The keystroke data contains the information about user authentication
data, especially for keystroke authentications [107, 108, 109, 110]. Leveraging the acceleration, acoustic

27
Figure 2.5: The ML-based stealing attack against authentication information — keystroke information and secret keys. After reconnoitering and querying, attackers targeting keystroke information and secret keys interact with the target system to collect data, which is referred to as active collection. Attacks involving active collection share a similar workflow to that depicted in Fig. 2.4.

and video information, we review the attacks stealing such keystroke information and the corresponding countermeasures.

Table 2.8: Stealing Controlled Keystroke Data for Authentication

Paper | Dataset for Experiment | Description | Feature Engineering | ML-based Attack Method
[35] | Acceleration data set | Consecutive vectors with 26 labels | FFT & IFFT filter; movement capturing; optimization with change direction | Random Forest; k-NN; SVM; NN
[34] | Video recordings set | Image resolution and frame rate | Extract from selected AOIs’ motion signals for motion patterns | multi-class SVM

Keystroke Inference Attack: Several types of sensor information can be utilized to steal keystroke authentication information when targeting keyboard inputs. In the reconnaissance process, [35] found that sensor data from the accelerometer and microphone in a smartwatch was related to user keystrokes. Since the smartwatch was worn on the user’s wrist, the accelerometer data reflected the user’s hand movement. Therefore, the user’s inputs on keyboards can be inferred. The authors presented the corresponding practical attack based on this finding. Adversaries collected the accelerometer and microphone data while keystrokes were entered on a keyboard. By leveraging the acceleration and acoustic information, adversaries were able to distinguish the content of the typed messages. In addition, two kinds of keyboards were targeted: the numeric keypad of a POS terminal and a QWERTY keyboard. The datasets of sensor information were collected for the inference attack.
During the feature engineering phase, adversaries manually defined the x-axis and y-axis as two movement features of the acceleration data, and frequency features were extracted from the acoustic data. The FFT was then employed to filter out linear noise and high-frequency noise. Applying ML strategies to the attack, keystroke inference models were set up to reduce the impact of the noise within the sensor data [35]. Specifically, a modified k-NN algorithm combined with an optimization scoring algorithm was applied to enhance the accuracy of the inference attack. Thereafter, the typed information on these two keyboards was leaked, including users’ banking PINs and English texts. The attack thus inferred keystroke information containing authentication information.
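The sketch below illustrates this style of inference pipeline on synthetic accelerometer windows: FFT magnitude features with a crude low-pass filter, followed by a plain k-NN classifier. The modified k-NN with optimization scoring used in [35], as well as the real sensor data, are not reproduced here; the window generator and its parameters are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
KEYS, WIN = 10, 128                       # 10 keypad digits, 128-sample windows

def fake_window(key):
    """Synthetic acceleration burst whose dominant frequency depends on the key."""
    t = np.arange(WIN)
    return np.sin(2 * np.pi * (0.02 + 0.01 * key) * t) + 0.3 * rng.normal(size=WIN)

def features(window):
    spectrum = np.abs(np.fft.rfft(window))
    spectrum[40:] = 0.0                   # crude low-pass: drop high-frequency noise
    return spectrum

X = np.array([features(fake_window(k)) for k in range(KEYS) for _ in range(60)])
y = np.repeat(np.arange(KEYS), 60)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("keystroke inference accuracy:", clf.score(X_te, y_te))
```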
Regarding the evaluation, the results showed that the keystroke inference attack on the numeric keypad had 65% accuracy in leaking banking PINs among the top 3 candidates [35]. Unlike previous work on decoding PINs [111, 112, 113], any device containing a POS terminal could be compromised by this attack. For the attack targeting QWERTY keypads, compared to previous work [114, 115], a notable improvement was achieved in finding the correct word, with accuracy improved by 50% and strong tolerance to acoustic noise. In the end, several mitigation solutions against the keystroke inference attack were provided [35]: restricting access to accelerometer data; limiting the acoustic emanation; and managing the permissions for accessing the sensors dynamically according to the context. The attack inferred the keystroke information accurately and can be mitigated with these restrictions.
Video-Assisted Keystroke Inference Attack: Apart from the accelerometer and microphone, video recordings are another kind of sensor information that attackers can use to infer keystroke authentication information. An attack named VISIBLE, presented by [34], leaked the user’s typed inputs by leveraging stealthy video recordings of the backside of a tablet. The attack scenario assumed that the targeted tablet was placed on a tablet holder, and two types of soft keyboards were used for input: alphabetical and PIN keyboards. The dataset for the ML-based attack contained video of the backside motion of a tablet during the text typing process. In the feature engineering process, areas of interest (AOIs) were selected and decomposed. The tablet motion was analyzed with its amplitude quantified. Then, features were extracted from the temporal and spatial domains. As the ML-based attack, a multi-class SVM was applied to classify various motion patterns and infer the input keystrokes. VISIBLE exploited dictionaries and linguistic relationships to refine the inference results. The experiments showed that the accuracy of VISIBLE in leaking single keys, words, and sentences significantly outperformed random guessing. In particular, the average accuracy of the aforementioned inference attacks was above 80% for the alphabetical keyboard and 68% for the PIN keyboard. The countermeasures for this attack include providing no useful information to the video camera, randomizing the keyboard layout, and adding noise when accessing the video camera. The attack leveraging video information can also infer keystroke authentication information very accurately.
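The classification step of such an attack can be sketched as a multi-class SVM over per-keystroke motion features; the feature vectors below are random placeholders standing in for the temporal and spatial AOI features of VISIBLE.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Placeholder motion features: one 20-dimensional vector per observed keystroke,
# with a class-dependent offset standing in for the real temporal/spatial features.
keys = rng.integers(0, 26, size=500)
X = rng.normal(size=(500, 20)) + keys[:, None] * 0.15

clf = SVC(kernel="rbf", decision_function_shape="ovr").fit(X[:400], keys[:400])
print("held-out accuracy:", clf.score(X[400:], keys[400:]))
```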

2.4.2 Stealing controlled secret keys for authentication


Secret keys are used to encrypt and decrypt sensitive messages [116, 117, 118]. Reconstructing the cryptographic keys means, in some cases, that a host is authenticated to read the message [119, 120, 121]. However, an adversary has the ability to deduce sensitive information such as cryptographic keys by observing the changes in the state of shared caches [30]. In this part, the attack stealing controlled secret-key information by analyzing the state of a targeted cache set is surveyed.

Table 2.9: Stealing Controlled Secret Keys for Authentication (Information: info)

Reference | Dataset for Experiment | Description | Feature Engineering | ML-based Attack Method
[30] | 300 observed TLB latencies | Collect from TLB signals | Encode info using a normalized latencies vector | SVM classifier
[40] | 500,000 Prime-Probe trials | N/A | Number of absent cache lines + cache lines available | NB classifier

Stealing secret keys with TLB Cache Data: By abusing hardware translation look-aside buffers (TLBs), an adversary can reveal secret-key information through analyzing TLB information [30]. The targeted fine-grained information about user memory activities (i.e., cryptographic keys) was assumed to be safeguarded against controlled channels such as cache side channels [122, 123]. During the reconnaissance, the adversary, as a legitimate user, accesses the shared TLBs, which reflect the victim’s fine-grained memory activity. In detail, the victim’s TLB records could be accessed by other users through CPU affinity system calls or by sharing the same virtual machine. Adversaries reverse-engineered the unknown addressing functions mapping virtual addresses to different TLB sets, in order to relate CPU activities to the TLB sets.
Figure 2.6: The ML-based stealing attack against authentication information — password data. To infer the password, attackers reconnoiter and collect online information via passive collection. During the feature engineering phase, different segments are extracted from the collected data. A semantic classifier is trained using probabilistic algorithms. After testing this classifier, various passwords can be constructed as outputs with semantic generalization.

To design the data collection, the adversary monitored the states of shared TLB sets, which indicate whether functions were missed or performed by the victim. Without privileged access to properties of the TLB information (i.e., TLB shootdown interrupts), adversaries timed accesses to the TLB set and measured the memory access latency, which indicates the state of a TLB set. By triggering the targeted activities with a set of function statements, the adversary accessed the TLB data shared with the victim and collected the corresponding temporal information as a training dataset. The label, in this case, was the state of the function written in the statement. Datasets of TLB state information were thus prepared for the stealing attack.
For the feature engineering, features were extracted from TLB temporal signals by encoding the information as a vector of normalized latencies. Additionally, ML algorithms were adopted to distinguish the targeted TLB set by analyzing memory activity. Specifically, with high-resolution temporal features extracted to represent the activity, an SVM classifier was built to distinguish access to the targeted set from access to other arbitrary sets. In the experiment of [30], the training set contained 2,928 TLB latency samples across three different sets. The end-to-end TLBleed attack on libgcrypt captured the changes of the target TLB set, extracted the feature signatures, and reconstructed the private keys. During the evaluation phase, TLBleed reconstructed the private key at an average success rate of 97%. In particular, a 256-bit EdDSA secret key was leaked by TLBleed with a success rate of 98%, while RSA keys were reconstructed with 92% success. Potential mitigations against the TLBleed attack were discussed in [30], including executing sensitive processes in isolation on a core, partitioning TLB sets among distrusting processes, and extending hardware transactional memory features. Hence, secret cryptographic keys can be reconstructed by distinguishing the targeted TLB set.
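The sketch below mimics this classification step on synthetic data: vectors of normalized access latencies are labelled by whether they come from the targeted TLB set and fed to an SVM. The latency generator and its parameters are assumptions for illustration; the real TLBleed attack relies on cycle-accurate timing and reverse-engineered TLB set mappings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

def latency_vector(targeted: bool, length=64):
    """Synthetic stand-in for a vector of normalized TLB access latencies.

    Accesses to the targeted set are assumed to be occasionally slower when the
    victim has just evicted the attacker's entries.
    """
    base = rng.normal(loc=1.0, scale=0.05, size=length)
    if targeted:
        base += rng.binomial(1, 0.3, size=length) * 0.25   # occasional misses
    return base / base.max()                               # normalize, as in [30]

X = np.array([latency_vector(t) for t in ([True] * 300 + [False] * 300)])
y = np.array([1] * 300 + [0] * 300)
print("CV accuracy:", cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
```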

2.4.3 Stealing controlled password data for authentication


Passwords are considered among the most sensitive pieces of user information, and their leakage raises serious security concerns. Most of the information useful to the stealing attack is collected passively from network services, as illustrated in Fig. 2.6. The password guessing attack has been studied by analyzing password patterns with ML techniques. The protection mechanism likewise analyzes password patterns, guiding users to set up strong passwords.
Online Password Guessing Attack: The online password guessing problem and a framework named TarGuess, which systematically models targeted online guessing scenarios, were introduced by [31]. Since attackers perform an online password guessing attack based on the victim’s personal information, systematically summarizing all possible attack scenarios helps analysts understand the security threats. The architecture of TarGuess comprises three phases: the preparing phase to determine the targeted victim and build up their password profile, the training phase to generate the guessing model,
Table 2.10: Stealing Controlled Password Data for Authentication

Reference | Dataset for Experiment | Description | Feature Engineering | ML-based Attack Method
[31] | Dodonew, CSDN, 126, Rockyou, 000webhost, Yahoo, 12306, Rootkit, Hotel, 51job | 16,258,891 (6,428,277) leaked passwords; 6,392,568 (32,581,870) leaked passwords; 15,251,073 (442,834) leaked passwords; 6,392,568 leaked passwords + 129,303 PII; 69,418 leaked passwords + 69,324 PII; 20,051,426 PII; 2,327,571 PII | N/A | PCFG-based algorithm [124]; Markov-based algorithm [125]; LD algorithm
[33] | RockYou | 32,581,870 leaked passwords | Segmented with NLP | PCFG-based algorithm
[32] | PGS training set [126], 1class8, 1class16 [127], 3class12 [128], 4class8 [129], webhost [130] | 33 million passwords; 3,062 (2,054) leaked passwords; 990 (990) leaked passwords; 30,000 leaked passwords | N/A | PCFG-based algorithm [131]; Markov models [125]; NN

and the guessing phase to perform the guessing attack. During the reconnaissance phase, according to the diversity of people’s password choices, three kinds of information are beneficial in online guessing attacks: PII such as name and birthday, site information such as service type, and leaked password information such as sister passwords and popular passwords. In particular, PII can be divided into two types: Type-1 PII, which is used to build part of the password (e.g., a birthday), and Type-2 PII, which reflects user behavior in setting passwords (e.g., language [129]). Some leaked passwords were reused by the user. During the data collection phase, the datasets were assembled from multiple types of PII and leaked passwords. To be more specific, the four TarGuess variants are TarGuess-I based on Type-1 PII, TarGuess-II based on leaked passwords, TarGuess-III based on leaked passwords and Type-1 PII, and TarGuess-IV based on leaked passwords and PII. Datasets for password guessing attacks were thus prepared.
After the datasets were collected with their initial features, the attackers adopted probabilistic guessing algorithms, including PCFG, Markov n-gram, and Bayesian theory, to train the four TarGuess models to infer passwords [31]. The accuracy of these four guessing algorithms was evaluated under a limited number of online guesses. Compared with [132], the TarGuess-I algorithm inferred 37.11% to 73.33% more passwords successfully within 10 to $10^3$ guesses. It also significantly outperformed three trawling online guessing algorithms [133, 125, 124] by cracking at least 412% to 740% more passwords. Compared to [134] with an 8.98% success rate, TarGuess-II achieved 20.19% within 100 guesses. As for TarGuess-III, no prior research could be compared against, and it achieved a 23.48% success rate within 100 guesses. Concerning TarGuess-IV, the improvements in accuracy were between 4.38% and 18.19% compared to TarGuess-III. Through modeling guessing attack scenarios, [31] revealed a serious security concern about online password leakage with effective guessing algorithms.
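As a toy illustration of the Markov-based component of such guessing attacks, the sketch below trains a character-level 2-gram model on a tiny hard-coded list standing in for leaked passwords and uses it to rank candidate guesses; the smoothing and alphabet size are arbitrary choices, not those of [31].

```python
from collections import Counter, defaultdict

# Toy stand-in for a leaked-password corpus.
leaked = ["password1", "iloveyou", "qwerty123", "dragon88", "monkey12"]

counts = defaultdict(Counter)
for pw in leaked:
    chars = "^" + pw + "$"                     # start/end markers
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def score(candidate: str) -> float:
    """Probability of a candidate under the 2-gram model (add-one smoothing)."""
    p, chars = 1.0, "^" + candidate + "$"
    for a, b in zip(chars, chars[1:]):
        total = sum(counts[a].values()) + 95   # ~95 printable characters assumed
        p *= (counts[a][b] + 1) / total
    return p

guesses = ["password2", "zxqvjk", "iloveyou1"]
for g in sorted(guesses, key=score, reverse=True):
    print(f"{g}: {score(g):.3e}")
```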
Password Guessing with Semantic Pattern Analysis: An attempt was made in [31] to formalize password guess lists for a targeted user, and a similar attempt was made in [33] to find general password patterns. The framework presented by [33] builds semantic patterns of users’ passwords in order to understand their password security. The security impacts of users’ preferences in password creation were identified. For better reconnaissance, the passwords were analyzed by breaking them into two conceptually consistent parts covering semantic and syntactic patterns. Since a password consists of a combination of word and/or gap segments, the attacker intends to understand these patterns by inferring the password’s meanings and syntactic functions. By understanding how well the semantic pattern characterizes the password, plenty of password guesses can be learned for attacks. When those attacks succeed with some guesses, the true passwords are learned. The attack thus formalized passwords with semantic and syntactic patterns.
The password datasets can be collected from password leaks such as the RockYou password list. Firstly, NLP methods were used for password segmentation and semantic classification. Segmentation is the fundamental step for processing passwords in various forms. The source corpora were collections of raw words serving as segmentation candidates, whereas the reference corpora contained part-of-speech (POS) information. Specifically, the POS was tagged with the Natural Language Toolkit (NLTK) [135] based on the Corpus of Contemporary American English. With N-gram probabilities representing the frequency of use, the tagged POS was used to select the most suitable segmentation for each password. After processing the password dataset, the NLP algorithm was used to classify the segments of input passwords into semantic categories. Secondly, a semantic guess generator was built with the PCFG algorithm. Since the syntactic functions of a password are structural relationships among semantic classes, the PCFG algorithm was employed to model the password’s syntactic and semantic patterns. In detail, this model learned the password grammar from the dataset, generated guesses as sentences of a language [136] with different constructs, and encoded the probabilities of these constructs as output. To learn any true passwords, the semantic guess generator sorted the outputs according to the probability of the password cracking attack. The generator thus produced a guessing list based on the semantic and syntactic patterns.
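A toy version of the segmentation and pattern-extraction steps is sketched below. A small hard-coded vocabulary and category map replace the source and reference corpora and the NLTK-based POS tagging of [33]; a full PCFG would then assign probabilities to the resulting patterns.

```python
# Toy sketch of password segmentation and semantic-pattern extraction.
VOCAB = {"i": "PRONOUN", "love": "VERB", "you": "PRONOUN",
         "bear": "NOUN", "angel": "NOUN", "super": "ADJ"}

def segment(password: str):
    """Greedy longest-match segmentation into known words and leftover gaps."""
    segments, i = [], 0
    while i < len(password):
        for j in range(len(password), i, -1):
            if password[i:j].lower() in VOCAB:
                segments.append(password[i:j]); i = j; break
        else:
            segments.append(password[i]); i += 1
    return segments

def pattern(segments):
    """Map each segment to a coarse semantic/syntactic category."""
    return [VOCAB.get(s.lower(), "DIGIT" if s.isdigit() else "SYMBOL") for s in segments]

for pw in ["iloveyou2", "superbear!", "angel1987"]:
    segs = segment(pw)
    print(pw, "->", segs, "->", pattern(segs))
```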
To assess the advantage of the semantic guess generator, its success rate was compared to that of a previous offline guessing approach, the Weir approach [124]. Within 3 billion guesses, the semantic approach cracked 67% more passwords than the Weir approach on the LinkedIn leak [33]. On the MySpace leak, it outperformed the Weir approach by inferring 32% more passwords.
Summary: According to the different forms of authentication, ML-based stealing attacks target users’ keystroke authentication, secret keys, and passwords. As shown in Fig. 2.5 and Fig. 2.6, attackers steal users’ passwords by exploiting useful information collected online. For the other two objectives, they exploit information based on users’ activities recorded by the operating system (i.e., TLB/CPU cache data). Additionally, password guessing attacks use probabilistic methods to construct a password with the least number of guesses. The attacks on the remaining two targets can be framed as classification tasks over keystroke patterns and cache set states.
Countermeasures: From the security perspective, two types of countermeasures are introduced: access restriction and attack detection. The secret keys, for example, can be protected by managing access to the related cache data [40]. The analysis of password guessability [32] can secure a user’s account by encouraging a strong password, and weak passwords are screened out by detection. A future direction is to improve the effectiveness of guessing-model prediction, which is limited by the sparsity of training samples [32]. Defenses against keystroke inference have not been well developed; future work may explore securing access to the related sensor data.

2.4.4 Summary
To understand the information leakage threat and the stealing attack comprehensively, an outline of
relevant high-quality papers from 2014 to 2019 is provided and summarized in Table 2.11 from four
perspectives including the attack, protection, related ML techniques, and evaluation.

Table 2.11: Summary of reviewed papers from attack, protection, related ML techniques they utilized, and the evaluation metrics.

Reference | Attack | Protection | Related ML Techniques | Evaluation
[10] | Unlock pattern & foreground app inference attack | Restrict access to kernel resources; Decrease the resolution of interrupt data | HMM with Viterbi algorithm; k-NN classifier with DTW | Success rate; Time & battery consumption
[36] | Leaking specific events attack | Restrict access to kernel resources; App Guardian [137, 106] | k-NN classifier with DTW; Multi-class SVM with DTW | Accuracy; Precision; Recall; Battery consumption
[48] | Keystroke timing attack; website inference attack | Design d∗-private mechanism | Multi-class SVM classifier | Accuracy; Relative AccE
[37] | Stealing user activities; Stealing in-app activities | Eliminate the attack vectors; Rate limiting; Runtime detection [106]; Coarse-grained return values; Privacy-preserving statistics report [48]; Remove the timing channel | SVM classifier; k-NN classifier with DTW | Accuracy; Execution time; Power consumption
[47] | Stealing product’s design | Obfuscate the acoustic emissions | A regression model | Accuracy
[49] | Information leakage via a sensor; Stealing information via a sensor | The contextual model detects malicious behavior of sensors | Markov Chain; NB; Alternative set of ML algorithms (e.g. PART) | Accuracy; FNR; F-measure; FPR; Recall; Precision; Power consumption
[12] | Model extraction attack | Rounding confidences [18]; Differential privacy (DP) [138, 50, 139, 140]; Ensemble methods [16] | Logistic regression; Decision tree; SVM; Three-layer NN | Test error; Uniform error; Extraction accuracy
[41] | Model extraction attack | Gradient masking [56] and defensive distillation [141] for a robust model | DNN; SVM; k-NN; Decision Tree; Logistic regression | Success rate
[29] | Hyperparameters stealing attack | Cross entropy and square hinge loss instead of regular hinge loss | Regression algorithms; NN; Logistic regression; SVM | Relative EE; Relative MSE; Relative AccE
[42] | Hyperparameters stealing attack | N/A | Metamodel methods | Accuracy
[18] | Model inversion attack | Incorporate inversion metrics in training; Degrade the quality/precision of the model’s gradient information | Decision Tree; Regression model | Accuracy; Precision; Recall
[39] | The GAN attack stealing users’ training data | N/A | CNN with GAN | Accuracy
[38] | Membership inference attack | Restrict class in the prediction vector; Coarsen precision; Increase entropy of the prediction vector [83]; Regularization | NN | Accuracy; Precision; Recall
[43] | Membership inference attack | Dropout; Model Stacking | Logistic regression; Random Forest; Multilayer perceptron | Precision; Recall; AUC
[51] | Property inference attack | Multiply the weights and bias of each neuron; Add noise; Encode arbitrary data | NN | Accuracy; Precision; Recall
[44] | Property inference attack | Share fewer gradients; Reduce input dimension; Dropout; user-level DP | Logistic regression; Gradient boosting; Random Forests | Precision; Recall; AUC
[77] | Membership inference attack | Protect with adversarial regularization | NN | Accuracy
[87] | N/A | PATE: transfer knowledge from an ensemble model to a student model | Semi-supervised learning; GAN | Accuracy
[88] | N/A | Protect stored training data with count-based featurization | NN; Gradient boosted tree; Logistic regression; Linear regression | Average logistic squared loss
[35] | Keystroke inference attack | Restrict access to accelerometer data; Limit acoustic emanation; Dynamic permission management based on context | Random Forest; k-NN; SVM; NN | Success rate
[34] | Typed input inference attack | Design a featureless cover; Randomize the keyboards’ layouts; Add noise | multi-class SVM | Accuracy
[30] | TLBleed attack infers secret keys | Protect in hardware [122, 123] | SVM classifier | Success rate
[40] | ML-based prime-probe attack infers secret keys | CacheBar manages memory page cacheability | NB classifier | Accuracy; Execution time
[31] | Password guessing attack | N/A | PCFG algorithm; Markov model; Bayesian theory | Success rate
[33] | Password guessing attack | N/A | PCFG-based algorithm | Success rate
[37] | Password guessing attack | Mitigate the threat by modeling password guessability | PCFG-based algorithm; Markov models; NN | Accuracy

Chapter 3

The Audio Auditor: User-Level Membership Inference with Black-Box Access

Voice interfaces and assistants implemented by various services have become increasingly sophisticated,
powered by increased availability of data. However, users’ audio data needs to be guarded while
enforcing data-protection regulations, such as the General Data Protection Regulations (GDPR) law and
the Children’s Online Privacy Protection Act (COPPA) law [142, 143]. To check the unauthorized use of
audio data, we propose an audio auditor for users to audit speech recognition models. Specifically, users
can check whether their audio recordings were used as a member of the model’s training dataset or not. In
this chapter, we focus our work on a DNN-HMM-based automatic speech recognition model over the
TIMIT audio data. As a proof-of-concept, the success rate of participant-level membership inference can
reach up to 90% with eight audio samples per user, resulting in an audio auditor.

3.1 Introduction
The automatic speech recognition (ASR) system is widely adopted on Internet of Things (IoT) devices
[144, 145]. The IoT voice services competition among Apple, Microsoft, and Amazon is continuously
heating up the smart speaker market [146]. In parallel, customers are increasingly aware of and concerned about the privacy of ASR systems and unauthorized access to their audio. Privacy policies and regulations, such as the GDPR [142] and the COPPA [143], have been enforced to regulate personal data processing. Specifically, the Right to be Forgotten [147] allows customers to prevent third-party voice services from continuously using their data [148]. However, the murky boundary between privacy and security can thwart IoT’s trustworthiness [149, 150], and many IoT devices attempt to sniff and analyze the audio captured in real time without the user’s consent [151]. Most recently, on WeChat, an enormously popular messaging platform within China and worldwide, a scammer disguised their voice to sound like an acquaintance by spoofing his or her voice [152]. Therefore, it is important to develop techniques that enable auditing the use of customers’ audio data in ASR models.
In this chapter, we designed and evaluated an audio auditor to help users determine whether their audio
data had been used without authorization to train an ASR model. The targeted ASR model used in this

chapter is a DNN-HMM-based speech-to-text model. With an audio signal input, this model transcribes
speech into written text. The auditor audits this target model with the intent to infer participant-level membership. The target model will behave differently depending on whether it is transcribing audio from within its training set or audio from other datasets. Thus, one can analyze the transcriptions and use the outputs to train a binary classifier as the auditor. As our primary focus is to infer participant-level
membership, speaker-related information is filtered out while analyzing the transcription outputs (see
details in Section 3.3).
Our work is the first attempt to audit, using the membership inference method, whether ASR models still remember customers’ audio data used without consent. User-level membership inference on textual data has been studied recently [153]. However, in this work, we target an ASR model, instead of a
text-generation model. The time-series audio data is significantly more complex than the textual data,
causing feature patterns to be greatly varied [154]. Furthermore, current IoT applications demonstrate
significantly higher security and privacy impacts than most verbal applications in learning tasks [155, 156].
In doing so, firstly, we assume a different auditing scenario. To reproduce a target model close to ASR
systems in practice, we use multi-task learning, which includes audio feature extraction, DNN learning,
HMM learning, and an n-gram language model with natural language processing. Secondly, the auditor
has black-box access to the target model which only outputs one final transcription result. Additionally,
the auditor can audit the model by simultaneously providing multiple audio inputs supplied from the same
user, instead of just one. Thirdly, we extract a different set of features from the model’s outputs. Instead
of using the rank lists of several top output results, we only use one text output with the highest posterior
and the length of input audio frames.
Our user-level membership auditing method achieves high performance on the TIMIT dataset. The
auditing accuracy results reach over 90% while the F1-score reaches 95% when 125 speaker records are
used to train the auditor model. Even when training with 25 users, the resulting accuracy is approximately
85%. The auditor is also effective in auditing ASR models with different numbers of audio queries from the same individual. When the speaker audits the target model with more than one audio sample (with a single audio sample, the membership inference success rate approaches random guessing), the success rate is significantly boosted, reaching up to 90% with eight audio samples per user.

3.2 Background
3.2.1 The Automatic Speech Recognition Model

Figure 3.1: An advanced ASR system.

The DNN-HMM-based acoustic model is popular in the current automatic speech recognition (ASR)

system [157]. As defined by [158], the ASR system contains a preprocessing step, model training step,
and decoding step as displayed in Figure 3.1. The preprocessing step performs the feature processing
and labeling for an audio input. In this chapter, the audio frame is processed using the Discrete Fourier
Transform (DFT) to extract information from the frequency domain, namely Mel-Frequency Cepstral
Coefficients (MFCCs) as features. Forced alignment is applied on the raw audio inputs to extract the
text label which is processed and used in training our acoustic model. We train an acoustic model on a
DNN. The acoustic model outputs posterior probabilities for all HMM states which are processed in the
decoding step, which maps posterior probabilities to a sequence of text. The language model contained within the decoder provides a language probability which the decoder uses to re-evaluate the acoustic score towards the most plausible word sequence [159]. The final transcription text is the sequence with the highest score.
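For illustration, the snippet below shows what the MFCC preprocessing step produces for a single (here synthetic) waveform. It uses torchaudio as a rough stand-in for the Kaldi front end used in this chapter, and the frame parameters are typical values rather than the exact configuration.

```python
import torch
import torchaudio

# Rough stand-in for the Kaldi MFCC front end: a synthetic 1-second waveform
# is transformed into per-frame MFCC features.
sample_rate = 16_000
waveform = torch.randn(1, sample_rate)          # placeholder for a real recording

mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=13,                                  # 13 cepstral coefficients per frame
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 23},
)
features = mfcc(waveform)                       # shape: (1, 13, num_frames)
print(features.shape)
```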

3.2.2 Deep Learning for Acoustic Models


Deep learning methods are used to build acoustic models for tasks such as speech transcription [160], word spotting or triggering [161], and speaker identification or verification [162]. With supervised learning, a neural network can be trained as a classifier using a softmax across the phonetic units. A feature stream of audio is the input of the network, while the output is a posterior probability for the predicted phonetic states. Subsequently, these output representations are decoded by the HMM-based decoder and mapped to possible sequences of phonetic texts with different probabilities.
Multilayer Perceptron (MLP) is one of the DNN algorithms used in this work. Assume that the MLP is a stack of $L$ layers of logistic regression models, where $f_l(\cdot)$ represents the activation function in the $l$-th layer. Given an input $z^l \in \mathbb{R}^{m^l}$, where $m^l$ is the number of neurons in the $l$-th layer, this layer's output $out^l$ can be formalized as:
$$out^l = f_l(z^l) = f_l(W^l \cdot out^{l-1} + b^l), \tag{3.1}$$
where $W^l \in \mathbb{R}^{m^l \times m^{l-1}}$ represents the weight matrix, and $b^l$ is the bias from the $(l-1)$-th to the $l$-th layer.
Specifically, we applied the sigmoid function in the hidden layers and used the softmax activation function
for the final output layer. As for the loss function, the MLP uses the cross-entropy. Moreover, the MLP
tunes the parameters using the error back-propagation procedure (BP) and the stochastic gradient descent
method.
In the case of building the ASR system with DNN-HMM algorithms, the posterior probability output of the output layer can be expressed as $\{P(\rho_1|o_t), \ldots, P(\rho_k|o_t)\}$, where $k$ is the total number of phonemes, corresponding to the number of output nodes of the $L$-th layer. This is the set of posterior probabilities of each phoneme for the $t$-th time frame $o_t$ of the audio input. The posterior probability of each phoneme (i.e., $P(\rho_k|o_t)$) is transferred to and processed by the HMM-based decoder:
$$b_j = P(o_t|s_j) = \sum_{i=1}^{I} c_{ji}\, \frac{P(\rho_i|o_t)}{P(\rho_i)}. \tag{3.2}$$
In Equation 3.2, $b_j$ is the probability of the phonemes in time frame $o_t$ mapping to the $j$-th HMM state $s_j$, based on continuous Probability Density Functions (PDFs) [163]. Herein, $I$ is a fixed number of PDFs, and $c$ is the weight of each phoneme.
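A small numeric illustration of Equation (3.2) is given below; the posterior, prior, and weight values are made up for a single HMM state with $I = 3$ tied PDFs.

```python
import numpy as np

def hmm_state_likelihood(posteriors, priors, weights):
    """Pseudo-likelihood b_j = sum_i c_ji * P(rho_i | o_t) / P(rho_i), as in Eq. (3.2)."""
    return np.sum(weights * posteriors / priors)

# Made-up values for I = 3 PDFs tied to one HMM state s_j.
posteriors = np.array([0.70, 0.20, 0.10])   # DNN outputs P(rho_i | o_t)
priors     = np.array([0.30, 0.40, 0.30])   # phoneme priors P(rho_i)
weights    = np.array([0.50, 0.30, 0.20])   # mixture weights c_ji
print(hmm_state_likelihood(posteriors, priors, weights))
```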

3.2.3 Membership Inference Attack
Membership inference attack aims to determine whether a specific data sample is within the training set
by training a series of shadow models constituting the attack model [38]. The attack model intends to
learn from the differences in the target model’s output by feeding in pristine or bogus training data. In this
chapter, we adapt the membership inference attack for the task of audio auditing. Specifically, instead of
inferring the record-level membership, we aim to infer the participant-level membership. That is, we focus
on whether a particular user had unwillingly contributed data to train an ASR model. Our work differs
from another user-level membership audit [153], as the features extracted from the outputs of the developed ASR models are three pieces of audio-related information, namely the transcription text, the text probability, and the frame length, rather than words’ rank lists.

3.3 Auditing the ASR Model

Figure 3.2: Auditing an ASR model.

In this section, we first formalize our objective for auditing automatic speech recognition models.
Secondly, we present how an audio auditor can be constructed. Finally, we outline how the auditor is used
for auditing the target model.

3.3.1 Problem Definition


As shown in Figure 3.1, we describe the workflow of audio transcription using an ASR system. By
querying an ASR system with an audio sample of a recorded speech, the speech recognition model outputs
pseudo-posterior probabilities for all context-dependent phonetic units. During the decoding step, the
probabilities are used to infer the most probable text sequence.
Suppose there is a group of audio recordings Dtar from a set of individuals Utar. Our target model is a speech recognition model denoted as ftar, which is trained on Dtar using a learning algorithm Altar.

For a specific user u, our objective is to find out whether this user is in the target model's training set, such that u ∈ Utar. The participant-level membership inference against ftar requires an auxiliary reference dataset Dref to build the audio auditor. Specifically, Dref is used to train several shadow models fshd which approximately simulate the target model ftar. We denote by Uref the set of all users in Dref. By querying fshd, the transcription outputs are properly labeled depending on whether the audio speaker belongs to Uref or not.
Finally, the auditor's knowledge of and access to the target model are specified in the threat model below.
Threat Model. We assume that our auditor only has black-box access to the target model. Given an input audio recording, the auditor can only obtain the text transcription and its probability as outputs. Neither the training data nor the training parameters and hyper-parameters of the target model are known to the auditor. The state-of-the-art algorithms for typical DNN-HMM ASR systems are well known and standard [158, 164, 165]. We hereby assume that our auditor knows the learning algorithms used in the ASR system, including feature extraction, the training algorithm, and the decoding algorithm. Given recent research on model stealing [29, 12], which extracts network parameters by querying the output, it is reasonable to grant the auditor black-box access to the Machine Learning as a Service (i.e., the ASR model).

3.3.2 Overview of the Audio Auditor


The nature of membership inference [38] is to learn the difference between a model fed with its actual training samples and one fed with other samples. Thus, to audit whether an ASR model has been trained with a user's audio data or not, the auditor's task can be framed as inferring this user's membership in this ASR model's training dataset. The audio auditor's training and auditing processes are depicted in Figure 3.2. We assume that our target model's dataset Dtar is disjoint from the auxiliary reference dataset Dref (Dtar ∩ Dref = ∅). In addition, Uref and Utar are also disjoint (Utar ∩ Uref = ∅).
The primary task to train an audio auditor is to build up several shadow models to infer the targeted
ASR model’s decision boundary. We assume all learning algorithms Altar are known to the auditor;
therefore, the learning algorithms for the shadow model are known accordingly (Alshd = Altar ). Different
from the target model, we have full knowledge of the shadow models' ground truth. For a user u querying the model with her audio samples, if $u \in D_{shd_i}^{train}$, we collapse the features extracted from these samples' results into one record and label it as “member”; otherwise, “nonmember”. Taking all of these processed, labeled records together, a training dataset is formed to train a binary classifier as the audit model using a supervised learning algorithm. As also evidenced in [38], the more shadow models are built, the more accurately the audit model performs.
As shown in Figure 3.2, n datasets are sampled from Dref to train n shadow models with Alshd. The testing set and a subset of the training set are used to query each shadow model. The query outputs are preprocessed as follows. For participant-level membership, the user's pertinent characteristics are extracted from each output, including the transcription text (denoted as TXT), the posterior probability (denoted as Probability), and the audio frame length (denoted as Frame Length). The features of the auditor's training set are written as: {TXT1=type(string), Probability1=type(float), Frame_Length1=type(integer), . . . , TXTn=type(string), Probabilityn=type(float), Frame_Lengthn=type(integer), class}, where n is the number of audios belonging to a speaker. To process categorical features, such as the TXT features, we map the text to integers using a label encoder [166]. The built auditor determines whether u ∈ Utar or not from the processed outputs. Exploring alternative preprocessing methods, such as a one-hot encoder, will be an avenue for future research.
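The sketch below shows, on toy values, how one speaker's query outputs could be assembled into a single auditor record: the TXT features are label-encoded, the probability and frame length are kept as numeric features, and missing audios are zero-padded up to n. In practice the label encoder would be fitted over all shadow-model transcriptions rather than over one speaker's outputs.

```python
from sklearn.preprocessing import LabelEncoder

N_AUDIOS = 8   # maximum number of audios per speaker in the auditor's feature set

# Toy query outputs for one speaker: (transcription, posterior probability, frame length).
outputs = [("she had your dark suit", 0.83, 291),
           ("ask me to carry an oily rag", 0.79, 343)]

# Toy encoder fitted on this speaker's transcriptions only (see note above).
encoder = LabelEncoder().fit([txt for txt, _, _ in outputs])

record = []
for txt, prob, frames in outputs:
    record += [int(encoder.transform([txt])[0]), prob, frames]
record += [0, 0.0, 0] * (N_AUDIOS - len(outputs))    # zero-pad missing audios
record.append("member")                              # ground-truth label from the shadow model
print(record)
```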
To build up n shadow models (see Figure 3.2), we sample n datasets from the auxiliary reference dataset Dref as $D_{shd_1}, \ldots, D_{shd_n}$, $n > 1$. Further, we split each shadow model dataset $D_{shd_i}$, $i = 1, \ldots, n$, into a training set $D_{shd_i}^{train}$ and a testing set $D_{shd_i}^{test}$. $D_{shd_i}^{train}$ is used to train the shadow model with Alshd, while $D_{shd_i}^{test}$ is used to evaluate its performance. To generate the training set with ground truth for the audit model, we query the shadow model with $D_{shd_i}^{query}$, obtaining all samples in $D_{shd_i}^{test}$ and a randomly sampled subset of $D_{shd_i}^{train}$.
Each shadow model $f_{shd_i}$, $i = 1, \ldots, n$, is trained with Alshd using $D_{shd_i}^{train}$. For a user u querying the model with their audio samples, if $u \in D_{shd_i}^{train}$, we combine the features extracted from these samples' results into one record and label it as “member”; otherwise, we label this record as “nonmember”. These labels, combined with the outputs from the shadow models, form the training dataset for our audit model.
As for the auditing process, we randomly sample one or a few audios recorded by one speaker u from Dusers to query our target ASR model. These sampled audios are transcribed into text, producing the corresponding outputs. To audit whether the target model used this speaker's audio in its training phase, we analyze these transcription outputs as a testing record for our audio auditor. The feature extraction and preprocessing methods used for this testing record are the same as those used for the shadow models' results. The auditor finally classifies this testing record as “member” or “nonmember” and hence determines whether u ∈ Utar or not.
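Putting the two stages together, the sketch below trains a decision-tree auditor on placeholder records derived from shadow-model outputs and then audits one target-model record; the feature values and the separation between members and nonmembers are synthetic, standing in for the real preprocessed transcription features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
N_FEATURES = 8 * 3      # 8 audios x (TXT, Probability, Frame_Length)

# Placeholder auditor training set built from shadow-model outputs:
# "member" users get a small feature shift to stand in for the real signal.
y_train = rng.integers(0, 2, size=300)                  # 1 = member, 0 = nonmember
X_train = rng.normal(size=(300, N_FEATURES)) + 0.5 * y_train[:, None]

auditor = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Auditing: one record assembled from the target model's outputs for user u.
target_record = rng.normal(size=(1, N_FEATURES)) + 0.5
print("member" if auditor.predict(target_record)[0] == 1 else "nonmember")
```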

3.4 Experiment and Results


3.4.1 Dataset
The TIMIT speech corpus contains 6,300 sentences spoken by 630 speakers from 8 major dialect regions of the United States. Three kinds of sentences are recorded: the dialect sentences, the phonetically-compact sentences, and the phonetically-diverse sentences. The dialect sentences are spoken by all speakers from different dialect regions. The phonetically-compact sentences were recorded with good coverage of pairs of phones, while the phonetically-diverse audios recorded sentences selected from different corpora for diverse sentence types and phonetic contexts.
We manually selected three disjoint datasets from the TIMIT speech corpus, as described in Table 3.1. Specifically, each training dataset and testing dataset contains audio recorded by speakers from all 8 dialect regions. In addition, each subset contains all three kinds of audio mentioned above. The diversity of audio within each dataset not only better resembles a real-world ASR model's training set, but also retains user-specific information for the participant-level auditing task.
As a proof-of-concept, we aim to build up one target model and design two shadow models based on
this target model. As mentioned in Section 3.3.2, we curated three disjoint datasets from the TIMIT speech

Table 3.1: Datasets across models

Model | Training Dataset | Testing Dataset
Target | 154 speakers, 1,232 audios | 54 speakers, 432 audios
Shadow1 | 154 speakers, 1,232 audios | 57 speakers, 456 audios
Shadow2 | 154 speakers, 1,232 audios | 57 speakers, 456 audios

corpus, as listed in Table 3.1. In this experiment, we trained two shadow models on $D_{shd_i}^{train}$ ($i = 1, 2$) with a similar distribution to $D_{tar}^{train}$. Training with differently distributed datasets will be our future research.
The outputs of our two shadow models are used to train the audit model. By querying each shadow model with all of its testing set and one-third of its training set, we processed the outputs and labeled them as “nonmember” and “member”, respectively. Since the training datasets for all three models include eight sentences for each speaker, the feature set of the auditor's training dataset is {TXT1, Probability1, Frame_Length1, . . . , TXT8, Probability8, Frame_Length8, class}. To audit the target model, a speaker may query it with one to eight audio samples. When a user audits the target model with fewer than eight audio samples, we pad zeros for all the missing feature values.

3.4.2 Target Model


Our target model is a speech-to-text model. The inputs are a set of audio files with phonetic text as labels, while the outputs are the transcribed phonetic texts with their final probabilities and the corresponding input frame lengths. To simulate most current ASR models in the real world, we created a state-of-the-art DNN-HMM-based ASR model [158] using the PyTorch-Kaldi Speech Recognition Toolkit [159]. In the preprocessing step, MFCC features are used to train the model with the multilayer perceptron (MLP) algorithm. The number of training epochs is 24. The outputs of this MLP model are decoded and rescored with the probabilities of the HMM and an n-gram language model to obtain the transcription. A decision tree is used for the audit model.

Figure 3.3: Training and Validation Accuracy of Target Model
Figure 3.4: Training and Validation Accuracy of Shadow Model 1
Figure 3.5: Training and Validation Accuracy of Shadow Model 2

During the audio input's preprocessing step, we utilize the Kaldi Toolkit [167] to extract MFCC features for each audio waveform. Forced alignment between features and phone states was used to process the labels. To train the acoustic model, we applied a simple DNN algorithm, the multilayer perceptron (MLP), to learn the relationship between the input audios and the output transcriptions. As for the hyperparameters of the MLP model, we set up 4 hidden layers with 1,024 hidden neurons per layer. The learning rate was set to 0.08, and the model was trained for 24 epochs. The output of this MLP model is a set of pseudo-posterior probabilities over all possible phonetic units. These outputs are normalized and then fed into the HMM-based decoder. After decoding, an n-gram language model was applied to rescore the probabilities. The final transcription is the text sequence with the highest final probability.
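A simplified PyTorch stand-in for this acoustic model is sketched below, mirroring the stated configuration (4 hidden layers of 1,024 sigmoid units, cross-entropy loss, learning rate 0.08); the input dimension and the number of HMM-state targets are placeholders, and the real model is trained through the PyTorch-Kaldi recipes rather than this loop.

```python
import torch
import torch.nn as nn

class AcousticMLP(nn.Module):
    """Simplified MLP acoustic model: 4 hidden layers of 1,024 sigmoid units.

    Input/output sizes are placeholders (stacked MFCC context frames in,
    tied HMM-state posteriors out).
    """
    def __init__(self, n_in=440, n_hidden=1024, n_states=1928):
        super().__init__()
        layers, width = [], n_in
        for _ in range(4):
            layers += [nn.Linear(width, n_hidden), nn.Sigmoid()]
            width = n_hidden
        layers.append(nn.Linear(width, n_states))   # softmax is applied inside the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = AcousticMLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.08)   # learning rate 0.08; train for 24 epochs
criterion = nn.CrossEntropyLoss()                          # combines log-softmax and NLL

frames = torch.randn(32, 440)                 # a batch of stacked feature frames
states = torch.randint(0, 1928, (32,))        # forced-alignment state labels
optimizer.zero_grad()
loss = criterion(model(frames), states)
loss.backward()
optimizer.step()
print(float(loss))
```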
To evaluate the target model's performance, we use the training accuracy and validation accuracy shown in Figure 3.3. Comparing the training accuracy of the two shadow models and our target model, the trends are similar and the accuracy curves ultimately reach about 70% (see Figures 3.3, 3.4, 3.5). This indicates that our shadow models can successfully mimic the target model (producing the same transcriptions on the same audio inputs), or at least achieve the same utility, i.e., speech recognition (the same transcription accuracy, though not on the same input samples across models).

3.4.3 Results
To evaluate the auditor's performance, four metrics are calculated from the confusion matrix, including accuracy, precision, recall, and F1-score. True Positive (TP): the number of records predicted as “member” that are correctly labelled. True Negative (TN): the number of records predicted as “nonmember” that are correctly labelled. False Positive (FP): the number of records predicted as “member” that are incorrectly labelled. False Negative (FN): the number of records predicted as “nonmember” that are incorrectly labelled:

• Accuracy: the percentage of records correctly classified by the audit model.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

• Recall: the percentage of all true “member” records correctly determined as “member”.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

• Precision: the percentage of records correctly determined as “member” by the audit model among all records determined as “member”.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

• F1-score: the harmonic mean of precision and recall.
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$
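These metrics can be computed directly from the confusion-matrix counts, as in the short sketch below; the counts are illustrative only.

```python
def audit_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only.
print(audit_metrics(tp=48, tn=22, fp=4, fn=4))
```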

We show results for the behavior of the auditor under two different circumstances: when the number
of users in the training dataset is varied, and when the number of the audio samples from the user to be
audited is varied. Four metrics are calculated from the confusion matrix including accuracy, precision,
recall, and F1-score.

Table 3.2: The confusion matrix for the auditor.

Class Actual: member Actual: nonmember


Predicted: member TP FP
Predicted: nonmember FN TN

Effect of the number of users in the training dataset. The audit model's behavior when trained on sets containing different numbers of users is depicted in Figure 3.6. We trained the audit model with 25, 50, 75, 100, 125, and 150 users randomly sampled from the outputs of the two shadow models. The testing set used to query these audit models is fixed at 78 test audio records. To eliminate trial-specific deviations, we repeated each experiment 10 times and averaged the results. The audit model performs fairly well, with all metrics under the different configurations above or approximately equal to 85%. The model performs better when the number of users within the training set increases. With 100 users in the training set, the performance reached its highest score, especially in accuracy (approximately 93%) and F1-score (approximately 95%). When the number of users increased to 125, both metrics dropped slightly, but rose again when the number of users increased to 150. In all configurations, the audit model performs well. Herein, the more users that are used to train the audit model, the more accurately a user's membership within the target model can be determined. We will consider the audit model's performance when training with an even larger number of users in future work.

Figure 3.6: The audit model’s performance across the training set size.
Figure 3.7: The audit model’s performance by the number of audios for one speaker.

Effect of the number of audio records per user used in querying the auditor. Since we randomly sample a user's audio to test our audit model, the number of audio samples for this user may not be the same as the number of audios per user in the auditor's training dataset. That is, the number of non-zero features of an audit query may vary. We evaluate the effect on the auditor's performance of using a variable number of audio samples from each user in auditing. Herein, the number of users used in the different testing sets is the same, with $\#\{u \in U_{tar}\} : \#\{u \notin U_{tar}\} = 2:1$. To gather different numbers of non-zero features in the audit model's testing dataset, we queried the target model with 78 users, where each user was represented by one to eight randomly sampled test audio records. Like the experiment above, we repeated the experiment 100 times and averaged the results to reduce deviations in performance. The results are displayed in Figure 3.7. The more audios per user used to audit their membership, the more accurately our audio auditor performs. When the user audits the target model with only one audio, the audit model's performance is relatively low: the accuracy approaches 50%, while the other three metrics are around 25%. When the number of audio samples reaches eight, all performance results are above 90%.

3.5 Conclusion
This work highlights, and leaves open, the potential of mounting participant-level membership inference attacks in IoT voice services. While our work has yet to examine the attack success rate on various IoT applications across multiple learning models, it does narrow the gap towards defining clear membership privacy at the user level, rather than the record level [38], which leaves open the question of whether the privacy leakage stems from the data distribution or the intrinsic uniqueness of the record. Nevertheless, as we argued, both the size of the user base and the number of audio samples per user used in the testing set have been shown to have a positive effect on the IoT audit model. Examining other factors affecting performance and exploring possible defenses against auditing are worth further exploration.

3.6 Acknowledgement
We would like to greatly thank Nvidia Corporation for the donation of the Titan XP GPU used for this research.

Chapter 4

The Audio Auditor: Label-Only User-Level Membership Inference in Internet of Things Voice Services

With the rapid development of deep learning techniques, the popularity of voice services implemented
on various Internet of Things (IoT) devices is ever increasing. In this chapter, we examine user-level
membership inference in the problem space of voice services, by designing an audio auditor to verify
whether a specific user had unwillingly contributed audio used to train an automatic speech recognition
(ASR) model under strict black-box access. With user representation of the input audio data and their
corresponding translated text, our trained auditor is effective in user-level auditing. We also observe that an auditor trained on specific data generalizes well regardless of the ASR model architecture. We validate the auditor on ASR models trained with LSTM, RNN, and GRU algorithms on two state-of-the-art pipelines, the hybrid ASR system and the end-to-end ASR system. Finally, we conduct a real-world trial of our auditor on iPhone Siri, achieving an overall accuracy exceeding 80%. We hope the methodology and findings developed in this chapter can inform privacy advocates seeking to overhaul IoT privacy.

4.1 Introduction
Automatic speech recognition (ASR) systems are widely adopted on Internet of Things (IoT) devices [144,
145]. In the IoT voice services space, competition in the smart speaker market is heating up between giants
like Apple, Microsoft, and Amazon [146]. However, parallel to the release of new products, consumers are growing increasingly aware of and concerned about their privacy, particularly regarding unauthorized access to users’ audio in these ASR systems. Of late, privacy policies and regulations, such as the General Data
Protection Regulations (GDPR) [142], the Children’s Online Privacy Protection Act (COPPA) [143],
and the California Consumer Privacy Act (CCPA) [168], have been enforced to regulate personal data
processing. Specifically, the Right to be Forgotten [147] law allows customers to prevent third-party
voice services from continuously using their data [148]. However, the murky boundary between privacy
and security can thwart IoT’s trustworthiness [149, 150] and many IoT devices may attempt to sniff and
analyze the audio captured in real-time without a user’s consent [151]. Most recently, on WeChat – a
hugely popular messaging platform within China and Worldwide – a scammer camouflaged their voice

44
to sound like an acquaintance by spoofing his or her voice [152]. Additionally, in 2019, The Guardian
reported a threat regarding the user recordings leakage via Apple Siri [169]. Auditing if an ASR service
provider adheres to its privacy statement can help users to protect their data privacy. It motivates us to
develop techniques that enable auditing the use of customers’ audio data in ASR models.
Recently, researchers have shown that record-level membership inference [38, 170, 43] may expose
information about the model’s training data even with only black-box access. To mount membership
inference attacks, Shokri et al. [38] integrate a plethora of shadow models to constitute the attack model
to infer membership, while Salem et al. [43] further relax this process and resort to the target model’s
confidence scores alone. However, instead of inferring record-level information, we seek to infer user-level information to verify whether a user has any audios within the training set. Therefore, we define user-level membership inference as follows: when querying with a user’s data, if this user has any data within the target model’s training set, then this user is a user-level member of that training set, even if the query data themselves are not members of the training set.
Song and Shmatikov [171] discuss the application of user-level membership inference on text gen-
erative models, exploiting several top-ranked outputs of the model. Considering that most real-world ASR systems do not provide confidence scores, unlike text generative models which do [171], this chapter targets user-level membership inference on ASR systems under strict black-box access, which we define as having no knowledge about the model and knowing only the model’s output excluding confidence scores and rank information, i.e., only the predicted label is known.
Unfortunately, user-level membership inference on ASR systems with strict black-box access is challenging. (i) There is a lack of information about the target model [172]. As strict black-box inference provides little knowledge about the target model’s performance, it is hard for shadow models to mimic the target model. (ii) User-level inference requires a higher level of robustness than record-level inference. Unlike record-level inference, user-level inference needs to consider the speaker’s voice characteristics. (iii) ASR systems are complicated due to their learning architectures [172], making membership inference with shadow models resource- and time-consuming. Finally, time-series audio data is significantly more complex than textual data, resulting in varied feature patterns [154, 173].
In this chapter, we design and evaluate our audio auditor to help users determine whether their audio
records have been used to train an ASR model without their consent. We investigate two types of targeted
ASR models: a hybrid ASR system and an end-to-end ASR system. With an audio signal input, both
of the models transcribe speech into written text. The auditor audits the target model via strict black-box access with the intent to infer user-level membership. The target model will behave differently depending on whether it is transcribing audio from within its training set or from other datasets. Thus, one can analyze the transcriptions and use the outputs to train a binary classifier as the auditor. As our primary focus is to infer user-level membership, instead of using the rank lists of several top output results, we only use one text output, the user’s speaking speed, and the input audio’s true transcription while analyzing the transcription outputs (see details in Section 4.3).
In summary, the main contributions of this chapter are as follows:

1. We propose the use of user-level membership inference for auditing the ASR model under strict black-box access. With access to the top predicted label only, our auditor achieves 78.81% accuracy. In comparison, the best accuracy for the user-level auditor in text generative models with one top-ranked output is 72.3% [171].

2. Our auditor is effective for user-level auditing. For a user who has audios within the target model's training set, the accuracy of our auditor when querying with these recordings can exceed 80%. In addition, only nine queries are needed for each user (regardless of their membership or non-membership) to verify the presence of their recordings in the ASR model, at an accuracy of 75.38%.

3. Our strict black-box audit methodology is robust to various architectures and pipelines of the ASR
model. We investigate the auditor by auditing the ASR model trained with LSTM, RNNs, and GRU
algorithms. In addition, two state-of-the-art pipelines in building ASR models are implemented for
validation. The overall accuracy of our auditor achieves approximately 70% across various ASR
models on auxiliary and cross-domain datasets.

4. We conduct a proof-of-concept test of our auditor on iPhone Siri, under the strict black-box
access, achieving an overall accuracy in excess of 80%. This real-world trial lends evidence to the
comprehensive synthetic audit outcomes observed in this chapter.

To the best of our knowledge, this is the first work to examine user-level membership inference in the
problem space of voice services. We hope the methodology and findings developed in this chapter can inform privacy advocates seeking to overhaul IoT privacy.

4.2 Background
In this section, we provide an overview of automatic speech recognition models and membership inference attacks.

Figure 4.1: Two state-of-the-art ASR systems: (a) a hybrid ASR system; (b) an end-to-end ASR system.

4.2.1 The Automatic Speech Recognition Model


There are two state-of-the-art pipelines used to build the automatic speech recognition (ASR) system,
including the typical hybrid ASR systems and end-to-end ASR systems [174]. To test the robustness of our
auditor, we implement both open-source hybrid and end-to-end ASR systems focusing on a speech-to-text
task as the target models.
Hybrid ASR systems are mainly DNN-HMM-based acoustic models [157]. As shown in Fig. 4.1a,
typically, a hybrid ASR system is composed of a preprocessing step, a model training step, and a
decoding step [158]. During the preprocessing step, features are extracted from the input audio, while
the corresponding text is processed as the audio’s label. The model training step trains a DNN model to
create HMM class posterior probabilities. The decoding step maps these HMM state probabilities to a
text sequence. In this work, the hybrid ASR system is built using the pytorch-kaldi speech recognition
toolkit [159]. Specifically, feature extraction transforms each audio frame into the frequency domain as Mel-Frequency Cepstral Coefficient (MFCC) features. As an additional processing step, feature-space Maximum Likelihood Linear Regression (fMLLR) is used for speaker adaptation. Three popular neural
network algorithms are used to build the acoustic model, including Long Short-Term Memory (LSTM),
Gated Recurrent Units (GRU), and Recurrent Neural Networks (RNNs). The decoder involves a language
model which provides a language probability to re-evaluate the acoustic score. The final transcription
output is the word sequence with the highest combined acoustic and language score.
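To make the front end concrete, the following is a minimal sketch of frame-level MFCC extraction using the librosa library. It is an illustration only, not the pytorch-kaldi feature pipeline used in this chapter (which additionally applies fMLLR speaker adaptation), and the audio file path is hypothetical.

```python
import librosa

# Load an audio file (hypothetical path) and resample to 16 kHz,
# the sampling rate commonly used in LibriSpeech/TIMIT recipes.
audio, sr = librosa.load("speaker_0001/utterance_0001.wav", sr=16000)

# 13 MFCCs per frame with a 25 ms window and a 10 ms hop,
# mirroring a typical ASR front end.
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

print(mfcc.shape)  # (13, number_of_frames)
```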
End-to-end ASR systems are attention-based encoder-decoder models [175]. Unlike hybrid ASR
systems, the end-to-end system predicts sub-word sequences which are converted directly as word
sequences. As shown in Fig. 4.1b, the end-to-end system is a unified neural network modeling framework
containing four components: an encoder, an attention mechanism, a decoder, and a Softmax layer. The
encoder contains feature extraction (i.e., Visual Geometry Group (VGG) extractor) and a few neural
network layers (i.e., bidirectional LSTM (BiLSTM) layers), which encode the input audio into high-
level representations. The location-aware attention mechanism integrates the representation of this time
frame with the previous decoder outputs. Then the attention mechanism can output the context vector.
The decoder can be a single-layer neural network (i.e., an LSTM layer), decoding the current context vector together with the ground truth of the previous time frame. Finally, the softmax activation, which can be considered
as “CharDistribution”, predicts several outputs and integrates them into a single sequence as the final
transcription.
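To illustrate how these four components fit together, the PyTorch sketch below wires a BiLSTM encoder, a simple dot-product attention, and a single LSTM decoder cell with a softmax character distribution. It is a toy skeleton under assumed layer sizes and vocabulary, not the VGG + BiLSTM model with location-aware attention described above.

```python
import torch
import torch.nn as nn

class TinySeq2SeqASR(nn.Module):
    """Minimal attention-based encoder-decoder skeleton (illustrative only)."""

    def __init__(self, feat_dim=13, hidden=128, vocab_size=30):
        super().__init__()
        # Encoder: BiLSTM over acoustic feature frames.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Decoder: one LSTM cell fed with the previous character embedding plus context.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        # "CharDistribution": projects the decoder state to character logits.
        self.char_dist = nn.Linear(hidden, vocab_size)

    def forward(self, feats, prev_chars):
        # feats: (batch, frames, feat_dim); prev_chars: (batch, steps) teacher-forced characters.
        enc_out, _ = self.encoder(feats)                    # (batch, frames, 2*hidden)
        batch = feats.size(0)
        h = feats.new_zeros(batch, self.decoder.hidden_size)
        c = feats.new_zeros(batch, self.decoder.hidden_size)
        logits = []
        for t in range(prev_chars.size(1)):
            # Dot-product attention: duplicate the decoder state to match the BiLSTM width,
            # score every encoder frame, and form a context vector.
            scores = torch.bmm(enc_out, torch.cat([h, h], dim=1).unsqueeze(2))
            weights = torch.softmax(scores, dim=1)          # (batch, frames, 1)
            context = (weights * enc_out).sum(dim=1)        # (batch, 2*hidden)
            step_in = torch.cat([self.embed(prev_chars[:, t]), context], dim=1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.char_dist(h))
        return torch.stack(logits, dim=1)                   # (batch, steps, vocab_size)

model = TinySeq2SeqASR()
dummy_feats = torch.randn(2, 200, 13)        # two utterances, 200 frames of 13 MFCCs
dummy_prev = torch.zeros(2, 10, dtype=torch.long)
print(model(dummy_feats, dummy_prev).shape)  # torch.Size([2, 10, 30])
```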

4.2.2 Membership Inference Attack


The membership inference attack is considered a significant privacy threat to machine learning (ML)
models [77]. The attack aims to determine whether a specific data sample is within the target model’s
training set or not. The attack is driven by the different behaviors of the target model when making
predictions on samples within or out of its training set.
Various membership inference attack methods have been recently proposed. Shokri et al. [38] train
shadow models to constitute the attack model against a target ML model with black-box access. The
shadow models mimic the target model’s prediction behavior. To improve accuracy, Liu et al. [176] and
Hayes et al. [170] leverage Generative Adversarial Networks (GAN) to generate shadow models with
increasingly similar outputs to the target model. Salem et al. [43] relax the attack assumptions mentioned
in the work [38], demonstrating that shadow models are not necessary to launch the membership inference
attack. Instead, a threshold of the predicted confidence score can be defined to substitute the attack model.
Intuitively, a large confidence score indicates the sample as a member of the training set [177]. The attacks
mentioned in the work above are all performed on the record level, while Song and Shmatikov [171] study
a user-level membership inference attack against text generative models. Instead of using the prediction
label along with the confidence score, Song and Shmatikov [171] utilize the word rank-list information of
several top-ranked predictions as key features to generate the shadow model. Apart from the black-box
access, Farokhi and Kaafar [178] model the record-level membership inference attack under the white-box
access.
Unlike image recognition systems or text generative systems, ASR systems present additional chal-
lenges [172]. With strict black-box access, attacks using confidence scores cannot be applied. With
limited discriminative power, features can only be extracted from the predicted transcription and its input
audio to launch membership inference attacks, i.e., audio auditing in our paper.
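For contrast, the confidence-score shortcut of Salem et al. [43] can be sketched in a few lines. The threshold and posterior vectors below are made-up placeholders; this is exactly the kind of signal that strict black-box access to an ASR system does not expose.

```python
import numpy as np

def record_level_membership(confidences: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Threshold-style record-level inference in the spirit of Salem et al. [43]:
    a sample whose top predicted confidence exceeds the threshold is guessed to be
    a training-set member. The threshold here is a hypothetical placeholder."""
    return confidences.max(axis=1) > threshold

# Example with made-up posterior vectors from some classifier (not an ASR model):
posteriors = np.array([[0.97, 0.02, 0.01],   # confident prediction -> guess "member"
                       [0.40, 0.35, 0.25]])  # uncertain prediction -> guess "nonmember"
print(record_level_membership(posteriors))   # [ True False]
```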

4.3 Auditing the ASR Models

Figure 4.2: Auditing an ASR model.

In this section, we first formalize our objective for auditing ASR models. Secondly, we present how a
user-level ASR auditor can be constructed and used to audit the target ASR. Finally, we show how we
implement the auditor.

4.3.1 Problem Statement


We define user-level membership inference as querying with a user's data and trying to determine whether any data within the target model's training set belongs to this user. Even if the queried data are not members of the training set, as long as data belonging to this user are members of the training set, this user is regarded as a user-level member of the training set. Let $(x, y) \in X \times Y$ denote an audio sample, where $x$ represents the audio component and $y$ is the actual text of $x$. Assume an ASR model is a function $F: X \rightarrow Y$; $F(x)$ is the model's transcribed text. The smaller the difference between $F(x)$ and $y$, the better the ASR model performs. Let $D$ represent a distribution of audio samples. Assume an audio set $A$ of size $N$ is sampled from $D$ ($A \sim D^N$). Let $U$ be the speaker set of $A$ of size $M$ ($U \leftarrow A$). The ASR model trained with the dataset $A$ is denoted as $F_A$. Let $\mathcal{A}$ represent our auditor; the user-level auditing process can be formalized as:
• A speaker $u$ has $S = \bigcup_{i=1}^{m}(x_i, y_i)$, where $u \leftarrow S$.

• Let $Y' = \bigcup_{i=1}^{m} y_i'$, where $y_i' = F_A(x_i)$.

• Let "member" = 0 and "nonmember" = 1.

• Set $r = 0$ if $u \in U$, or $r = 1$ if $u \notin U$.

• The auditor succeeds if $\mathcal{A}(u, S, Y') = r$; otherwise it fails.
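Read as pseudocode, the formalization above amounts to the following decision procedure. This is a minimal sketch: the feature extraction and the binary classifier are left as abstract callables, which are filled in by the training and auditing processes described in the rest of this section.

```python
from typing import Callable, List, Tuple

def audit_user(user_audios: List[Tuple[object, str]],
               target_asr: Callable[[object], str],
               extract_features: Callable[[List[Tuple[str, str, object]]], List[float]],
               auditor: Callable[[List[float]], int]) -> str:
    """Query the target ASR model with a user's audios (x_i, y_i), build one
    user-level feature vector, and let the binary auditor decide membership.
    Returns "member" (label 0) or "nonmember" (label 1), matching the bullets above."""
    # Y' is the collection of predicted transcriptions y'_i = F_A(x_i).
    records = [(target_asr(x), y, x) for (x, y) in user_audios]
    features = extract_features(records)
    return "member" if auditor(features) == 0 else "nonmember"
```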

Our auditor, as an application of user-level membership inference, checks a speaker’s membership of
an ASR model’s training set. This ASR model is considered as the target model. To closely mirror the
real world, we query the target model with strict black-box access. The model only outputs a possible
text sequence as its transcription when submitting an audio sample to the target model. This setting
reflects the reality, as the auditor may not know this transcription’s posterior probabilities or other possible
transcriptions. Additionally, any information about the target model is unknown, including the model’s
parameters, algorithms used to build the model, and the model’s architecture. To evaluate our auditor,
we develop our target ASR model Ftar using an audio set Atar with two popular pipelines — hybrid
ASR model and end-to-end ASR model — to represent the ASR model in the real world. As described
in Section 4.2, the hybrid ASR model and the end-to-end ASR model translate the audio in different
manners. Under the strict black-box access, the auditor only knows query audio records of a particular
user u and its corresponding output transcription. The goal of the auditor is to build a binary classifier
$\mathcal{A}_{audit}$ to discriminate whether this user belongs to the set of users whose audio records have been used as the target model's training data ($u \in U_{tar}$, $U_{tar} \leftarrow A_{tar}$).

4.3.2 Overview of the Proposed Audio Auditor


The nature of membership inference [38] is to learn the difference in a model's behavior when fed with its actual training samples versus other samples. User-level membership inference requires higher robustness than its record-level variant. Apart from the disparity of the target model's performance at the record level, our auditor needs
to consider the speaker’s characteristics as well. Since the posterior probabilities (or confidence scores)
are not part of the outputs, shadow models are necessary to audit the ASR model.
Fig. 4.2 depicts a workflow of our audio auditor auditing an ASR model. Generally, there are two
processes, i.e., training and auditing. The former process is to build a binary classifier as a user-level
membership auditor Aaudit using a supervised learning algorithm. The latter uses this auditor to audit an
ASR model $F_{tar}$ by querying a few audios spoken by one user $u$. In Section 4.4.4, we show that only a small number of audios per user is needed to determine whether $u \in U_{tar}$ or $u \notin U_{tar}$. Furthermore, a small number of users used to train the auditor is sufficient to provide a satisfying result.
Training Process. The primary task in the training process is to build shadow models of high quality. Shadow models, mimicking the target model's behaviors, try to infer the target ASR model's decision boundary. Due to strict black-box access, a good-quality shadow model achieves a testing accuracy similar to that of the target model. We randomly sample $n$ datasets from the auxiliary reference dataset $D_{ref}$, denoted $A_{shd_1}, \ldots, A_{shd_n}$, to build $n$ shadow models. Each shadow model's audio dataset $A_{shd_i}$, $i = 1, \ldots, n$, is split into a training set $A^{train}_{shd_i}$ and a testing set $A^{test}_{shd_i}$. To build up the ground truth for auditing, we query the shadow model with $A^{train}_{shd_i}$ and $A^{test}_{shd_i}$. Assume a user's audio set $A_u$ is sampled from users' audio sets $D_{users}$. According to the user-level membership inference definition, the outputs from the audios $A_u \in A^{test}_{shd_i}$ whose speaker $u \notin U^{train}_{shd_i}$ are labeled as "nonmember". Otherwise, the outputs translated from the audios $A_u \in A^{train}_{shd_i}$, and from the audios $A_u \in A^{test}_{shd_i}$ whose speaker $u \in U^{train}_{shd_i}$, are all labeled as "member". Herein, $U^{train}_{shd_i} \leftarrow A^{train}_{shd_i}$. To simplify the experiment, for each shadow model, training samples are disjoint from testing samples ($A^{train}_{shd_i} \cap A^{test}_{shd_i} = \emptyset$). Their user sets are disjoint as well ($U^{train}_{shd_i} \cap U^{test}_{shd_i} = \emptyset$). With some feature extraction (noted below), those labeled records are gathered as the auditor model's training set.
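The labeling rule above can be summarized with a short sketch. The dataset splits, the shadow model callable, and the user-level aggregation helper are placeholders standing in for whatever the surrounding pipeline provides.

```python
from collections import defaultdict

def build_auditor_training_set(shadow_asr, train_split, test_split, train_speakers,
                               aggregate_user_features):
    """Label shadow-model query results at the user level:
    a speaker is a "member" (0) iff he or she has audios in the shadow training split,
    regardless of whether the queried audio itself was seen during training."""
    per_user = defaultdict(list)
    for split in (train_split, test_split):
        for speaker, audio, truth in split:            # each record: (speaker id, audio, transcript)
            per_user[speaker].append((shadow_asr(audio), truth, audio))

    samples = []
    for speaker, records in per_user.items():
        label = 0 if speaker in train_speakers else 1  # 0 = "member", 1 = "nonmember"
        samples.append((aggregate_user_features(records), label))
    return samples
```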
Feature extraction is another essential task in the training process. Under the strict black-box access, features are extracted from the input audio, the ground-truth transcription, and the predicted transcription.

Table 4.1: The audit model's performance when selecting either 3 features, 5 features, or 5 features with MFCCs for each audio's query.

                       F1-score   Precision   Recall    Accuracy
Feature_Set3           63.89%     68.48%      60.84%    61.13%
Feature_Set5           81.66%     81.40%      82.22%    78.81%
Feature_Set5 + MFCCs   81.01%     79.72%      82.52%    77.82%
As a user-level membership inferrer, our auditor needs to learn the information about the target model’s
performance and the speaker’s characteristics. Comparing the ground truth transcription and the output
transcription, the similarity score is the first feature to represent the ASR model’s performance. To
compute the two transcriptions’ similarity score, the GloVe model [179] is used to learn the vector
space representation of these two transcriptions. Then the cosine similarity distance is calculated as
the two transcriptions’ similarity score. Additionally, the input audio frame length and the speaking
speed are selected as two features to present the speaker’s characteristics. Because a user almost always
provides several audios to train the ASR model, statistical calculation is applied to the three features
above, including sum, maximum, minimum, average, median, standard deviation, and variance. After
the feature extraction, all user-level records are gathered with labels to train an auditor model using a
supervised learning algorithm.
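A minimal sketch of this statistical aggregation is shown below. The similarity argument stands in for the GloVe-based cosine similarity described above, and the record format is an assumption made for illustration.

```python
import statistics

def user_level_features(records, similarity):
    """Aggregate record-level features (similarity score, frame length, speaking speed)
    into one user-level vector using sum, max, min, mean, median, stdev, and variance.
    `records` is assumed to be a list of (predicted_text, true_text, frame_length, speed);
    `similarity` stands in for the GloVe cosine similarity used in this chapter."""
    per_record = [
        [similarity(pred, truth), frame_length, speed]
        for pred, truth, frame_length, speed in records
    ]
    features = []
    for column in zip(*per_record):                  # one column per record-level feature
        values = list(column)
        features += [
            sum(values), max(values), min(values),
            statistics.mean(values), statistics.median(values),
            statistics.pstdev(values), statistics.pvariance(values),
        ]
    return features
```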
To test the quality of the feature set above, we trained an auditor with 500 user-level samples
using the Random Forest (RF) algorithm. By randomly selecting 500 samples 100 times, we achieve
an average accuracy result over 60%. Apart from the three aforementioned features, two additional
features are added to capture more variations in the model’s performance, including missing characters
and extra characters obtained from the transcriptions. For example, if (truth transcription, predicted
transcription) = (THAT IS KAFFAR’S KNIFE, THAT IS CALF OUR’S KNIFE), then (missing characters,
extra characters) = (KFA, CL OU). Herein, the blank character in the extra characters means that one
word was mistranslated as two words. With these two extra features, a total of five features are extracted
from record-level samples: similarity score, missing characters, extra characters, frame length, and speed.
The record-level samples are transformed into user-level samples using statistical calculation as previously
described. We compare the performance of two auditors trained with the two feature sets. We also
consider adding 13 Mel-Frequency Cepstral Coefficients (MFCCs) as the additional audio-specific feature
set to accentuate each user’s records with the average statistics. As seen in Table 4.1, the statistical feature
set with the 5-tuple is the best choice, with approximately 80% accuracy, while the results with additional audio-specific features are similar but trail by about one percentage point. Thus, we proceed with five statistical
features to represent each user as the outcome of the feature extraction step.
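The missing- and extra-character features can be approximated by a character-level multiset difference, as sketched below. This is a simplification of the comparison implied above, so the character ordering differs from the KAFFAR example while the character sets match.

```python
from collections import Counter

def char_diff_features(true_text: str, predicted_text: str):
    """Compute missing-character and extra-character features as multiset differences
    over characters (spaces included, so a word mistranslated as two words
    contributes a blank character to the extra characters)."""
    missing = Counter(true_text) - Counter(predicted_text)   # in the truth but not the prediction
    extra = Counter(predicted_text) - Counter(true_text)     # in the prediction but not the truth
    return "".join(missing.elements()), "".join(extra.elements())

print(char_diff_features("THAT IS KAFFAR'S KNIFE", "THAT IS CALF OUR'S KNIFE"))
# -> ('AKF', ' CLOU'): the same characters as the (KFA, CL OU) example, in a different order
```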
Auditing Process. After training an auditor model, we randomly sample a particular speaker $u$'s audios $A_u$ from $D_{users}$ to query our target ASR model. With the same feature extraction, the outputs can be passed to the auditor model to determine whether this speaker $u \in U_{tar}$. We assume that our target
model’s dataset Dtar is disjoint from the auxiliary reference dataset Dref (Dtar ∩ Dref = ∅). In addition,
Uref and Utar are also disjoint (Utar ∩ Uref = ∅). For each user, as we will show, only a limited number
of audios are needed to query the target model and complete the whole auditing phase.

Table 4.2: The audit model’s performance trained with different algorithms.

F1-score Precision Recall Accuracy


DT 68.67% 70.62% 67.29% 64.97%
RF 81.66% 81.40% 82.22% 78.81%
3-NN 58.62% 64.49% 54.69% 56.16%
NB 34.42% 93.55% 21.09% 53.96%

4.3.3 Implementation
In these experiments, we take audios from LibriSpeech [180], TIMIT [23], and TED-LIUM [181] to build
our target ASR model and the shadow model. Detailed information about the speech corpora and model
architectures can be found in the Appendix. Since the LibriSpeech corpus has the largest audio sets, we
primarily source records from LibriSpeech to build our shadow models.
Target Model. Our target model is a speech-to-text ASR model. The inputs are a set of audio files with
their corresponding transcriptions as labels, while the outputs are the transcribed sequential texts. To
simulate most of the current ASR models in the real world, we created a state-of-the-art hybrid ASR
model [158] using the PyTorch-Kaldi Speech Recognition Toolkit [159] and an end-to-end ASR model
using the Pytorch implementation [175]. In the preprocessing step, fMLLR features were used to train
the ASR model with 24 training epochs. Then, we trained an ASR model using a deep neural network
with four hidden layers and one Softmax layer. We experimentally tuned the batch size, learning rate
and optimization function to gain a model with better ASR performance. To mimic the ASR model in
the wild, we tuned the parameters until the training accuracy exceeded 80%, similar to the results shown
in [174, 175]. Additionally, to better contextualize our audit results, we report the overfitting level of the
ASR models, defined as the difference between the predictions’ Word Error Rate (WER) on the training
set and the testing set ($Overfitting = WER_{test} - WER_{train}$).
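For reference, WER can be computed with a standard word-level edit distance, as in the sketch below. This is a plain dynamic-programming illustration rather than the Kaldi scoring scripts used in our experiments, and the two WER values at the end are made-up numbers (roughly mirroring the target model in Table 4.3) that only illustrate the overfitting measure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("that is kaffar's knife", "that is calf our's knife"))  # 0.5

# Hypothetical WER values illustrating the overfitting measure:
wer_train, wer_test = 0.05, 0.09
overfitting = wer_test - wer_train   # gap between testing and training WER, 0.04 here
```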

4.4 Experimental Evaluation and Results


The goal of this work is to develop an auditor for users to inspect whether their audio information is used
without consent by ASR models or not. We mainly focus on the evaluation of the auditor, especially in
terms of its effectiveness, efficiency, and robustness. As such, we pose the following research questions.

• The effectiveness of the auditor. We train our auditor using different ML algorithms and select one
with the best performance. How does the auditor perform with different sizes of training sets? How
does it perform in the real-world scenario, such as auditing iPhone Siri?

• The efficiency of the auditor. How many pieces of audios does a user need for querying the ASR
model and the auditor to gain a satisfying result?

• The data transferability of the auditor. If the data distribution of the target ASR model’s training set
is different from that of the auditor, is there any effect on the auditor’s performance? If there is a
negative effect on the auditor, is there any approach to mitigate it?

• The robustness of the auditor. How does the auditor perform when auditing the ASR model built
with different architectures and pipelines? How does an auditor perform when a user queries the
auditor with audios recorded in a noisy environment (i.e., noisy queries)?

4.4.1 Effect of the ML Algorithm Choice for the Auditor
We evaluate our audio auditor as a user-level membership inference model against the target ASR system.
This inference model is posed as a binary classification problem, which can be trained with a supervised
ML algorithm. We first consider the effect of different training algorithms on our auditor performance.
To test the effect of different algorithms on our audit methodology, we need to train one shadow ASR
model for training the auditor and one target ASR model for the auditor’s auditing phase. We assume
the target ASR model is a hybrid ASR system whose acoustic model is trained with a four-layer LSTM
network. The training set used for the target ASR model is 100 hours of clean audio sampled from the
LibriSpeech corpus [180]. Additionally, the shadow model is trained using a hybrid ASR structure where
GRU network is used to build its acoustic model. According to our audit methodology demonstrated in
Fig. 4.2, we observe the various performance of the audio auditor trained with four popular supervised
ML algorithms listing as Decision Tree (DT), Random Forest (RF), k-Nearest Neighbor where k = 3
(3-NN), and Naive Bayes (NB). After feature extraction, 500 users’ samples from the shadow model’s
query results are randomly selected as the auditor’s training set. To avoid potential bias in the auditor, the
number of “member” samples and the number of “nonmember” samples are equal in all training set splits
($\#\{u \in U_{shd}\} = \#\{u \notin U_{shd}\}$). As an additional step to eliminate bias, each experimental configuration is repeated 100 times. Their average result is reported as the respective auditor's final
performance in Table 4.2.
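This comparison protocol can be sketched with scikit-learn as below. The feature matrix is random stand-in data, so the printed accuracies are meaningless; the real experiments use the user-level features of Section 4.3.2, balanced member/nonmember labels, and 100 repetitions per configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Stand-in user-level features: 500 users, 35 features (5 record-level features x 7 statistics),
# with a balanced 250/250 member (0) / nonmember (1) split.
X = rng.normal(size=(500, 35))
y = np.array([0] * 250 + [1] * 250)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "NB": GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```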
As shown in Table 4.2, our four metrics of accuracy, precision, recall, and F1-score are used to
evaluate the audio auditor. In general, the RF auditor achieves the best performance compared to the other
algorithms. Specifically, the accuracy approaches 80%, with the other three metrics also exceeding 80%.
We note that all auditors’ accuracy results exceed the random guess (50%). Aside from the RF and DT
auditors, the auditor with other ML algorithms behaves significantly differently in terms of precision and
recall, where the gaps of the two metrics are above 10%. The reason is in part due to the difficulty in
distinguishing the “member” and “nonmember” as a user’s audios are all transcribed well at a low speed
with short sentences. Tree-based algorithms, with right sequences of conditions, may be more suitable to
discriminate the membership. We regard the RF construction of the auditor as being the most successful;
as such, RF is the chosen audio auditor algorithm for the remaining experiments.

4.4.2 Effect of the Number of Users Used in Training Set of the Auditor
To study the effect of the number of users, we assume that our target model and shadow model are trained
using the same architecture (hybrid ASR system). However, due to the strict black-box access to the
target model, the shadow model and acoustic model shall be trained using different networks. Specifically,
LSTM networks are used to train the acoustic model of the target ASR system, while a GRU network is
used for the shadow model. As depicted in Fig. 4.3, each training sample (xi , yi ) is formed by calculating
shadow model’s querying results for each user uj ← m
S
i=1 (xi , yi ). Herein, we train the audio auditor
with a varying number of users (j = 1, ..., M ). The amount of users M we considered in the auditor
training set is 10, 30, 50, 80, 100, 200, 500, 1,000, 2,000, 5,000, and 10,000.
On the smaller numbers of users in the auditor’s training set, from Fig. 4.3, we observe a rapid increase
in performance with an increasing number of users. Herein, the average accuracy of the auditor is 66.24%
initially, reaching 78.81% when the training set size is 500 users. From 500 users, the accuracy decreases
then plateaus. Overall, the accuracy is better than the random guess baseline of 50% for all algorithms.

Figure 4.3: Auditor model performance with varied training set size.

Figure 4.4: Auditor model accuracy on a member user querying with the target model's unseen audios ($A^{out}_{mem}$) against the performances on the member users only querying with the seen recordings ($A^{in}_{mem}$).

Aside from accuracy, the precision increases from 69.99% to 80.40%; the recall is 73.69% initially and
eventually approaches approximately 90%; and the F1-score is about 80% when the training set size
exceeds 200. In summary, we identify the auditor’s peak performance when using a relatively small
number of users for training.
Recall the definition of user-level membership in Section 4.3.1. We further consider two extreme
scenarios in the auditor’s testing set. One extreme case is that the auditor’s testing set only contains
member users querying with the unseen audios (excluded from the target model’s training set), henceforth
denoted as $A^{out}_{mem}$. The other extreme case is an auditor's testing set that only contains member users querying with the seen audios (exclusively from the target model's training set), herein marked as $A^{in}_{mem}$. Fig. 4.4 reports the accuracy of our auditor on $A^{in}_{mem}$ versus $A^{out}_{mem}$. If an ASR model were to use a user's recordings as its training samples ($A^{in}_{mem}$), the auditor can determine the user-level membership with a much higher accuracy, when compared to user queries on the ASR model with $A^{out}_{mem}$. Specifically, $A^{in}_{mem}$ has a peak accuracy of 93.62% when the auditor's training set size $M$ is 5,000. Considering the peak performance previously shown in Fig. 4.3, auditing with $A^{in}_{mem}$ still achieves good accuracy (around
85%) despite a relatively small training set size. Comparing the results shown in Fig. 4.3 and Fig. 4.4,
we can infer that the larger the auditor’s training set size is, the more likely nonmember users are to be
misclassified. The auditor’s overall performance peak when using a small number of the training set is
largely due to the high accuracy of the shadow model. A large number of training samples perhaps contain
the large proportion of nonmember users’ records whose translation accuracy is similar to the member
users’.
Overall, it is better for users to choose audios that have a higher likelihood of being contained within
the ASR model for audit (for example, the audios once heard by the model).

4.4.3 Effect of the Target Model Trained with Different Data Distributions
The previous experiment draws conclusions based on the assumption that the distributions of training
sets for the shadow model and the target model are the same. That is, these two sets were sampled from the LibriSpeech corpus $D_L$ ($A_{tar} \sim D_L$, $A_{shd} \sim D_L$, $A_{tar} \cap A_{shd} = \emptyset$). Aside from the effect of a changing number of users used to train the auditor, we relax this distribution assumption to evaluate the data transferability of the auditor. To this end, we train one auditor using a training set sampled from LibriSpeech $D_L$ ($A_{shd} \sim D_L$). Three different target ASR models are built using data selected from LibriSpeech, TIMIT, and TED, respectively.

Figure 4.5: The auditor model audits target ASR models trained with training sets of different data distributions: (a) accuracy, (b) precision, (c) recall. We observe that, in regards to accuracy and recall, the target model with the same distribution as the auditor performs the best, while the contrary is observed for precision. Nevertheless, the data transferability is well observed with reasonably high metrics for all data distributions.
Fig. 4.5 plots the auditor’s data transferability on average accuracy, precision, and recall. Once above a
certain threshold of the training set size (≈ 10), the performance of our auditor significantly improves with an increasing number of users' data selected as its user-level training samples. Comparing the peak results,
the audit of the target model trained with the same data distribution (LibriSpeech) slightly outperforms the
audit of target models with different distributions (TIMIT and TED). For instance, the average accuracy
of the auditor auditing LibriSpeech data reaches 78.81% when training set size is 500, while the average
audit accuracy of the TIMIT target model peaks at 72.62% for 2,000 users. Lastly the average audit
accuracy of TED target model reaches its maximum of 66.92% with 500 users. As shown in Fig. 4.5, the
peaks of precision of the LibriSpeech, TIMIT, and TED target model are 81.40%, 93.54%, and 100%,
respectively, opposite of what was observed with accuracy and recall. The observation of the TED target
model with extremely high precision and low recall is perhaps due to the dataset’s characteristics, where
all of the audio clips of TED are long speeches recorded in a noisy environment.

In conclusion, our auditor demonstrates satisfying data transferability in general.

4.4.4 Effect of the Number of Audio Records Per User


The fewer audio samples a speaker is required to submit for their user-level query during the auditing
phase, the more convenient it is for users to use the auditor. Additionally, if the auditor can be trained
with user-level training samples accumulated from a reduced number of audios per user, both added
convenience and the efficiency of feature preprocessing during the auditor’s training can be realized.
A limited number of audio samples per user versus a large number of audio samples per user.
Assuming that each user audits their target ASR model by querying with a limited number of audios, we
shall consider whether a small number or a large number of audio samples per user should be collected to
train our auditor. Herein, varying the number of audios per user only affects the user-level information learned by the auditor during the training phase.

Figure 4.6: A comparison of average accuracy for one audio, five audios, and all audios per user when training the auditor model with a limited number of audios per user gained in the auditing phase.

Figure 4.7: A varying number of audios used for each speaker when querying an auditor model trained with 5 audios per user.

To evaluate this, we have sampled one, five, and all audios
per user as the training sets when the querying set uses five audios per user. Fig. 4.6 compares the average
accuracy of the auditors when their training sets are processed from limits of one audio, five audios, and
finally all audios of each user. To set up the five audio auditor’s training sets, we randomly select five
audios recorded from each user $u_j \leftarrow \bigcup_{i=1}^{m=5}(x_i, y_i)$, then translate these audios using the shadow model
to produce five transcriptions. Following the feature preprocessing demonstrated in Section 4.3, user-level
information for each user is extracted from these five output transcriptions with their corresponding input
audios. The same process is applied to construct the auditor in which the training data consists of one
audio per user. To set up the auditor’s training set with all the users’ samples, we collect all audios spoken
by each user and repeat the process mentioned above (average m̄ > 62). Moreover, since the two auditors’
settings above rely on randomly selected users, each configuration is repeated 100 times, with users
sampled anew, to report the average result free of sampling biases.
Fig. 4.6 demonstrates that the auditor performs best when leveraging five audios per user during the
feature preprocessing stage. When a small number of users are present in the training set, the performance
of the two auditors is fairly similar, except the auditor trained with one audio per user. For example,
when only ten users are randomly selected to train the auditor, the average accuracy of these two auditors
are 61.21% and 61.11%. When increasing to 30 users in the training set, the average accuracy of the
5-sample and all-sample auditors is 65.65% and 64.56%, respectively. However, with more than 30
users in the training set, the auditor trained on five audios per user outperforms that using all audios per
user. Specifically, when using five audios per user, the auditor’s average accuracy rises to ≈70% with a
larger training set size; compared to the auditor using all audios per user, with a degraded accuracy of
≈55%. This is in part owing to the difficulty of accurately characterizing users’ audios. In conclusion,
despite restrictions on the number of user audio samples when training the auditor, the auditor can achieve
superior performance. Consequently, we recommend that the number of audios per user collected for the
auditor’s training process should be the same for the auditor’s querying process.

A limited number of audio samples per user while querying the auditor. While we have investi-
gated the effect of using a limited number of audios per user to build the training set, we now ask how the
auditor performs with a reduced number of audios provided by the user during the querying stage of the
audit process, and how many audios each user needs to submit to preserve the performance of this auditor.
We assume our auditor has been trained using the training set computed with five audios per user. Fig. 4.7
displays performance trends (accuracy, precision, recall, and F1-score) when a varying number of query
audios per user is provided to the target model. We randomly select a user’s audios to query the target
model by testing m = 1, 3, 5, 7, 9, or 11 audios per user. As Fig. 4.6 reveals that the accuracy results
are stable when training set size is large, we conduct our experiments using 10,000 records in the auditor
training set. Again, each experiment is repeated 100 times, and the results are all averaged.
Fig. 4.7 illustrates that the auditor performs well with results all above 60%. Apart from recall, the
other three performances trend upwards with an increasing number of audios per user. The scores of
the accuracy, precision, and F1-score are approximately 75%, 81%, and 78%, respectively, when each
user queries the target model with nine audios, indicating an improvement over the accuracy (≈72%) we
previously observed in Fig. 4.3. It appears, for accuracy, when the number of query audios per user grows,
the upward trend slows down and even slightly declines. The recall is maximized (89.78%) with only
one audio queried by each user, decreasing to 70.97% with eleven audios queried for each user. This might happen because an increased number of audios per user does not imply an increased number of users (i.e., testing samples). Since the auditor was trained with five audios per user, it may fail to recognize
the user’s membership when querying with many more audios.

Overall, with only a limited number of audios used for audit, e.g. nine audios per user, our auditor
still effectively discriminates a user’s membership in the target model’s training set.

4.4.5 Effect of Training Shadow Models across Different Architectures

Figure 4.8: Different auditor model performance when trained with different ASR shadow model architectures: (a) accuracy, (b) precision, (c) recall.

A shadow model trained with different architectures influences how well it mimics the target model
and the performance of the user-level audio auditor. In this subsection, we experiment with different
shadow model architectures by training the auditor with information from various network algorithms, such as LSTM, RNNs, and GRU. If the choice of a shadow model algorithm has a substantial impact on the auditor's performance, we shall seek a method to lessen such an impact.

Table 4.3: Information about ASR models trained with different architectures. ($WER_{train}$: the prediction's WER on the training set; $WER_{test}$: the prediction's WER on the testing set; t: target model; s: shadow model.)

ASR Models     Model's Architecture     Dataset Size   $WER_{train}$   $WER_{test}$
LSTM-ASR (s)   4-LSTM layer + Softmax   360 hrs        6.48%           9.17%
RNN-ASR (s)    4-RNN layer + Softmax    360 hrs        9.45%           11.09%
GRU-ASR (s)    5-GRU layer + Softmax    360 hrs        5.99%           8.48%
LSTM-ASR (t)   4-LSTM layer + Softmax   100 hrs        5.06%           9.08%

We also seek to evaluate the influence of
the combining attack as proposed by Salem et al. [43], by combining the transcription results from a set of
ASR shadow models, instead of one, to construct the auditor’s training set. The feature extraction method
is demonstrated in Section 4.3. We refer to this combination as user-level combining audit.
To explore the specific impact of architecture, we assume that the acoustic model of the target ASR
system is mainly built with the LSTM network (we call this model the LSTM-ASR Target model). We
consider three popular algorithms, LSTM, RNNs, and GRU networks, to be prepared for the shadow
model’s acoustic model. The details of the target and shadow ASR models above are displayed in Table 4.3.
Each shadow model is used to translate various audios, with their results processed into the user-level
information to train an auditor. Following this convention, the shadow model that mainly uses the GRU network structure to train its acoustic model is marked as the GRU-ASR shadow model; its corresponding auditor,
named GRU-based auditor, is built using the training set constructed from GRU-ASR shadow model’s
query results. Our other two auditors have a similar naming convention, an LSTM-based auditor and
an RNN-based auditor. Moreover, as demonstrated in Fig. 4.2, we combine these three shadow models’
results (n = 3), and construct user-level training samples to train a new combined auditor. This auditor is
denoted as the Combined Auditor that learns all kinds of popular ASR models.
Fig. 4.8 demonstrates the varied auditor performance (accuracy, precision and recall) when shadow
models using various algorithms are deployed. For accuracy, all four auditors show an upward trend with a
small training set size. The peak is observed at 500 training samples then decays to a stable smaller value
at huge training set sizes. The GRU-based auditor surpasses the other three auditors in terms of accuracy,
with the Combined Auditor performing the second-best when the auditor’s training set size is smaller
than 500. As for precision, all experiments show relatively high values (all above 60%), particularly the
LSTM-based auditor with a precision exceeding 80%. According to Fig. 4.8(c), the RNN-based auditor and the GRU-based auditor show an upward trend in recall. Herein, both of their recalls exceed 80% when the training set size is larger than 500. The recall trends for the LSTM-based auditor and the Combined Auditor are the opposite of those for the GRU- and RNN-based auditors. In general, the RNN-based
auditor performs well across all three metrics. The LSTM-based auditor shows an excellent precision,
while the GRU-based auditor obtains the highest accuracy.

The algorithm selected for the shadow model will influence the auditor’s performance. The Combined
Auditor can achieve accuracy higher than the average, only if its training set is relatively small.

4.4.6 Effect of Noisy Queries
An evaluation of the user-level audio auditor’s robustness, when provided with noisy audios, is also
conducted. We also consider the effect of the noisy audios when querying different kinds of auditors
trained on different shadow model architectures. We shall describe the performance of the auditor with
two metrics — precision and recall; these results are illustrated in Fig. 4.9.

Figure 4.9: Different auditors audit noisy queries with different ASR shadow models: (a) GRU-based auditor precision, (b) LSTM-based auditor precision, (c) RNN-based auditor precision, (d) GRU-based auditor recall, (e) LSTM-based auditor recall, (f) RNN-based auditor recall.

To explore the effect of noisy queries, we assume that our target model is trained with noisy audios.
Under the strict black-box access to this target model, we shall use different neural network structures
to build the target model(s) and the shadow model(s). That is, the target model is an LSTM-ASR target
model, while the GRU-ASR shadow model is used to train the GRU-based auditor. For evaluating the
effect of the noisy queries, two target models are prepared using (i) clean audios (100 hours) and (ii)
noisy audios (500 hours) as training sets. In addition to the GRU-based auditor, another two auditors are
constructed, an LSTM-based auditor and an RNN-based auditor. The target models audited by the latter
two auditors are the same as the GRU-based auditor. Herein, the LSTM-based auditor has an LSTM-ASR
shadow model whose acoustic model shares the same algorithm as its LSTM-ASR target model.
Fig. 4.9a and Fig. 4.9d compare the precision and recall of the GRU-based auditor on target models
trained with clean and noisy queries, respectively. Overall, the auditor’s performance drops when auditing
noisy queries, but the auditor still outperforms the random guess (>50%). By varying the size of the
auditor’s training set, we observe the precision of the auditor querying clean and noisy audios displaying
similar trends. When querying noisy audios, the largest change in precision is ≈11%, where the auditor’s
training set size was 500. Its precision results when querying clean and noisy audios are around 81% and 70%, respectively. However, the trends of the two recall results are nearly opposite, and the recall on noisy queries decreases remarkably. The largest drop in recall is about 42 percentage points, where the auditor was trained with ten training samples. Its recall results when querying the two kinds of audios are around 74% and 32%. In conclusion, we observe that the impact of noisy queries on our auditor is fairly negative.
Fig. 4.9b and Fig. 4.9e display the LSTM-based auditor’s precision and recall, respectively, while
Fig. 4.9c and Fig. 4.9f illustrate the RNN-based auditor’s performance. Similar trends are observed
from the earlier precision results for the RNN-based auditor when querying clean and noisy audios.
However, curiously, the RNN-based auditor, when querying noisy audios, slightly outperforms queries
on clean audios. Similar to the noisy queries effect on GRU-based auditor, the noisy queries’ recall of
the RNN-based auditor decreases significantly versus the results of querying the clean audios. Though
noisy queries show a negative effect, all recalls performed by the RNN-based auditor exceed 50%, the
random guess. As for the effect of noisy queries on the LSTM-based auditor, unlike GRU-based auditor
and RNN-based auditors, the LSTM-based auditor demonstrates high robustness on noisy queries. For most of its precision and recall results, the differences between the performance on clean and noisy queries are no more than 5%.

In conclusion, noisy queries create a negative effect on our auditor’s performance. Yet, if the shadow
ASR model and the target ASR model are trained with the same algorithm, the negative effect can be
largely eliminated.

4.4.7 Effect of Different ASR Model Pipelines on Auditor Performance

Figure 4.10: The audit model audits different target ASR models trained with different pipelines: (a) accuracy, (b) precision, (c) recall.

Aside from the ASR model’s architecture, we examine the user-level auditor’s robustness on different
pipelines commonly found in ASR systems. In this section, the ASR pipeline is not just a machine learning model but a complicated system, as shown in Fig. 4.1. In practice, the two most popular pipelines
adopted in ASR systems are: a hybrid ASR system, or an end-to-end ASR system. We build our auditor
using the GRU-ASR shadow model with two target models trained on systems built on the aforementioned
ASR pipelines. Specifically, one target model utilizes the Pytorch-Kaldi toolkit to construct a hybrid
DNN-HMM ASR system, while the other target model employs an end-to-end ASR system.
Fig. 4.10 reports the performance (accuracy, precision, and recall) of the auditor when auditing the
two different target pipelines. Overall, the auditor behaves well over all metrics when auditing either target model (all above 50%). The auditor always demonstrates good performance when using a small number of training samples. The auditor achieves a better result on the hybrid ASR target than on the end-to-end ASR target. A possible reason is that our auditor is constituted with a shadow model which has a hybrid ASR architecture. When focusing on accuracy, the highest audit score on the hybrid ASR target model is 78.8%, while that on the end-to-end ASR target model is 71.92%. The difference in the auditor's precision is not substantial, with the highest precision scores being 81.4% and 79.1%,
respectively. However, in terms of recall, the auditor’s ability to determine the user-level membership on
the hybrid ASR target model is much higher than the end-to-end target model, with maximum recall of
90% and 72%, respectively.
When auditing the hybrid ASR target model, we observed the auditor significantly outperforming its results on the end-to-end target model. The training and testing data for both state-of-the-art ASR model architectures (i.e., hybrid and
end-to-end) are the same. Thus, to confidently understand the impact of different ASR model pipelines
on the auditor’s performance, we shall also investigate the difference between the overfitting level of the
hybrid ASR target model and that of the end-to-end ASR target model, as the overfitting of the model
increases the success rate of membership inference attacks [38]. Recall that overfitting was previously
defined in Section 4.3.3. The overfitting value of the hybrid ASR target model is measured as 0.04, while
the overfitting level of the end-to-end ASR target model is 0.14. Contrary to the conclusions observed
by [43], the target model that was more overfit did not increase the performance of our user-level audio
auditor. One likely reason is that our auditor audits the target model by considering user-level information
under strict black-box access. Compared to conventional black-box access in [43], our strict black-box
access obtains its output from the transcribed text alone; consequently the influence of overfitting on
specific words (WER) would be minimized. Thus, we can observe that our auditor’s success is not entirely
attributed to the degree of the target ASR model’s overfitting alone.

In conclusion, different ASR pipelines between the target model and the shadow model negatively
impact the performance of the auditor. Nevertheless, our auditor still performs well when the
target model is trained following a different pipeline (i.e., an end-to-end ASR system), significantly
outperforming random guesses (50%).

4.4.8 Real-World Audit Test


To test the practicality of our model in the real world, we keep our auditor model locally and conduct a
proof-of-concept trial to audit iPhone Siri’s speech-to-text service. We select the auditor trained by the
GRU shadow model with LibriSpeech 360-hour voice data as its training set. To simplify the experiments,
we sample five audios per user for each user’s audit. According to the results presented in Fig. 4.6 and
Fig. 4.7, we select the auditor trained with five audios per user, where 1,000 users were sampled randomly
as the auditor’s training set. To gain the average performance of our auditor in the real world, we stored
100 auditors under the same settings with the training set constructed 100 times. The final performance is
the average of these 100 auditors’ results.
Testbed and Data Preprocess. The iPhone Siri provides strict black-box access to users, and the only
dictation result is its predicted text. All dictation tasks were completed and all audios were recorded in
a quiet surrounding. The clean audios were played via a Bluetooth speaker to ensure Siri could sense the audios. User-level features were extracted as per Section 4.3.2. The targeted Siri runs on an iPhone X with iOS 13.4.1.
Ground Truth. We target a particular user û's iPhone Siri speech-to-text service. According to Apple's privacy policy for Siri (see Appendix D), an iPhone user's Siri recordings can be selected to improve Siri and the dictation service in the long term (for up to two years). We do note that this is an opt-in service. Simply
put, this user can be labeled as “member”. As for the “nonmember” user, we randomly selected 52
speakers from LibriSpeech dataset which was collected before 2014 [180]. As stated by the iPhone
Siri’s privacy policy, users’ data “may be retained for up to two years”. Thus, audios sampled from
LibriSpeech can be considered out of this Siri’s training set. We further regard the corresponding speakers
of LibriSpeech as “nonmember” users. To avoid nonmember audios entering Siri’s training data to retrain
its ASR model during the testing time, each user’s querying audios were completed on the same day we
commenced tests for that user, with the Improve Siri & Dictation setting turned off.
As we defined above, a "member" may make the following queries to our auditor: querying the auditor (i) with audios within the target model's training set ($D_{\hat{u}} = \bigcup A^{in}_{mem}$); (ii) with audios out of the target model's training set ($D_{\hat{u}} = \bigcup A^{out}_{mem}$); (iii) with part of his or her audios within the target model's training set ($D_{\hat{u}} = (\bigcup A^{in}_{mem}) \cup (\bigcup A^{out}_{mem})$). Thus, we generate six "member" samples where the audios were all recorded by the target iPhone's owner, including $D_{\hat{u}} = \bigcup_{k=5} A^{in}_{mem}$, $D_{\hat{u}} = \bigcup_{m=5} A^{out}_{mem}$, and $D_{\hat{u}} = (\bigcup_{k} A^{in}_{mem}) \cup (\bigcup_{m} A^{out}_{mem})$, where $k = 1, 2, 3$ and $k + m = 5$. In total, we collected 58 user-level samples with 6 "member" and 52 "nonmember" samples.
Results. We load 100 auditors to test those samples; the averaged overall accuracy is 89.76%. Specifically, the average precision of predicting the "member" samples is 58.45%, while the average precision of predicting the "nonmember" samples is 92.61%. The average ROC AUC result is 72.6%, which indicates our auditor's separability in this experiment. Apart from the different behaviors of Siri when translating the audios from "member" and "nonmember" users, we suspect that another reason for the high precision on "nonmember" is that the LibriSpeech audios are out of Siri's dictation scope. As for the low precision rate on "member" samples, we single out the data $D_{\hat{u}} = \bigcup_{k=5} A^{in}_{mem}$ for testing. Its average accuracy result can reach 100%; thus, the auditor is much more capable of handling $A^{in}_{mem}$ than $A^{out}_{mem}$, corroborating our observation in Section 4.4.2.

In conclusion, our auditor shows a generally satisfying performance for users auditing a real-world
ASR system, Apple’s Siri on iPhone.

4.5 Threats to Auditors’ Validity


Voiceprints Anonymization. In determining the user-level membership of audios in the ASR model, our
auditor relies on the target model’s different behaviors when presented with training and unseen samples.
The auditor’s quality depends on the diverse responses of the target model when translating audio from
different users. This characteristic is known as the user's voiceprint. The voiceprint is measured in [182] based on a speaker recognition system's accuracy. Our auditor represents the user's voiceprint according to two accumulated features, missing characters and extra characters. However, if an ASR system is built using voice anonymization, our user-level auditor's performance would degrade significantly. The speaker's voice is disguised in [182] by using robust voice conversion while ensuring the correctness of speech content recognition. Herein, the most popular technique of voice conversion is frequency warping [183]. In addition, abundant information about speakers' identities is removed in [184] by using adversarial training on the audio content feature. Table 4.1 shows that the average accuracy of the auditor dropped by approximately 20% without using the two essential features. Hence, auditing user-level
membership in a speech recognition model trained with anonymized voiceprints remains a future
avenue of research.
Differentially Private Recognition Systems. Differential privacy (DP) is one of the most popular
methods to prevent ML models from leaking any training data information. The work [171] protects the
text generative model by applying user-level DP to its language model. A language model is likewise used during the hybrid ASR system's training, to which user-level DP can be applied to obscure identity, at the sacrifice of transcription performance. The speaker and speech characterization process is
protected in [185] by inserting noise during the learning process. However, due to strict black-box access
and the lack of output probability information, our auditor’s performance remains unknown on auditing
the ASR model with DP. The investigation of our auditor’s performance to this user protection mechanism
is open for future research.
Workarounds and Countermeasures. Although Salem et al. [43] have shown that neither the shadow model nor the attack model is required to perform membership inference, under the constraints of strict black-box access the shadow-model-and-auditor approach provides a promising means to perform the more difficult task of user-level membership inference. Instead of the output probabilities, we mainly leverage the ASR model's translation errors at the character level to represent the model's behaviors.
Alternative countermeasures against the membership inference, such as dropout, generally change the
target model’s output probability. However, the changes to probabilities of the ASR model’s output are
not as sensitive as changes to its translated text [43]. Studying the extent of this sensitivity of ASR models
remains our future work.
Synthetic ASR Models. Another limitation of our work is that we evaluate our auditor on synthetic ASR
systems trained on real-world datasets, and we have not applied the auditor to an extensive set of real-world
models aside from Siri. However, we believe that our reconstruction of the ASR models closely mirrors
ASR models in the wild.

4.6 Related Work


Membership Inference Attacks. As a fundamental privacy threat to ML models, the membership
inference attack distinguishes whether a particular data sample is a member of the target model’s training
set or not. Traditional membership inference attacks against ML models under black-box access leverage
numerous shadow models to mimic the target model’s behavior [38, 186, 170]. Salem et al. [43] revealed
that membership inference attacks could be launched by directly utilizing the prediction probabilities and
thresholds of the target model. Both works [186] and [187] prove that overfitting of a model is sufficient
but not necessary for the success of a membership inference attack. Yeom et al. [187] as well as Farokhi
and Kaafar [178] formalize the membership inference attack with black-box and white-box access. All
previously mentioned works consider record-level inference; however, Song and Shmatikov [171] deploy
a user-level membership inference attack in text generative models, with only the top-n predictions known.
Trustworthiness of ASR Systems. The ASR systems are often deployed on voice-controlled de-
vices [172], voice personal assistants [188], and machine translation services [173]. Tung and Shin [189]

propose SafeChat, which utilizes a masking sound to distinguish authorized audios from unauthorized recordings and protect against information leakage. Recent works [190] and [191] propose an audio cloning attack and an audio replay attack against the speech recognition system to impersonate a legitimate user or inject
unintended voices. Voice masquerading to impersonate users on the voice personal assistants has been
studied [192]. Zhang et al. [192] also propose another attack, namely voice squatting, which hijacks the user's voice command by producing a sentence similar to the legitimate command. Du et al. [173] generate
adversarial audio samples to deceive the end-to-end ASR systems.
Auditing ML Models. Many of the current proposed auditing services also seek to audit the bias and
fairness of a given model [193]. Works have also been presented to audit the ML model to learn and check
the model’s prediction reliability [194, 195, 196]. Moreover, the auditor is utilized to evaluate the ML
model’s privacy risk when protecting an individual’s digital rights [171, 197].
Our Work. Our user-level audio auditor audits the ASR model under the strict black-box access. As
shown in Section 4.3, we utilize the ASR model's translation errors at the character level to represent the
model’s behavior. Compared to related works under black-box access, our auditor does not rely on the
target model’s output probability [38, 43]. In addition, we sidestep the feature pattern of several top-ranked
outputs of the target model adopted by Song and Shmatikov [171], instead we use one text output, the
user’s speed, and the input audio’s true transcription, as we do not have access to the output probability
(usually unattainable in ASR systems). Hence, our constraints of strict black-box access only allow
accessing one top-ranked output. In this case, our user-level auditor (78.81%) outperforms Song’s and
Shmatikov’s user-level auditor (72.3%) in terms of accuracy. Moreover, Hayes et al. [170] use adversarial
generative networks (GANs) to approximate the target model’s output probabilities while suffering big
performance penalties with only 20% accuracy, our auditor’s accuracy is far higher. Furthermore, our
auditor is much easier to be trained than the solution of finding outlier records with a unique influence
on the target model [186], because we only need to train one shadow model instead of many shadow (or
reference) models.

4.7 Limitations and Future Work


Further investigation on features. From our set of selected features, both the audio-specific features and the features capturing model behaviors perform well, as observed in our results. It remains to be seen whether
additional audio-specific features would specifically aid the task of user-level auditing. As there is a
plethora of potential feature candidates, we consider this as part of future work.
Auditing performance with varied numbers of queries. In our auditor, we observe that only a limited
number of queries per user is necessary to audit the target ASR model, especially when the auditor is
trained with a limited number of audios per user. An interesting observation was that our user-level auditor’s recall
performance on more queries per user declines under the strict black-box access. We are continuing our
investigation into why our auditor’s ability to find unauthorized use of user data varies in this manner
when being queried with different numbers of audios.
Member audio in Siri auditing. In our setting, we make our best effort to ensure member audios are
used for training. However, in our real-world evaluation, even with the “Improve Siri & Dictation” setting
turned on and an extended period of continual use by our user, we cannot guarantee that member audios
of our member user were actually used for training, although we are confident they have been included.

4.8 Conclusion
This work highlights and exposes the potential of carrying out user-level membership inference audit in
IoT voice services. The auditor developed in this chapter has demonstrated promising data transferability,
while allowing a user to audit his or her membership with a query of only nine audios. Even when the queried
audios are not within the target model’s training set, the user’s membership can still be faithfully determined.
While our work has yet to overhaul the audit accuracy on various IoT applications across multiple learning
models in the wild, we do narrow the gap towards defining clear membership privacy at the user level,
rather than the record level [38]. However, questions remain about whether the privacy leakage stems from
the data distribution or from the intrinsic uniqueness of the record. Nevertheless, as we have shown, both a small
training set size and the Combined Auditor, which combines results from various ASR shadow models to
train the auditor, have a positive effect on the IoT audit model; on the contrary, audios recorded in a noisy
environment and different ASR pipelines impose a negative effect on the given auditor; fortunately, the
auditor still outperforms random guesses (50%). Examining other performance factors on more real-world
ASR systems in addition to our iPhone Siri trial and extending possible countermeasures against auditing
are all worth further exploration.

Acknowledgments
We thank all anonymous reviewers for their valuable feedback. This research was supported by Australian
Research Council, Grant No. LP170100924. This work was also supported by resources provided by the
Pawsey Supercomputing Centre, funded from the Australian Government and the Government of Western
Australia.

Appendix
A. Datasets
The LibriSpeech speech corpus (LibriSpeech) contains 1,000 hours of speech audios from audiobooks
which are part of the LibriVox project [180]. This corpus is widely used for training and evaluating speech
recognition systems. At least 1,500 speakers have contributed their voices to this corpus. We use 100 hours
of clean speech data with 29,877 recordings to train and test our target model. 360 hours of clean speech
data, including 105,293 recordings, are used for training and testing the shadow models. Additionally,
there are 500 hours of noisy data used to train the ASR model and to test our auditor’s performance in a
noisy environment.
The TIMIT speech corpus (TIMIT) is another famous speech corpus used to build ASR systems. This
corpus recorded audios from 630 speakers across the United States, totaling 6,300 sentences [23]. In this
work, we use all this data to train and test a target ASR model, and then audit this model with our auditor.
The TED-LIUM speech corpus (TED) collected audios based on TED Talks for ASR development
[181]. This corpus was built from the TED talks of the international workshop on spoken language
translation (IWSLT) 2011 Evaluation Campaign. There are 118 hours of speeches with corresponding
transcripts.

B. Evaluation Metrics
The user-level audio auditor is evaluated with four metrics calculated from the confusion matrix, which
reports the number of true positives, true negatives, false positives, and false negatives: True Positive
(TP), the number of records predicted as “member” that are correctly labeled; True Negative (TN),
the number of records predicted as “nonmember” that are correctly labeled; False Positive (FP), the
number of records predicted as “member” that are incorrectly labeled; False Negative (FN), the number
of records predicted as “nonmember” that are incorrectly labeled. Our evaluation metrics are derived
from the above-mentioned numbers.

• Accuracy: the percentage of records correctly classified by the auditor model.

• Precision: the percentage of records correctly determined as “member” by the auditor model
among all records determined as “member”.

• Recall: the percentage of all true “member” records correctly determined as “member”.

• F1-score: the harmonic mean of precision and recall.
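As a minimal, illustrative sketch (not the thesis code), the four metrics can be computed from the confusion-matrix counts as follows; the example counts at the end are hypothetical.

def audit_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall, and F1-score from the four confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: 40 member and 40 nonmember records audited.
print(audit_metrics(tp=32, tn=30, fp=10, fn=8))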

C. ASR Models’ Architectures


On the LibriSpeech 360-hour voice dataset, we build one GRU-ASR model with the Pytorch-Kaldi toolkit.
That is, we train a five-layer GRU network with each hidden layer of size 550 and one Softmax layer.
We use tanh as the activation function. The optimization function is Root Mean Square Propagation
(RMSProp). We set the learning rate to 0.0004, the dropout rate for each GRU layer to 0.2, and the number
of training epochs to 24.
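In Pytorch-Kaldi these hyper-parameters are specified in a configuration file rather than in code; the plain-PyTorch sketch below only illustrates the shape of such an acoustic model. The fMLLR feature dimension and the number of HMM-state targets (feat_dim, n_states) are placeholders, not values taken from this thesis.

import torch
import torch.nn as nn

class GRUAcousticModel(nn.Module):
    # Five GRU layers of size 550 (GRU gates use tanh internally), 0.2 dropout
    # between layers, and a softmax output over HMM-state posteriors.
    def __init__(self, feat_dim=40, n_states=2000):
        super().__init__()
        self.gru = nn.GRU(input_size=feat_dim, hidden_size=550,
                          num_layers=5, dropout=0.2, batch_first=True)
        self.out = nn.Linear(550, n_states)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        h, _ = self.gru(feats)
        return torch.log_softmax(self.out(h), dim=-1)

model = GRUAcousticModel()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.0004)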
On the LibriSpeech 360-hour voice dataset, we train another ASR model using the Pytorch-Kaldi
toolkit. Specifically, it is a four-layer RNN network with each hidden layer of size 550, using ReLU as the
activation function and a Softmax layer. The optimization function is RMSProp. We set the learning rate
to 0.00032, the dropout rate for each RNN layer to 0.2, and the number of training epochs to 24.
On the LibriSpeech 360-hour voice dataset, we train one hybrid LSTM-ASR model. The acoustic
model is constructed with a four-layer LSTM and one Softmax layer. The size of each hidden LSTM layer
is 550 along with 0.2 dropout rate. The activation function is tanh, while the optimization function is
RMSProp. The learning rate is 0.0014, and the maximum number of training epochs is 24.
On the LibriSpeech 100-hour voice dataset, we train a hybrid ASR model. The acoustic model is
constructed with a four-layer LSTM and one Softmax layer. Each hidden LSTM layer has 550 neurons
along with 0.2 dropout rate. The activation function is tanh, while the optimization function is RMSProp.
The learning rate is 0.0016, and the maximum number of training epochs is 24.
On the LibriSpeech 100-hour voice dataset, we train an end-to-end ASR model. The encoder is
constructed with a five-layer LSTM with each layer of size 320 and with 0.1 dropout rate. We use one
layer location-based attention with 300 cells. The decoder is constructed with a one-layer LSTM with 320
neurons along with 0.5 dropout rate. The CTC decoding is enabled with a weight of 0.5. The optimization
function is Adam. The learning rate is 1.0, and the total number of training epochs is 24.
On the TEDLium dataset, we train a hybrid ASR model. The acoustic model is constructed with
a four-layer LSTM and one Softmax layer. Each hidden LSTM layer has 550 neurons along with 0.2
dropout rate. The activation function is tanh, while the optimization function is RMSProp. The learning
rate is 0.0016, and the maximum number of training epochs is 24.
On the TIMIT dataset, we train a hybrid ASR model. The acoustic model is constructed with a
four-layer LSTM and one Softmax layer. Each hidden LSTM layer has 550 neurons along with 0.2
dropout rate. The activation function is tanh, while the optimization function is RMSProp. The learning
rate is 0.0016, and the maximum number of training epochs is 24.

D. Real-World Audit Test


Siri is a virtual assistant provided by Apple in iOS, iPadOS, and macOS. Siri’s natural language
interface is used to answer users’ voice queries and make recommendations [198]. The privacy policy of
Apple’s Siri is shown in Fig. 4.11. The user, who is considered as a member user of Siri’s ASR model
in our setting, has used the targeted iPhone for more than two years, frequently interacting with Siri,
often with the Improve Siri & Dictation service opted in. As for the member user’s member audio, we
carefully chose five phrases that the user had certainly used when engaging with Siri. These phrases start with
“Hey Siri” or come from common interactions, including “Hey Siri”, “What’s the weather today”, “What
date is it today”, “Set alarm at 10 o’clock”, and “Hey Siri, what’s your name”. As for the member
user’s non-member audio, we chose five short phrases in LibriSpeech that the user had never used to
interact with Siri (e.g., “we ate at many men’s tables uninvited”). These phrases were recorded in the
member user’s voice to form our member user’s non-member audios. The use of either set of phrases
should still enable auditing since, as noted, our method models the user as a whole when auditing
the model, irrespective of whether a specific audio phrase was used to train/update this model. Lastly,
for the nonmember users’ nonmember audio, the target Siri’s language is English (Australia). Since the
LibriSpeech dataset was collected before 2014, and since the iPhone Siri’s privacy policy states that users’
data “may be retained for up to two years”, we consider these recordings not to be part of the Siri training
dataset, and thus we have nonmember users’ nonmember audio. We further assume that our selected user
phrases about book reading are nonmembers.

Figure 4.11: The privacy policy of Apple’s Siri

Chapter 5

The Audio Auditor: No-Label


User-Level Membership Inference in
Internet of Things Voice Services

With the fast development of machine learning techniques, the voice services embedded in various Internet
of Things (IoT) devices have become among the most popular functions in people’s daily lives. In this chapter, we
examine user-level membership inference targeting an automatic speech recognition (ASR) model within
such voice services under no-label black-box access. Specifically, we design a user-level audio auditor to
determine whether a specific user had unwillingly contributed audio used to train the ASR model, when
the service only reacts to the user’s query audio without providing the translated text. With a user-level representation
of the input audio data and the corresponding system reaction, our auditor performs effective
user-level membership inference. Our experiments show that the auditor behaves better with more
training samples and with more audios per user. We evaluate the auditor on ASR models trained
with different algorithms (LSTM, RNNs, and GRU) on a hybrid ASR system (Pytorch-Kaldi). We hope
the methodology and findings developed in this chapter can inform privacy advocates to overhaul IoT
privacy.

5.1 Introduction
With the advance of machine learning (ML) techniques, ML-powered acoustic systems, also known as
automatic speech recognition (ASR) systems, have become more efficient and effective in our daily lives
[144, 145, 146]. Devices and applications integrating the acoustic systems are ubiquitous, like Amazon
Echo, Google Assistant and Apple’s Siri, enabling the full potential of intelligent voice-controlled devices,
voice personal assistants, and machine translation services [173]. In spite of its popularity, the privacy
risk and unauthorized access of personal acoustic data has raised concerns in the security community
[192, 189, 199]. As surveyed by Malkin et al. [200] and reported by the media [149, 150, 201], most users
consider storing audio recordings permanently in voice assistants unacceptable, and are also strongly
against exposing their data to any third parties.
A user’s audio records are protected by laws and regulations, including the General
Data Protection Regulation (GDPR) [142], the Children’s Online Privacy Protection Act (COPPA) [143],
and the California Consumer Privacy Act (CCPA) [168]. Specifically, the “Right to be Forgotten” [147]
protects a user’s audio data from being continuously accessed by any third party [148]. However, many
devices could sniff and analyze audio without a user’s consent when using voice services [151]. By
analyzing a few audio samples and learning the speaker’s voice characteristics, some voice cloning
systems can synthesize his or her voice [190, 202, 191]. A news article reported that a scammer had
impersonated an acquaintance by using cloned audios [152]. Hence, a user-level audio auditor is strongly
desired to allow users to verify the provenance of their leaked data.
A fundamental problem named membership inference has been widely studied in recent years.
It exposes information about an ML model’s training set under different conditions with black-box
access. Despite its privacy risk, it is also a good method for the auditing problem [171, 197]. The first
investigation was conducted by Shokri et al. [38] using the shadow training technique. A record’s membership
of a target model’s training set can be determined under black-box access with a few assumptions. The
first assumption is that shadow models are established using the same structure as the target model. The
second assumption is that shadow models are trained using a dataset from the same distribution as the
target model’s. The third assumption is that the prediction results of the target model contain the output label
and its corresponding confidence score.
Follow-up research relaxes these assumptions gradually. Salem et al. [43] relaxed the first two
assumptions by picking a threshold based on the output confidence score. Song and Shmatikov [171]
further conducted membership inference exploiting several top-ranked labels as the model’s prediction.
Choo et al. [203], Li and Zhang [204], and Miao et al. [197] further relax the third assumption so that the
prediction results of the target model contain only the output label. This chapter fully relaxes the third
assumption: no explicit label is provided by the target model.
Motivation. The ASR model is the core model of a voice assistant, providing the fundamental function
of translation. Given an input audio query, the ASR model translates it into a text command
that is understood and processed by the voice assistant. Herein, the translated text is the output label of the
ASR model. However, some voice services, especially on IoT devices, do not provide the translated
text. Instead, the system reacts directly according to the text content. In this case, it is impossible for
users to audit the ASR model with the help of previous membership inference techniques. More advanced
membership inference techniques are needed. Additionally, record-level membership inference determines
whether a specific record is a member of an ML model’s training set or not. However, when using
the voice service, it is hard for users to query with audio recordings identical to the training samples, even if
their text contents and speakers are the same. It is more plausible for an auditor to investigate user-level
membership inference. Consistent with [197], we define user-level membership inference as follows: querying
with a user’s data, if this user has any data within the target model’s training set, even if the queried data are
not members of the training set, this user is a user-level member of this training set.
No-Label Audio Auditor. We design and implement a no-label audio auditor for this specific problem.
Assume an ASR model with black-box access that reacts directly based on the voice
content without providing the explicit translation. Thus, instead of using the ASR model’s translations,
we try to observe the model’s behavior according to its reactions. To simplify the reaction information
collecting process, we assume the voice service contains an online search function. Further, we limit
the query’s voice content to requests that require the service to search for an answer online. In such a case, the
reaction information is the search results. Our auditor analyzes the reaction information based on
the ASR model’s predicted text; compares it with the search results based on the audio’s true text; learns
the model’s different behaviors when translating its known data and unknown data; and finally determines
the user-level membership for a specific user. In the no-label access setting, a naive baseline strategy
considers a user to be a member user of the target training set when the model’s reactions based on its
translations all match the search results based on the true transcriptions.
It is challenging to build such a no-label audio auditor for user-level membership inference. (i)
No-label access means little information about the target ASR model; the prior knowledge, i.e., the system’s
reaction information, is only a rough representation of the model’s translation results. (ii) Different from
record-level inference, user-level membership inference requires the auditor to be highly robust in
distinguishing the model’s different behaviors. Specifically, the auditor should be capable of distinguishing
the model’s behavior on two aspects: the translation accuracy for different voice content and the
translation accuracy for different speakers. (iii) ASR systems have complicated learning architectures
processing time-series audio data [172, 154, 173]. It is quite time-consuming and computationally
expensive to build shadow models for membership inference.
In summary, we design and evaluate our no-label audio auditor to help users distinguish whether their
audio samples have been used to train an ASR model without their consent. The contributions of this
work are listed as follows:

• We broaden the class of membership inference problems and propose a no-label audio auditor
against an ASR model. A set of features is extracted from search results as the target ASR
system’s reaction. With user-level statistical analysis, a no-label audio auditor is built for user-level
membership inference. With access to the system’s reaction information only, our auditor achieves
an AUC score of around 75%, while random guessing only achieves 50%.

• A new shadow training technique is proposed. Instead of imitating the target model’s behavior, our
shadow model imitates the target system’s reactions. Accordingly, the system’s reaction analysis
reflects the target model’s behavior.

• Our auditor is generic and not dependent on the ASR model’s structure. We establish shadow
models with different algorithms.

The rest of the chapter is organized as follows: Section 5.2 introduces the background about the
target ASR model and related membership inference attacks. Section 5.3 illustrates the details of no-label
user-level membership inference as our auditor. Section 5.4 shows the setup and the results of our
experiments. Finally, Section 5.5 concludes the chapter.

5.2 Related Work


This section introduces the state-of-the-art Automatic Speech Recognition (ASR) models and the related
work about membership inference on ASRs.

5.2.1 The Automatic Speech Recognition (ASR) Model


While conventional ASR models are based on hidden Markov models (HMMs), current state-of-the-art
ASR models utilise deep neural networks (DNNs). In this chapter, we audit a state-of-the-art hybrid
DNN-HMM ASR system built with the Pytorch-Kaldi toolkit, assuming no-label black-box access.
A Pytorch-Kaldi ASR system is mainly a DNN-HMM-based acoustic model [157]. As shown in
Fig. 3.1, typically, a hybrid ASR system is composed of a preprocessing step, a model training step, and a
decoding step [158]. During the preprocessing step, features are extracted from the input audio, while
the corresponding text is processed as the audio’s label. The model training step trains a DNN model to
create HMM class posterior probabilities. The decoding step maps these HMM state probabilities to a
text sequence. In this work, the hybrid ASR system is built using the pytorch-kaldi speech recognition
toolkit [159]. Specifically, feature extraction transforms the audio frames into the frequency domain as
Mel-Frequency Cepstral Coefficient (MFCC) features. As an additional processing step, feature-space
Maximum Likelihood Linear Regression (fMLLR) is used for speaker adaptation. Three popular neural
network algorithms are used to build the acoustic model, including Long Short-Term Memory (LSTM),
Gated Recurrent Units (GRU), and Recurrent Neural Networks (RNNs). The decoder involves a language
model which provides a language probability to re-evaluate the acoustic score. The final transcription
output is the word sequence with the highest combined score.
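As a rough illustration of the pre-processing step only (in the Pytorch-Kaldi pipeline the actual MFCC and fMLLR features are produced by Kaldi recipes), MFCC features can be extracted from a 16 kHz waveform with torchaudio as sketched below; the file name and MFCC settings are placeholders.

import torchaudio

# Load a waveform and compute MFCC features; in the real pipeline these would
# additionally be adapted per speaker with fMLLR before training.
waveform, sample_rate = torchaudio.load("audio.wav")   # placeholder path
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 23},
)(waveform)
print(mfcc.shape)   # (channels, n_mfcc, frames)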

5.2.2 Membership Inference Attack on ASRs


The membership inference attack is considered a significant privacy threat for machine learning (ML)
models [77]. The attack aims to determine whether a specific data sample is within the target model’s
training set or not. The attack is driven by the different behaviors of the target model when making
predictions on samples within or out of its training set.
Various membership inference attack methods have been recently proposed. Shokri et al. [38] train
shadow models to constitute the attack model against a target ML model with black-box access. The
shadow models mimic the target model’s prediction behavior. To improve accuracy, Liu et al. [176] and
Hayes et al. [170] leverage Generative Adversarial Networks (GAN) to generate shadow models with
increasingly similar outputs to the target model. Salem et al. [43] relax the attack assumptions mentioned
in the work [38], demonstrating that shadow models are not necessary to launch the membership inference
attack. Instead, a threshold of the predicted confidence score can be defined to substitute the attack model.
Intuitively, a large confidence score indicates that the sample is a member of the training set [177]. Choo
et al. [203] and Li and Zhang [204] further broaden the attack assumptions and launch the membership
inference attack when the target model only provides the predicted label without a confidence score. Choo
et al. [203] utilize data augmentations and adversarial examples to expose the model’s decision
boundary. Li and Zhang [204] propose a transfer-based attack and a perturbation-based attack. The former
relies on a shadow model and a dataset with the same distribution as the target training set. The latter
relies on adversarial example techniques and measures the effort required to perturb a sample until it is predicted
as a different label.
Apart from black-box access, Farokhi and Kaafar [178] model the record-level membership
inference attack under white-box access. The attacks mentioned above are all performed
at the record level, while Song and Shmatikov [171] study a user-level membership inference attack
against text generative models. Instead of using the prediction label along with the confidence score, Song
and Shmatikov [171] utilize the word-rank information of several top-ranked predictions as key features
to build the shadow model.

In this work, we use membership inference techniques to audit an ASR model that does not provide
any explicit translation. By observing the ASR system’s reaction, we aim to verify whether a specific
speaker had unwillingly contributed audio to train the ASR model. Different from image recognition
systems or text generative systems, the auditor faces additional challenges in ASR systems, especially with
no-label access [172]. With limited discriminative power, features can only be extracted from the system’s
reactions, the input audio, and the true transcription to launch our membership inference, i.e., no-label audio
auditing in this chapter.

5.3 No-Label Audio Auditor


In this section, we first formalize our research problem. Second, we give an overview of our user-level
audio auditor, including the process of constructing it under no-label black-box
access and how we use this auditor to audit the target ASR model. Finally, we show how we implement
this auditor.

5.3.1 Problem Statement


When users use a voice assistant, an ASR model implicitly translates the input audio into text and delivers
the text content to the system for analysis and reaction. With no-label black-box access, our auditor can
be established by collecting and analyzing information about the input audio and the system’s reaction.
First, we formalize the process of the ASR model’s translation. Second, we formalize the process of
the system’s analysis and reaction. Third, we formalize the success of our auditing process. Finally, we
summarize the prior knowledge available to the auditor under no-label black-box access to the target model.
Let (x, y) ∈ X × Y denote an audio sample, where x represents the audio component, and y is the
actual text of x. Assume an ASR model is a function F : X → Y, so F(x) is the model’s translated text.
The smaller the difference between F(x) and y, the better the ASR model performs. Consider a training
audio set A of size N sampled from D (A ∼ D^N), where D represents a distribution of audio samples.
The ASR model trained with the dataset A is denoted as F_A. Querying the ASR model with an audio
sample (x, y), the text delivered to the system is denoted as y′ = F_A(x).
The system receives the text y′, analyzes its content, and reacts accordingly. Assume the content does
not include any commands except searching online, which can be controlled by users. Thus, the reaction
should be the results of searching the text y′ online. We denote the reaction function as R, and the
reaction information for the delivered text y′ is denoted as r_A^{y′} = R_A(y′). If we search online with the
actual text y, then the reaction information is r_A^{y} = R_A(y).
We define user-level membership inference as querying a user’s audio and trying to determine whether
any audio within the target model’s training set belongs to this user. Even if the queried audios are not
members of the training set, as long as other audio belonging to this user is in the training set, this
user is regarded as a user-level member of the training set. Assume the target ASR model is F_A and
the system provides the corresponding reaction R_A. Let U be the speaker set of A, of size M (U ← A).
Let 𝒜 represent our no-label audio auditor. The user-level auditing process can be formalized as:

• A speaker u has S = ∪_{i=1}^{n} (x_i, y_i), where u ← S.

• Let R_A^{Y} = ∪_{i=1}^{n} r_A^{y_i}, where r_A^{y_i} = R_A(y_i).

• Let R_A^{Y′} = ∪_{i=1}^{n} r_A^{y_i′}, where r_A^{y_i′} = R_A(y_i′) and y_i′ = F_A(x_i).

• Let “member” = 0 and “nonmember” = 1.

• Set b = 0 if u ∈ U, or b = 1 if u ∉ U.

• The auditor succeeds if 𝒜(u, S, R_A^{Y}, R_A^{Y′}) = b; otherwise it fails.
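Read procedurally, the formalization amounts to the following sketch, in which asr_system, search, and auditor are placeholders for the composed reaction R_A(F_A(·)), the reaction function R_A, and the auditor 𝒜; it is not the thesis implementation.

def audit_user(auditor, asr_system, search, user_audios, b_true):
    # user_audios is the set S = {(x_i, y_i)} for one speaker u.
    reactions_pred, reactions_true = [], []
    for x, y in user_audios:
        reactions_pred.append(asr_system(x))   # r_A^{y_i'}: reaction to the hidden translation
        reactions_true.append(search(y))       # r_A^{y_i}: reaction to the true text
    b_pred = auditor(user_audios, reactions_true, reactions_pred)  # 0 = member, 1 = nonmember
    return b_pred == b_true                    # the auditor succeeds if the labels match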

Prior Knowledge. Our auditor performs user-level membership inference under no-label
black-box access. With no-label black-box access, when querying an ASR model with an audio, the
system reacts directly based on the model’s translation content without providing the explicit text. To
simplify the problem, we define the reaction function as searching online by controlling the content of the
queried audio. When an auditor aims to audit an ASR model, the query audio can be selected or generated
artificially. If the system’s reaction is not searching online, the corresponding query audio would not be
analyzed by the auditor. Detailed descriptions of our prior knowledge with the no-label black-box access
are listed below:

• Query records. When an auditor selects or generates a proper audio to query the ASR model, the
audio sample and its true transcription are known.

• Reaction results. When the system reacts to the query audio of the ASR model, the reaction informa-
tion, i.e., the search results, is available to be collected and analyzed.

• User-level information. Since the query audios are selected or generated by the auditor, the number
of speakers and their corresponding audios are known.

• Reaction function. Although the ASR model is under no-label black-box access, the reaction
function it is connected to can be observed.

5.3.2 No-Label User-Level Membership Inference


The nature of membership inference [38] is to observe the difference in a model’s behavior when it is fed with samples
it knows (training data) and with unknown samples. User-level membership inference needs higher
robustness to learn the relationship between the model’s behavior and the speaker’s characteristics. In the
no-label black-box access setting, our auditor needs to consider the ASR model within a system containing
an online search function. Such online search results are extracted to represent the ASR model’s
behavior.
Fig. 5.1 illustrates the overall process of our audio auditor performing user-level membership inference
under no-label black-box access. Generally, there are two processes: training and auditing. The
training process builds a binary classifier as a user-level membership auditor 𝒜_audit using a supervised
learning algorithm. The auditing process uses this auditor to audit an ASR model F_tar by querying a few
audios spoken by one user u. Both processes perform the same data collection and
feature extraction steps.
Training Process. The primary task in the training process is to establish a shadow system which includes a
shadow model F_shd and a system reaction function R_shd. To mimic the target system, we set R_shd
to be the same as R_tar. As mentioned above, the reaction function is searching online. Thus, our shadow
model should have similar behavior to the target model at the semantic level.

Figure 5.1: The overall process of our audio auditor performing user-level membership inference under
no-label black-box access. (i) In the training process, we sample one audio set from the auxiliary reference
dataset D_ref to build one shadow model. The shadow model dataset A_shd ∼ D_ref is split into a training set
A_shd^train and a testing set A_shd^test. Then we query the shadow system with A_shd^test and A_shd^train to collect data.
After the feature extraction process, we label each user-level record as “member” or “nonmember”. Then
an audit model can be trained with these outputs of the shadow system. (ii) In the auditing process, we
randomly sample a particular speaker’s (u’s) audios A_u ∼ D_users to query our target ASR system and
collect data. Feature vectors from the outputs of the target ASR system are passed to the audit model to
determine whether u ∈ U_tar ← A_tar holds.

We sample an audio set A_shd from the auxiliary reference dataset D_ref based on the target system’s
performance. The shadow model dataset A_shd ∼ D_ref is split into a training set A_shd^train and a testing
set A_shd^test. Specifically, we first query the target system with an audio (x_ref, y_ref) and get r_tar^{y_ref′}. Then
we query the system reaction R_tar with y_ref and get r_tar^{y_ref}. If r_tar^{y_ref′} has a high similarity with r_tar^{y_ref}, then
(x_ref, y_ref) ∈ A_shd^train; otherwise (x_ref, y_ref) ∈ A_shd^test.
In the data collection step, we query our shadow model F_shd and reaction function R_shd with A_shd^test and
A_shd^train. For each audio record (x_ref, y_ref) spoken by a speaker u_ref, we collect a set of information
(x_ref, y_ref, r_shd^{y_ref}, r_shd^{y_ref′}) as described in Section 5.3.2. In the feature extraction step, we extract nine features for
each record and perform statistical analysis at the user level.
The feature extraction is described hereafter. Assume u_ref ∈ U_ref and U_ref^train ∩ U_ref^test = ∅. If the user
u_ref has n query records, simple statistics are computed over these n records, including sum, mean, median,
minimum, maximum, standard deviation, and variance. Then we label u_ref’s statistically analyzed record
as “member” if (x_ref, y_ref) ∈ A_shd^train, and otherwise as “nonmember”.
After the feature extraction step, the training set for the auditor 𝒜 is prepared. Random Forest (RF) is
used to build this binary classification model for the following auditing process.
Auditing Process. After training the auditor model, we randomly sample a particular speaker’s (u’s)
audios A_u ∼ D_users. With the same data collection step, we query our target ASR system and the
system’s reaction function R_tar with A_u. Then we apply the same feature extraction and pass this
user-level record to the audit model. Our auditor 𝒜 determines whether u ∈ U_tar ← A_tar holds.
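A minimal sketch of the user-level aggregation and the Random Forest auditor is given below, using standard NumPy and scikit-learn calls; the synthetic numbers only stand in for the shadow-system outputs and are not experimental data.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def user_level_features(record_features):
    # Aggregate n record-level feature vectors (an n x 9 array) for one user into
    # the simple statistics listed above: sum, mean, median, min, max, std, variance.
    x = np.asarray(record_features, dtype=float)
    return np.concatenate([x.sum(0), x.mean(0), np.median(x, 0),
                           x.min(0), x.max(0), x.std(0), x.var(0)])

# Toy data: 200 shadow users, five record-level samples of nine features each.
rng = np.random.default_rng(0)
X = np.stack([user_level_features(rng.random((5, 9))) for _ in range(200)])
y = rng.integers(0, 2, size=200)            # 0 = "member", 1 = "nonmember"

auditor = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(auditor.predict(X[:3]))               # audit three user-level records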

Figure 5.2: Data collection for the auditor with no-label black-box access.

Data Collection

Fig. 5.2 depicts the data collection step for the auditor with no-label black-box access. Querying the ASR
system with an input audio (x, y), we collect an indirect output of the ASR model’s prediction. First of
all, to make sure the system’s reaction is searching online, the true transcription y should not contain
any specific intents, such as opening a specific application or setting an alarm at a specific time. In this setting, the
reaction function R is the search engine linked to the ASR model. The data collection then
proceeds in four steps. First, query the ASR system (F and R) with the audio x. Second, make
sure the system’s reaction is searching online for the predicted transcription and that only the search results
r^{y′} are provided by the system. Herein, y′ is the translation predicted by the ASR model F and hidden by
the system. Third, using the same search engine R, query the true transcription y and obtain the results r^{y}.
Finally, information about this input audio and the target model’s behavior is collected as (x, y, r^{y′}, r^{y})
for the following feature extraction step.

Feature Extraction
Given an input audio (x, y), we can collect (x, y, r^{y′}, r^{y}) after the data collection step. Specifically, two
types of search results are extracted from the search results r, denoted as r1 and r2.
Herein, r1 contains the titles of the top three search results, while r2 contains the titles and corresponding
related content of the top three search results.
The intuition of membership inference is to learn the target model’s different behaviors when querying its
training samples or other samples. Observing the model’s behavior is key to the success of membership
inference. Normally, for an ASR model, the better the model behaves, the smaller the differences are
between the query audio’s true text and the translated text. To expose the model’s performance under
no-label black-box access, we compare y with r^{y′}, y with r^{y}, and r^{y} with r^{y′}. Since the lengths of
these three pairs of strings are quite different, we use a fuzzy string matching method to calculate their
similarity [205]. Table 5.1 details all nine features. Except for the first feature (speed), the remaining
eight features capture the model’s performance indirectly.

Table 5.1: The nine features extracted from the system’s reaction for each query audio

• Speed: the user u’s speaking speed.
• fuzz_y_r1y': take out the common tokens of y and r1^{y′}, then calculate the Levenshtein distance similarity ratio between the two strings.
• fuzz_y_r2y': take out the common tokens of y and r2^{y′}, then calculate the Levenshtein distance similarity ratio between the two strings.
• fuzz_y_r1y: take out the common tokens of y and r1^{y}, then calculate the Levenshtein distance similarity ratio between the two strings.
• fuzz_y_r2y: take out the common tokens of y and r2^{y}, then calculate the Levenshtein distance similarity ratio between the two strings.
• extract_r1y_r1y'_top: format r1^{y} as a string and r1^{y′} as a vector of strings of size 3; return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the top similarity score.
• extract_r2y_r2y'_top: format r2^{y} as a string and r2^{y′} as a vector of strings of size 3; return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the top similarity score.
• extract_r1y_r1y'_sum: format r1^{y} as a string and r1^{y′} as a vector of strings of size 3; return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the sum of these similarity scores.
• extract_r2y_r2y'_sum: format r2^{y} as a string and r2^{y′} as a vector of strings of size 3; return the strings along with a Levenshtein distance similarity score out of the vector of strings and record the sum of these similarity scores.

5.4 Experimental Evaluation and Results


5.4.1 Experimental Setting
Dataset Description

The LibriSpeech corpus is one of the best-known speech corpora for building and evaluating ASR systems. It contains 1,000
hours of English speech sampled at 16 kHz. The content of this corpus is mainly read books
derived from audiobooks — one part of the LibriVox project [180]. Thus, this audio content does not have
any specific intents that trigger the target system’s reaction other than searching online. We use the 100-hour clean
training set to establish our target ASR model. Then, by querying the target
system with the 360-hour clean speech set, a proper set of speech samples is selected to build our shadow model, as described in the previous section.

Target System

The target system in no-label black-box setting contains a target ASR model and a system’s reaction
function.
Our target model is a speech-to-text ASR model. The inputs are a set of audio files with their
corresponding transcriptions as labels, while the outputs are the transcribed sequential texts. To simulate
most of the current ASR models in the real world, we created a state-of-the-art hybrid ASR model [158]
using the PyTorch-Kaldi Speech Recognition Toolkit [159]. In the preprocessing step, fMLLR features
were used to train the ASR model with 24 training epochs. Then, we trained an ASR model using a deep
neural network with four hidden layers and one Softmax layer. We experimentally tuned the batch size,
learning rate, and optimization function to obtain a model with better ASR performance. To mimic an ASR
model in the wild, we select an audio set for the shadow model’s training process only if the reaction to
the audio’s translated text is similar to the reaction to its true text.
We assume the reaction is searching online, and the reaction function is regarded as a search
engine. In our experiment, we use Google search through the Chrome browser as the search engine embedded in the target system.
Specifically, we use ChromeDriver 88.0.4324.96 for automated batch searching.
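A simplified sketch of the automated batch searching with Selenium is shown below; the result-page selector ('h3') is an assumption about the search page layout and may need adapting, and the query string is only an example.

from selenium import webdriver
from selenium.webdriver.common.by import By

def top_titles(driver, query, limit=3):
    # Submit the query to the search engine and return the top result titles.
    driver.get("https://www.google.com/search?q=" + query.replace(" ", "+"))
    return [h.text for h in driver.find_elements(By.TAG_NAME, "h3")[:limit]]

driver = webdriver.Chrome()          # assumes a matching ChromeDriver is installed
print(top_titles(driver, "what is the weather today"))
driver.quit()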

The Baseline Method

One popular baseline method is random guessing, used by Shokri et al. [38], Salem et al. [43], and Li
and Zhang [204]. Specifically, the membership inference model is a binary classifier. We evaluate the
membership inference model on a dataset randomly sampled from D_tar, where the model is trained with
the same number of member records and nonmember records. In such a case, random guessing
should achieve around 50%.
We measure the area under the Receiver Operating Characteristic (ROC) curve (AUC) as the evaluation
metric for the membership inference model. We adopt AUC instead of the ROC curve since the AUC is
threshold independent.
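For reference, the threshold-independent AUC can be computed with scikit-learn as in the minimal sketch below; the labels and scores are illustrative placeholders.

from sklearn.metrics import roc_auc_score

# y_true: ground-truth user-level labels (0 = member, 1 = nonmember);
# y_score: the auditor's predicted probability of the "nonmember" class.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.20, 0.35, 0.80, 0.60, 0.40, 0.70]
print(roc_auc_score(y_true, y_score))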

5.4.2 User-Level Auditor with No-Label Black-box access


Assume a user u has n audio recordings. We query the shadow system with these recordings and obtain
n record-level pieces of information to represent the shadow ASR model’s behavior. Once all n record-level
pieces of information are analyzed with the statistical analysis mentioned previously, the shadow ASR model’s behavior
on this user u is recorded as one training sample for our auditor. In this case, we say the training set of
our auditor is generated with n samples per user. Both the auditor’s training process and auditing process
use the same data collection and feature processing methods. Therefore, the testing set of this auditor
is also generated with n samples per user. Additionally, because of the statistical analysis, at least two audio
recordings of one user are queried to the target system to generate the user-level features (n ≥ 2).
Figure 5.3 shows the auditor’s performance using different numbers of samples per user. We evaluate
the performance while varying the number of samples per user from two to six. Herein, we assume the
shadow ASR model and the target ASR model use the same algorithm (LSTM). For each experiment, 300
user-level records were sampled and generated for our auditor’s training step. We repeated the experiment
100 times and averaged these results to reduce deviations in performance. The results show that the more
audios for each user are used to audit their membership, the more accurately our audio auditor performs.
Specifically, the highest AUC score reaches 74.04% using six samples per user. Considering users’
convenience and the performance of our auditor, we recommend five samples per user, for which the auditor’s
AUC score is also over 70%.
Figure 5.4 evaluates the effect of the training set size used to train the user-level auditor. We
trained the audit model with small sets of users and relatively large sets of users. The small training
sets contain 20, 40, 60, 80, and 100 users, while the large training sets contain 200, 300, 400, 500,
600, 700, 800, 900, and 1,000 users, respectively. The testing set used to query these audit models is fixed at
100 test audio records. The user-level samples in the training set were randomly sampled and processed
from the outputs of the shadow system. The shadow ASR model uses the same algorithm (LSTM) as
our target ASR model. For both the shadow and target systems, the user-level samples are generated with five
record-level samples per user. To eliminate trial-specific deviations, we repeated each experiment 100
times and averaged the results. As shown in Figure 5.4, the model performs better when the number of
user-level records within the training set increases. When 150 users are used in the auditor’s training set, the
performance achieves its highest AUC score (73%). When the number of users is over 150, the audit model
performs well and the AUC score stabilizes at around 73%. Overall, the more users that are used to train the
audit model, the more accurately a user’s membership within the target model can be determined.

Figure 5.3: The audit model’s performance by the number of audios for each user.
Figure 5.4: The audit model’s performance across the training set size.

5.4.3 Model Independent User-Level Auditor


The previous experiments build the shadow model using the same algorithm used by the target model and
using the training set with the same distribution as the target model’s. When we query the target system
with black-box access, the algorithm used to train the target model is unknown. Thus, different algorithms
are used to build the shadow models to evaluate a model-independent auditor. Assume the target ASR
model is trained with a four-layer LSTM network. We build three shadow models using different kinds of
networks, including LSTM, RNN, and GRU. For these shadow models, the training sets are all sampled
and used following the same training process. When the shadow model is trained with a
four-layer LSTM network, the corresponding auditor is named the LSTM-based auditor. Accordingly, three
auditors, namely the LSTM-based, RNN-based, and GRU-based auditors, are used to audit
the same target ASR model.
Figure 5.5 evaluates the effect of different shadow models used to train the user-level auditors. Based
on previous conclusions, we collect the results of five records per user and extract the user-level features
for each user. Similar to the previous experiments, 100 users’ audio recordings are used to query the
target system and processed as the auditor’s testing set. Different training sizes of the user-level data
were randomly sampled from the outputs of the shadow system. To eliminate trial-specific deviations,
we repeated each experiment 100 times and reported the averaged results. As shown in Figure 5.5, the
performance of the different auditors shows the same trend in auditing user-level membership. Specifically,
the AUC score increases as the training set size of the auditor grows to around 150. Then the
upward trend gradually slows down until it stabilizes. The highest averaged AUC score even reaches
75.31% when the RNN-based auditor audits the target ASR system. We conclude that our
auditor is robust to different ASR models.

Figure 5.5: Effect of different shadow models used to train the user-level auditors.

5.5 Conclusion
This work proposed an auditor for an ASR model in IoT voice services under no-label black-box access.
We investigate user-level membership inference assuming that even the translated text is not
provided explicitly. Instead, the translated text of the ASR model is implicitly passed on to cause the
system’s reaction. Our auditor broadens the boundary of membership inference by relaxing the label-only
membership inference assumption [204]. Based on the reaction information, our auditor tries to learn
the pattern of the ASR model’s behavior when querying its known data and its unknown data. We extract
nine features from the system’s reaction to each audio recording and perform statistical analysis to obtain
the user-level features. As we have shown, both the size of the user base and the number of audio samples per user
used in the testing set have a positive effect on our audit model against the target ASR model. Our auditor
is quite robust in auditing different target ASR models. Specifically, the highest AUC score reaches
around 75%. Examining other performance factors and extending possible defenses against auditing are
worth further exploration.

Chapter 6

FAAG: Fast Adversarial Audio


Generation through Interactive Attack
Optimisation

Automatic Speech Recognition (ASR) services inherit deep neural networks’ vulnerabilities, such as crafted
adversarial examples. Existing methods often suffer from low efficiency because the target phrases are
added to the entire audio sample, resulting in a high demand for computational resources. This chapter
proposes a novel scheme named FAAG, an iterative optimization-based method to generate targeted
adversarial examples quickly. By injecting the noise over the beginning part of the audio, FAAG generates
high-quality adversarial audio with a high success rate in a timely manner. Specifically, we use the audio’s logits output
to map each character in the transcription to an approximate position in the audio’s frames. Thus, an
adversarial example can be generated by FAAG in approximately two minutes using CPUs only and
in around ten seconds with one GPU, while maintaining an average success rate of over 85%. Moreover,
the FAAG method speeds up the adversarial example generation process by around 60% compared with the baseline
method. Furthermore, we find that appending benign audio to any suspicious
example can effectively defend against the targeted adversarial attack. We hope that this work paves the
way for inventing new adversarial attacks against speech recognition under computational constraints.

6.1 Introduction
Automatic speech recognition (ASR) technologies have enabled the transformation of human spoken
language into text. In recent years, with the development of advanced deep learning techniques, the effi-
ciency and effectiveness of ASR systems have been enhanced and offered as Deep-Learning-as-a-Service. The ASR service has
become an increasingly popular human-machine interface due to its accuracy, efficiency, and convenience.
The number of devices with voice assistants is estimated to reach 8.4 billion by 2024 from the current 4.2
billion globally [206]. The value of the global ASR market will be over USD 21.5 billion by 2024 [206].
International corporate giants like Microsoft, Google, IBM, and Amazon, are heavily investing in new
technologies to expand their market shares. Devices and applications integrating the acoustic systems
are ubiquitous, like Amazon Echo, Google Assistant, and Apple’s Siri, enabling the full potential of
intelligent voice-controlled devices, voice personal assistants, and machine translation services [173, 207].

Hence, the security problems associated with ASR systems are worth millions of dollars.
With the advancement of deep neural networks, ASR systems have become increasingly prevalent
in our daily lives [144, 145, 158, 175]. Despite ASR’s popularity, the security risk and the adversarial
attack against ASRs have raised concerns in the security community [192, 189, 199, 208, 209, 210].
The community has confirmed that ASR systems inherit vulnerabilities from neural networks [211].
For example, neural network models are vulnerable to adversarial examples [13, 212]. State-of-the-
art ASR systems consisting of deep neural network structures can be fooled by adversarial examples
[213]. Machine learning-based cybersecurity has become an important challenge in various real-world
applications [214, 215, 216, 217, 218].
Existing research shows that well-crafted adversarial audio can lead an ASR system to misbehave
unexpectedly. There are two types of adversarial attacks — targeted attacks and untargeted attacks.
Untargeted attacks against ASR systems can damage the performance of the ASR system. Abdullah et
al. [219] forced an ASR system to transcribe the input audio into incorrect text. Targeted attacks against
ASR systems not only cause low accuracy in transcription but also inject the attacker’s desired phrases
without being recognized. Carlini et al. [92] leveraged noise-like hidden voice commands to embed
commands into a normal audio example so that users can only hear a meaningless noise, but the ASR
system can execute the hidden commands. The DolphinAttack further crafted the audio to make the
embedded command inaudible and imperceptible to human beings. Carlini and Wagner [94] proposed an
interactive optimization-based method to enable an adversarial example generated in a small distortion
of Decibels (dB). Qin et al. [96] improved the method using the psychoacoustic principle of auditory
masking to generate unnoticeable noise.
Although various methods have been proposed to generate high-quality adversarial audio, these
methods may not perform as well as expected under certain conditions. First, all those methods
use the complete audio to generate the adversarial example. However, as users’ security awareness has
gradually increased, an adversarial audio may not be played completely if a user notices the
anomaly. When the audio is not played completely, the success rate of the attack decreases
significantly. Thus, an adversarial audio generation method based on a part of the target audio is more
powerful than one based on the complete audio. The shorter the part of the audio used, the higher
the chance that the attack is successful. Second, all previous methods for generating adversarial audio
use multiple GPUs. However, with limited resources, e.g., when only CPUs or just one GPU
can be used, previous attacks can be more time-consuming than expected. Especially for a batch of
adversarial examples, a fast adversarial audio generation method could be more dangerous.
This chapter aims to find an effective and efficient method for adversarial attack under white-box
access to the target end-to-end ASR system. Our method aims to improve the existing popular method
in [94] based on previous concerns. We propose to modify a part of an audio example, instead of its
whole frame. To embed any phrase, including voice commands, into an audio example, the beginning part of
the audio needs to be long enough to carry the target phrase. Only a space separating the target
phrase from the remaining transcription text is needed to ensure the ASR system can understand the
target phrase. Thus, our method’s key task is to find a proper length of the frame at the proper position
of the audio. State-of-the-art ASR systems can filter out some noise and rectify some contextual
errors recognized by the system using their language model. Therefore, it is difficult to hide any phrase
correctly as a part of the original audio’s transcription. A fixed length at the beginning part of the audio is the
best solution. It is essential to find a fixed length of frames according to the target phrase and the original
audio. Otherwise, there is not enough space to embed the target phrase, resulting in a low success rate.
Specifically, the proper clip of the audio frame is selected by mapping each word in the transcription to a
rough position in the audio’s frames according to the logits output. This audio clip is subsequently used to
generate the adversarial example based on the interactive optimization-based method proposed in [94].
The contributions of this work can be summarized as follows:

• We propose a new scheme, called Fast Adversarial Audio Generation (FAAG), developing a new
optimisation algorithm based on an interactive attack strategy. According to different phrases and
provided audios, FAAG can automatically select a proper length of the frame at the beginning of
the audio. The shortest ratio of the frame used for adversarial example generation can reach
14.79%.

• We develop a fast adversarial example generation method with a satisfactory success rate and tolerable distortion.
Under limited resources, our method takes half an hour using CPUs only to generate ten
adversarial examples, and around two minutes using one GPU. Both are faster than previous
attacks using the same resources, specifically speeding up generation time by around 60%.

• The empirical study provides us with two new observations: (1) different words in target phrases
will not significantly affect the performance of our adversarial examples; (2) a target phrase with
fewer words has a slight positive boost on the adversarial example generation, compared to a target
phrase containing more words.

• The target phrase can only be hidden at the beginning of the original audio; otherwise, the transcrip-
tion of the phrase part will have low accuracy. Conversely, appending a benign audio at the
beginning of any suspicious audio can effectively protect the service from the targeted adversarial
attack.

The rest of the chapter is organized as follows: Section 6.2 introduces the background about the target
ASR model and related adversarial attacks against the ASR system. Section 6.3 illustrates the details of
our method to generate audio adversarial examples with limited resources. Section 6.4 shows the setup
and the results of our experiments. Section 6.5 discuss different positions of audio using our method and
countermeasures. Finally, Section 6.6 concludes the chapter.

6.2 Related Work


This section provides brief introductions to the state-of-the-art Automatic Speech Recognition (ASR)
models and the related work about adversarial attacks on ASRs.

6.2.1 The Automatic Speech Recognition Model


While conventional ASR models are based on hidden Markov models (HMMs), current state-of-the-art
ASR models utilise deep neural networks (DNNs). Our audio adversarial attack targets a state-of-the-art
DNN-based ASR system — an end-to-end ASR system [174]. Assuming white-box access, we evaluate
our attack using an ASR model downloaded from the popular open-source ASR system DeepSpeech.

Figure 6.1: An end-to-end ASR system.

Figure 6.2: The overall process of generating the targeted adversarial attack.

End-to-end ASR systems in Baidu’s DeepSpeech implemented by Mozilla are sequence-to-sequence


neural network models [165, 220]. Unlike other typical hybrid ASR systems, the end-to-end system
predicts word sequences that are converted directly from individual characters from the raw waveform. As
shown in Fig. 6.1, the end-to-end system is a unified neural network modeling framework containing three
main components, including a feature pre-processing step, a neural network model as the probabilistic
model, and a decoder to refine the final outputs.
The feature pre-processing step uses Mel-Frequency Cepstral Coefficients (MFCC) to represent the raw
audio data via the Mel-Frequency Cepstrum (MFC) transformation. The whole waveform is split into
multiple overlapping frames with a sliding window, and MFCC features are extracted for each frame.
Each audio frame is then fed into the probabilistic model. Herein, Recurrent Neural Networks (RNNs)
are popular in end-to-end ASR systems, where an audio waveform is mapped to a sequence of characters
[165]. However, the sequence of character outputs c_i^N does not directly correspond to the sequence of
word outputs q_i^M. Thus, a decoder is used to re-evaluate the character output. Connectionist Temporal
Classification (CTC) [221] is a powerful method for handling the unknown alignment between the input
and output sequences. DeepSpeech uses CTC as a decoder to score the character output and map it to a
word sequence by de-duplicating sequentially repeated characters.
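
As a concrete illustration of this decoding step, the sketch below performs greedy CTC decoding — pick the most likely symbol per frame, merge adjacent repeats, then drop the epsilon (blank) symbol. The alphabet and blank index here are illustrative assumptions, not DeepSpeech's exact configuration.

import numpy as np

# Illustrative alphabet: index 0 is the CTC blank (epsilon), 1 is space, 2-27 are 'a'-'z'.
ALPHABET = ["<eps>", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]

def ctc_greedy_decode(logits: np.ndarray) -> str:
    """Greedy CTC decoding: pick the best character per frame,
    merge sequentially repeated characters, then drop epsilons."""
    best_per_frame = logits.argmax(axis=1)                     # one index per window/frame
    merged = [idx for i, idx in enumerate(best_per_frame)
              if i == 0 or idx != best_per_frame[i - 1]]       # merge repeats
    return "".join(ALPHABET[idx] for idx in merged if idx != 0)  # drop epsilons

# Toy example: 6 frames over the 28-symbol alphabet.
rng = np.random.default_rng(0)
toy_logits = rng.random((6, len(ALPHABET)))
print(ctc_greedy_decode(toy_logits))
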

6.2.2 Adversarial Attack on ASRs
Audio adversarial attacks on ASR systems have recently become popular, covering both targeted and
untargeted attacks [207]. These attacks assume either white-box or black-box access to the target model.
Generally, an audio adversarial attack aims to generate an adversarial audio example that deceives the
ASR model without the user's awareness. When the ASR model simply mistranscribes an audio example,
the attack is untargeted; when the ASR model transcribes an audio example into a phrase designed by the
attacker, the attack is targeted.
Unlike in the image domain, targeted adversarial attacks are much more dangerous against ASR systems
than untargeted ones. Apart from a few exceptions, such as an untargeted adversarial attack that forces
mistranscription in [219], most attacks focus on targeted adversarial attacks, where the target phrase is
usually a common voice command [92].
Targeted adversarial attacks with white-box access can generate adversarial examples of high quality.
Specifically, [94] generated adversarial audio examples with only slight distortion on the DeepSpeech
model. CommanderSong [222] can embed the desired voice commands into any song stealthily. SirenAttack
generates adversarial audios under both white-box and black-box settings [173]; under the white-box
setting, it applies a fooling-gradient method to find the adversarial noise, with a success rate reaching
100%. With full knowledge of the ASR model's details, [223] reverses perturbed MFCC features into
adversarial speech, while [224] generates adversarial audios by modifying the raw waveform directly with
an end-to-end scheme. Furthermore, adversarial examples can be generated with different systems and
different features according to [225]. The audio adversarial examples in [96] leverage the psychoacoustic
principle of auditory masking.
Targeted adversarial attacks with black-box access are more practical than white-box attacks. With
little knowledge of the ASR system, Hidden Voice Commands generates noisy commands by repeatedly
querying the model [92]; as a result, the semantics of the generated adversarial audio are difficult for
people to understand. DolphinAttack exploits the non-linearity of microphones to generate inaudible voice
commands [95]. Under the black-box setting, SirenAttack proposes an iterative, gradient-free method
[173]. The works in [226] and [227] consider genetic algorithms and gradient estimation to modify
the original audio under black-box access. The work in [158] generates adversarial examples based on
psychoacoustic hiding under black-box access, which can embed a malicious voice command into any
audio. Devil's Whisper [172] proposes a general adversarial attack against ASR systems by training a
local model under white-box access.
In contrast to the related work, this chapter focuses on the efficiency of adversarial example generation
against speech recognition. We propose to modify only part of an audio example through interactive attack
optimisation to guarantee a high success rate, low distortion, and high generation speed.

6.3 Generating Audio Adversarial Examples


State-of-the-art audio adversarial attacks can generate high-quality adversarial audio with a high success
rate. However, two limitations are neglected. Firstly, all previous attacks assume the attacker has multiple
GPUs and abundant time to generate such high-quality adversarial audio; once the computing resources
are less abundant, the time spent on the attack increases significantly, whereas a successful audio
adversarial attack requires timely action in the real world. Secondly, as users' awareness of security
and privacy grows, even slight noise within the audio may be noticed. In such cases, the shorter the
adversarial segment that can be recognized as the target phrase, the more powerful the adversarial audio is.
We propose the FAAG method to hide the target phrase in a small piece of adversarial audio quickly, with
only limited computing resources.

6.3.1 Threat Model


Given an audio waveform x and a target transcription t, we aim to construct an adversarial audio
x′ = x + δ. Assuming the target ASR system transcribes the audio x into text y (y = F(x)) and the
audio x′ into text y′ (y′ = F(x′)), we expect the target transcription t to be a sub-string of y′.
Additionally, the audio x and our adversarial audio x′ should sound normal to human beings. We
formulate the similarity based on the difference of the distortion in dB between the original audio x and
the crafted adversarial audio x′, which is represented by the noise δ, following [94]. The dB difference
dB_x(δ) between the modified adversarial example and the original audio reflects the relative loudness of
the added noise compared to the original audio. The smaller dB_x(δ) is, the more similar the two
audio examples sound. When the loudness of the noise is small enough, the noise can be ignored, so the
adversarial audio can be transcribed by the ASR system without humans being aware of the attack.
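
A minimal sketch of these loudness measures follows, using the convention dB(x) = 20 · log10(max_i |x_i|) and dB_x(δ) = dB(δ) − dB(x) as in Algorithm 2 and [94]; the toy waveform and noise level are illustrative assumptions only.

import numpy as np

def dB(x: np.ndarray) -> float:
    """Relative loudness of a waveform in decibels: 20 * log10(max |x_i|)."""
    return 20.0 * np.log10(np.max(np.abs(x)) + 1e-12)

def distortion_dB(x: np.ndarray, x_adv: np.ndarray) -> float:
    """dB_x(delta): loudness of the added noise relative to the original audio."""
    delta = x_adv - x
    return dB(delta) - dB(x)

# Toy example: one second of 16 kHz audio plus faint noise.
rng = np.random.default_rng(0)
x = 0.5 * np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
x_adv = x + 0.001 * rng.standard_normal(16000)
print(f"dB(x) = {dB(x):.2f}, dB_x(delta) = {distortion_dB(x, x_adv):.2f}")

A more negative dB_x(δ) means the noise is quieter relative to the original audio and therefore harder for a listener to notice.
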
We assume that the adversarial audio is generated with white-box access to the target ASR model.
Herein, the attacker has complete knowledge of the ASR model, including its structure and its parameters.
In this chapter, we choose to use Baidu’s DeepSpeech model [165]. DeepSpeech includes three parts
— an MFC conversion for audio preprocessing, RNN layers to map each input frame into a probability
distribution over each character, and a CTC loss function to measure the RNN’s output score. The core of
the DeepSpeech model is an optimized RNN trained by multiple GPUs on over 5,000 hours of speech from
9,600 speakers. The RNN layers finally output logits that are computed over the probability distribution
of output characters.
We do not consider over-the-air audio transcription because many live settings may jeopardize the
experiments; we discuss adapting our method to the real world in Section 6.5. FAAG only modifies a
small portion of the given audio example instead of the whole waveform, which makes it challenging to
compare FAAG with other methods using live transcription. Furthermore, our adversarial examples are
validated by transcribing the waveform directly. We treat the adversarial attack as successful if the output
transcription y′ includes the target phrase t correctly, denoted by t ∈ y′.

6.3.2 Fast Adversarial Audio Generation (FAAG)


FAAG generates an adversarial audio clip within a short period, even with limited computing resources,
while embedding the attacker's desired command (the target phrase) in a short clip of this adversarial
audio. To shorten the adversarial clip within the whole audio, we look for a proper position and length
of frames in which to embed our target phrase with a high success rate and low distortion. We find that it
is unnecessary to transcribe the whole audio waveform as the selected phrase: using a longer waveform
does not lead to less distortion or a higher success rate when generating an adversarial example, but it
does slow down the generation. By constructing the targeted phrase at the beginning of the adversarial
example and separating it from the rest of the transcription with a long space, the ASR model can still
recognize the targeted phrase correctly.

Algorithm 1: Select the Proper Clip x_begin
Require: Original audio x; target phrase t; pre-trained ASR model F; step s; fine-tune variable λ
Ensure: The selected audio clip x_begin is long enough for adversarial example generation.
1: y = F(x)
2: c = f(x)
3: t = t + ‘ ’
4: | · | denotes the number of characters in a text or the frame length of an audio segment.
5: Initialize λ = 0
6: |t_allocated| = |t| + λ
7: |x_begin| ≥ (|c| / |y|) × |t_allocated| × s
8: index = |x_begin|
9: x_begin ← x[: index]
10: x_rest ← x[index :]
11: return Two clips x_begin and x_rest

Compared to the prior work on targeted attacks on speech-to-text [94], we generate the audio
adversarial example based on the beginning part of a long audio waveform instead of the whole waveform.
Fig. 6.2 illustrates our adversarial example generation, focusing on a targeted adversarial attack on the
DeepSpeech model. In general, given an audio waveform x whose transcription by the target ASR model
is y, our adversarial attack can be summarized in three steps.
Step 1: Based on any chosen short phrase t, we select the proper frames at the beginning of x as xbegin
to add noise δ. We choose the beginning of the audio because no prior noise would affect the accuracy,
and the effect of the subsequent clip’s noise could be limited. Herein, we describe the phrase t as a short
phrase when its length is shorter than the length of the given transcription y, denoted by |t| < |y|.
Step 2: We construct the inaudible noise δ with an iterative, optimization-based attack, so that
x_begin + δ is recognized by the ASR model as the phrase t followed by a specific conjunction (e.g., ‘and’)
or a long space. Herein, a long space means a silence recognized by the ASR model. This is necessary
so that the part of the transcription y′ excluding the phrase t does not affect the model's understanding of
our chosen phrase t.
Step 3: We combine x′_begin = x_begin + δ with the rest of the frames of x (named x_rest = x − x_begin)
so that the adversarial example x′ = x′_begin + x_rest sounds similar to the original audio x. In addition,
the adversarial example x′ is recognized as y′ by the ASR model, where t ∈ y′. To evaluate the success of
the adversarial example, we calculate the character error rate (CER) of t in y′.

Selecting the proper frames at the beginning of a given audio.

We define the proper frames xbegin to satisfy three conditions: 1) the frames used to generate
adversarial examples should be at the beginning of the original audio x; 2) the length of the frames should
be long enough to cover the target phrase t correctly; 3) the generated adversarial examples should have
relatively small distortion. Thus, we choose the frames of audio x corresponding to the first n words of
its transcription y, where the number of words in the target phrase is n = len(t). To meet the second
condition, we consider the frame length of each logit and the amount of logits for n words. Therefore,
we need to find out the relationship between the input audio x, its logit output c and its corresponding
transcription y. As for the third condition, we add a variable λ to fine tune the length to result in a small
distortion.

Figure 6.3: Diagram of DeepSpeech model transcribing an input audio x as its transcription y.

As for the first condition, generating the adversarial examples as the beginning parts of the audio
has clear advantages. Firstly, as the beginning parts of the audio, the target phrase can be recognized by
the ASR model with less effect from the remaining original audio frames than if it were inserted in the
middle of the audio. Secondly, it is easier for the ASR model to recognize and execute the target
phrase, especially when the target phrase is a kind of command. For example, when the victim plays
our crafted adversarial audio, and the ASR model recognizes a command hidden at the beginning of the
audio, it is more likely to execute the command regardless of the remaining audio’s meaning. However,
there are some special cases. For example, when the target phrase contains a trigger word of an ASR
system, the ASR system will only listen to the sentence behind that trigger word. The position of the
adversarial example hidden in the original audio is then not that important. Thus, we also consider the
case of generating an adversarial audio clip at the middle and at the end of the original audio in later
experiments. To shed light on the method of our attack, we take the beginning position as an example.
To satisfy the second condition, we need to clarify the relationship between x, c, and y, which requires
understanding the mechanism of the target ASR model. The target ASR model in this work is Baidu's
DeepSpeech model [165], specifically an end-to-end speech-to-text model implemented by Mozilla.
Fig. 6.3 shows the diagram of the DeepSpeech model transcribing an input audio x as its transcription y.
The input audio x is first split from a whole waveform into several overlapping windows with a window
size w; each window slides to the next with a step length of s. After the MFC transformation, the RNN
model f(·) in DeepSpeech maps each output logit c_i ∈ c to a probability distribution over characters for
each window frame, so the number of logits matches the number of windows. The character c_i ranges
over ‘a’ to ‘z’, the white space, and the ‘−’ symbol, which represents the epsilon value ε in CTC decoding.
The CTC decoder C(·) then outputs the sequence of characters y with the highest overall probability,
merging repeats and dropping epsilons. To decode a vector c into a transcription y, the best alignment is
found by Equation 6.1 [94].

C(x) = argmax_y Pr(y | f(x))    (6.1)

We can summarize the relationship between x, c, and y in line with the mechanism of the target ASR
model. The whole frame of input audio x is split into several overlapping windows with a step of length
s. The RNN model f(·) in DeepSpeech maps each output logit c_i to a probability distribution over
characters within each window frame. Then, we can express the relationship between the output logits c
and the whole frame of the input audio x as Equation 6.2. Herein, | · | represents the number of characters
within a text (i.e. |c|) or the length of frames in the audio (i.e. |x|).

|x| = |c| × s + (|x| mod s) (6.2)

Naturally, the more characters the transcription has, the longer the output logit sequence. There is
a positive correlation between the output logits c and the transcription y, which we summarise as
Equation 6.3.

ρ(|c|, |y|) > 0    (6.3)

Translating audios with the same ASR model, we assume that the relationship between our generated
adversarial audio clip x′_begin, its output logits c′, and its transcription y′ is the same as that between
x, c, and y. According to Equation 6.3, ρ(|c′|, |y′|) > 0. However, the transcription y′ is different from
the original transcription y. Between the output logits and the transcription, the CTC decoder C(·) merges
repeats and drops epsilons within the logits to obtain the final transcription. The number of repeats and
epsilons within the logits varies with the speaker's speaking habits and the ASR model's window size.
Although the window size is fixed, the speaker's speaking habits are hard to control across recordings.
We assume that ρ(|c′|, |y′|) ≈ ρ(|c|, |y|). To simplify the experiment, we refine this relationship into
the following equations.
|c| / |y| = |c′| / |y′|    (6.4)

|x′_begin| = |c′| × s + (|x′_begin| mod s)    (6.5)
           = (|c| / |y|) × |y′| × s + (|x′_begin| mod s)    (6.6)
Analogous to Equation 6.2, we know the relationship between the frame length of x′_begin and the number
of output logits. Combining this with Equation 6.4, we can find the frame length of x′_begin as per
Equation 6.6. Assuming our adversarial example generation is successful, the transcription y′ of our
adversarial audio clip is the same as our target phrase t. The frame length of the selected audio clip should
equal the frame length of our generated adversarial audio clip (|x_begin| = |x′_begin|). Thus, we can
find a proper minimum frame length for adversarial audio generation from Equation 6.6 to meet the second
condition. At the least, we know the range of proper frame lengths selected from the beginning of the
original audio, as shown in Equation 6.7.

(|c| / |y|) × |t| × s + s > |x_begin| = |x′_begin| ≥ (|c| / |y|) × |t| × s    (6.7)
We need to ensure that the selected length is long enough for adversarial example generation. In this work,
we set |x_begin| = (|c| / |y|) × |t| × s during experiments and use a variable λ to fine-tune the frame
length of the selected audio clip. Using Algorithm 1, the original audio is split into two clips,
x_begin = x[: |x_begin|] and x_rest = x[|x_begin| :]. With a proper frame length selected, not only is
time saved, but the negative effect of the remaining audio on the adversarial example's transcription can
also be neglected. Accordingly, the success rate of the generated adversarial example can be increased.
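
The following is a minimal Python sketch of the clip selection in Algorithm 1, assuming the number of output logits |c|, the hop length s, and the transcription y are already known; the helper name select_begin_clip and the toy numbers below are illustrative assumptions, not our exact implementation.

import numpy as np

def select_begin_clip(x: np.ndarray, y: str, t: str, num_logits: int,
                      step: int, lam: int = 0):
    """Split audio x into (x_begin, x_rest), where x_begin covers roughly the
    first |t| + lam transcription characters, following Equation 6.7."""
    t = t + " "                                        # trailing space, as in Algorithm 1
    allocated_chars = len(t) + lam                     # |t_allocated| = |t| + lambda
    frames_per_char = num_logits / len(y)              # |c| / |y|
    index = int(frames_per_char * allocated_chars * step)
    index = min(index, len(x))                         # never exceed the audio length
    return x[:index], x[index:]

# Toy usage with made-up numbers: 3 s of 16 kHz audio, 150 logits, 320-sample hop.
x = np.zeros(3 * 16000)
x_begin, x_rest = select_begin_clip(x, y="she had your dark suit in greasy wash water",
                                    t="call john smith", num_logits=150, step=320)
print(len(x_begin), len(x_rest))
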
Apart from guaranteeing the success rate, we also consider selecting a proper frame length to
satisfy the third condition — less distortion in the adversarial audio.

Algorithm 2: Audio Adversarial Example Generation
Require: Original audio x; target transcription t; pre-trained model F; window step s; iter = 1,000
Ensure: len(y) ≥ len(t)
1: Call Algorithm 1
2: return two audio clips: x_begin and x_rest
3: dB(x_begin) = 20 · log10(np.max(np.abs(x_begin)))
4: Optimize δ: con is a constant to narrow down the dB
5: dB_{x_begin}(δ) = dB(δ) − dB(x_begin)
6: for iteration from 1 to iter do
7:     while F(x_begin + δ) ≠ t and dB_{x_begin}(δ) > con do
8:         x_begin ← x_begin − w · sign(∇_{x_begin} ctc_L(x_begin, t))
9:         x′_begin ← x_begin
10:        minimize dB_{x_begin}(δ) + Σ_i w_i · ctc_L(x′_begin)
11:    end while
12:    if F(x′_begin) == t and dB_{x_begin}(δ) ≤ con and iteration mod 100 == 0 then
13:        con ← con × 0.8
14:    end if
15: end for
16: x′ = x′_begin + x_rest
17: time = end_time − start_time
18: Calculate the distortion: dB_x(δ) = dB(x′) − dB(x)
19: Verify: y′ ← F(x′)
20: Success if t ∈ y′
21: return adversarial audio x′; adversarial transcription y′; distortion in decibels dB_x(δ); time

Normally, using the same generation method on an audio clip with a fixed frame length, the more
characters the target phrase has, the more distortion occurs in the generated adversarial audio, because
more characters in the original audio need to be changed. However, when targeting a phrase based on
audio clips of different lengths, a longer clip does not necessarily produce adversarial audio with less
distortion. We introduce a new variable λ to fine-tune the proper frame length at the beginning of the
original audio and discuss it in the next section.

Constructing inaudible noise in the proper frames

Knowing the proper audio segment x_begin, we construct inaudible noise δ and generate
x′_begin = x_begin + δ. Based on the optimization method proposed in [94], we optimize δ according to
the CTC loss function ctc_L(·) with a constraint on dB_x(δ), as summarized in Equation 6.8. Herein, w_i
represents the relative importance of being close to t versus remaining close to x_begin; c_i is a character
of the output logits produced by the RNN model; and dB_x(δ) is the dB difference between the original
audio and the noise. The constant con is initially a large value and is repeatedly reduced to run the
minimization again until the result converges. Finally, the output of this step is the constructed x′_begin.

minimize ‖δ‖₂² + Σ_i w_i · ctc_L(x_begin + δ, c_i)    such that dB_x(δ) ≤ con    (6.8)

According to [94], the minimization problem is solved using the Adam optimizer with a learning rate of
100 and 1,000 iterations. As shown in Algorithm 2, x_begin is updated with the CTC loss function as
x′_begin varies. Every 100 iterations, if the current adversarial example is successful (F(x′_begin) == t
and dB_{x_begin}(δ) ≤ con), we scale the constant con down by multiplying it by 0.8 to search for even
smaller distortions. Different from the distortion calculated in [94], we combine the modified x′_begin
with the remaining audio x_rest, calculate the dB of the whole adversarial example x′ = x′_begin + x_rest,
and then obtain the distortion as dB_x(δ) = dB(x′) − dB(x).
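
To make the optimization in Equation 6.8 concrete, the following self-contained toy sketch in PyTorch substitutes a small, randomly initialized GRU (ToyASR) for DeepSpeech's RNN, folds the dB constraint into a simple ℓ2 penalty, and uses an illustrative frame size, weight, learning rate, and iteration count; none of these names or values reflect the exact configuration used in our experiments.

import torch
import torch.nn as nn

torch.manual_seed(0)
ALPHABET = " abcdefghijklmnopqrstuvwxyz"          # index 0 is reserved for the CTC blank

class ToyASR(nn.Module):
    """Stand-in for the RNN part of an ASR model: waveform frames -> per-frame logits."""
    def __init__(self, frame: int = 160, hidden: int = 64):
        super().__init__()
        self.frame = frame
        self.rnn = nn.GRU(frame, hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(ALPHABET) + 1)   # +1 for the blank symbol

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        frames = wav[: wav.numel() // self.frame * self.frame].view(1, -1, self.frame)
        h, _ = self.rnn(frames)
        return self.out(h).log_softmax(-1).transpose(0, 1)  # (T, N=1, C) for CTCLoss

def encode(text: str) -> torch.Tensor:
    return torch.tensor([[ALPHABET.index(ch) + 1 for ch in text]])  # shift past blank

model = ToyASR()
x_begin = 0.1 * torch.randn(8000)                  # selected clip (0.5 s at 16 kHz), toy data
target = encode("call john smith ")
delta = torch.zeros_like(x_begin, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
ctc = nn.CTCLoss(blank=0)

for it in range(200):                              # far fewer iterations than the real attack
    log_probs = model(x_begin + delta)
    T = log_probs.shape[0]
    loss = ctc(log_probs, target, torch.tensor([T]), torch.tensor([target.shape[1]]))
    loss = loss + 0.05 * delta.pow(2).sum()        # keep the noise small (||delta||_2^2 term)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final CTC loss: {loss.item():.3f}")

In the real attack, the explicit dB constraint con is kept separate from the loss and tightened whenever the current example already transcribes to the target phrase, as described above.
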

Adversarial example generation and evaluation

With x′_begin generated, the adversarial example is obtained by combining this clip with x_rest; the
final adversarial example is x′ = x′_begin + x_rest. The whole process is defined in Algorithm 2. We
define success by verifying the transcription produced by the target ASR model on the adversarial example.
Specifically, the adversarial example x′ is successfully generated when its transcription y′ contains a
phrase t′ that exactly matches the target phrase (t = t′). When t ≠ t′, the adversarial example is generated
without 100% accuracy; in this case, we measure its success rate with the character error rate (CER)
defined in Section 6.4.

6.4 Evaluation
6.4.1 Experimental Setting
We set up a number of experiments to evaluate our proposed adversarial example generation. Ten audio
clips are randomly selected from the TIMIT dataset as target audios for generating adversarial examples,
which are recognized by a pre-trained DeepSpeech model as our target phrase while the change remains
inaudible to human beings. All experiments are conducted on a workstation with an Intel Core X
i9-7960X CPU (16 cores) and 128GB of memory. Some experiments use one TITAN XP GPU in addition
to the CPUs.

Dataset Description

The TIMIT speech corpus is a well-known corpus for building and evaluating ASR systems. 630 speakers
across the United States recorded audio for this corpus, comprising 6,300 sentences [23]. Each speech
waveform is sampled at 16 kHz with 16-bit resolution. In this work, we propose an effective method of
injecting the target phrase into a long waveform. Thus, we randomly select five audio waveforms whose
transcriptions have relatively more words than our target phrases. Herein, we regard a transcription with
more than ten words as having relatively more words. This setting matches our target phrases because
most sensitive commands contain fewer than ten words; for example, the commands “call someone” and
“turn on the airplane mode” contain only a handful of words. Hence, it is reasonable to select these audio
waveforms as our target audios.

Target ASR Model

The DeepSpeech ASR model is our target; it is an end-to-end ASR model trained with the CTC loss
function [165, 220]. Herein, an RNN is the core engine that translates audio into a sequence of text.

Table 6.1: Two sets of target phrases to evaluate the audio adversarial generation.

Two Sets                                Target Phrase                 # of Characters

Target Phrases with Different Words     call john smith               15
                                        call david jone               15
                                        play music list               15
Target Phrases with Different Lengths   call john smith               15
                                        call john                     10
                                        call john smith and david     25

Specifically, the pre-trained model we target is deepspeech-0.4.1-model. This speech-to-text
model is trained on multiple corpora, including LibriSpeech, Fisher, Switchboard, and the English
Common Voice training corpus. It reaches an 8.26% WER when tested on the LibriSpeech dataset. The
training batch size is 24; the testing batch size is 48; the learning rate is 0.0001; the dropout rate is 0.15;
and the number of neurons in the hidden layer is 2,048.

Baseline

Since our adversarial example generation improves on Carlini and Wagner's work [94], we use the
iterative optimization-based method proposed in [94] as the baseline. According to the implementation
in [94], any audio may be translated into any phrase. The baseline applies perturbations over the complete
frames of the original audio and solves the optimization problem using the Adam optimizer with a learning
rate of 10. The default number of iterations to generate an audio adversarial example in this work is 1,000,
while the maximum is 5,000. The baseline can generate targeted adversarial examples with a 100% success
rate and a mean distortion from −31dB to −38dB. Our method only modifies the beginning of the original
audio, whereas the baseline modifies all of the original audio frames. Both FAAG and the baseline are
evaluated on the same workstation for a fair comparison.

Target Phrases

Apart from comparing our generation method with the baseline, we also evaluate FAAG's effectiveness
when injecting different target phrases into audio. Specifically, we define two sets of target phrases, each
containing three phrases, as listed in Table 6.1. The first set evaluates FAAG's performance when injecting
target phrases that each contain three words but differ in their words; we name this set the target phrases
with different words. The other set — target phrases with different lengths — evaluates FAAG's performance
when injecting target phrases of different lengths into the original audio.

Evaluation Metrics

We evaluate our adversarial example generation method from four aspects: the attack's success rate, the
distortion of the noise δ in dB compared to the original audio x, the ratio of frames, and the generation
time under limited resources. The four metrics are described in detail as follows:

• The success rate of injecting the target phrase t into the modified adversarial example is calculated
from the character error rate (success rate = 1 − CER). Assuming that the predicted target phrase is
t′, the CER is the ratio of the number of incorrectly predicted characters in t′ over the total number
of characters in t (a computation sketch follows this list).

• The distortion in dB quantifies the distortion of the modified adversarial example compared to the
original one. Each audio's dB is its relative loudness, represented as dB(x) = max_i 20 × log10(x_i).
The difference in dB between the original audio and the noise is dB_x(δ) = dB(x′) − dB(x) [94].
It is always hard to determine from dB alone whether the modified audio is imperceptible to human
beings; we provide a benchmark of an adversarial example's distortion using the method in [94].
According to [228], a level of around 30dB is similar to the loudness of a whisper.

• The ratio of frames measures the ratio of the selected clip to the complete audio. The larger the
ratio of frames, the longer the audio clip used for adversarial example generation. When the ratio of
frames is 100%, FAAG no longer selects a shorter clip and instead operates on the whole audio,
consistent with the baseline method [94].

• Generation time evaluates our method's efficiency in the generation process. In our work, we only
employ CPUs and one GPU to generate audio adversarial examples; compared with an attack using
multiple GPUs, the attack with CPUs is a relatively time-consuming task.
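
As a concrete illustration of the first metric, the sketch below computes the CER as a character-level edit distance normalized by the target length, which is a common realization of the definition above; the helper names are ours.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def success_rate(target: str, predicted: str) -> float:
    """success rate = 1 - CER, with CER = edit_distance / len(target)."""
    cer = levenshtein(target, predicted) / max(len(target), 1)
    return max(0.0, 1.0 - cer)

print(success_rate("call john smith", "call jon smith"))   # one character dropped -> ~0.93
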

6.4.2 Proper Frame Length Selection


A proper frame length for adversarial audio clip generation should satisfy three conditions. The first
concerns the selected clip's position in the original audio. The second concerns the proper minimum
number of windows mapped to the clip's frame length. The third adds a new variable λ to fine-tune the
frame length for less distortion. We explore the specific impact of these three factors on the frame length
selected for adversarial audio generation.

Proper Minimum Frame Length for Adversarial Audio Clip

We first evaluate our method with respect to the second condition, that is, whether the selected frame
length is long enough to cover the target phrase correctly. According to Fig. 6.3, each window of the
original audio is translated to one logit character. As discussed in Section 6.3, the proper minimum frame
length is |x′_begin| = |c′| × s = (|c| / |y|) × |t| × s from Equation 6.7. Thus, the length of the adversarial
audio clip is determined by the number of characters in the target phrase. Keeping the target phrase
unchanged, we vary the length allocated for the target phrase to alter the number of output logits |c′|;
one fewer character means a few fewer windows. Herein, the smaller frame length is only a few windows
smaller than our selected proper minimum frame length. To evaluate the proper minimum frame length
selection, we compare it with the results of a smaller frame length of the selected audio clip. This
reference frame length is

|x″_begin| = (|c| / |y|) × (|t| − 1) × s.

In addition, to ensure that the remaining phrase does not affect the translation of our adversarial audio
clip, we add a word or a long space after the chosen phrase to form the target phrase.
Table 6.2: Performance of Adversarial Generation on A Target Phrase with Different Frame Lengths. Avg:
average

Setting                  Avg dBδ    Avg Accuracy (%)

x′_begin → t_and         30.87      90.37
x″_begin → t_and         36.28      89.11
x′_begin → t_spaces      38.55      94.7
x″_begin → t_spaces      43.01      93.33
x′_begin → t             39.73      81.19

Thus, along with different numbers of windows, we evaluate the necessity of adding a word or a long
space to the chosen phrase. Herein, a long space means that the number of spaces is larger than one.
We run FAAG and compare the results in five different settings. The target model is
deepspeech-0.4.1-model [229]. One phrase t is chosen as part of the transcription of our adversarial
audio. The word ‘and’ and two spaces are appended to the chosen phrase separately to form two target
phrases, marked as t_and and t_spaces. The first two settings select an audio clip with the proper
minimum frame length |x′_begin| when the target phrases are t_and and t_spaces respectively; we mark
these as x′_begin → t_and and x′_begin → t_spaces. The next two settings select an audio clip with the
smaller frame length |x″_begin| for the same two target phrases; we mark these as x″_begin → t_and and
x″_begin → t_spaces. The fifth setting selects an audio clip with the minimum frame length |x′_begin|
when the target phrase is t with only one space appended, marked as x′_begin → t. Ten audio clips are
selected randomly from the TIMIT dataset, which has a different distribution from the model's training
corpus. For each audio, we repeat the experiment ten times and report the average result.
Table 6.2 shows the average dBδ of the generated adversarial examples and the average accuracy of
translating them under the different experiment settings. Comparing x′_begin → t_and with
x″_begin → t_and, the accuracy decreases and the distortion increases because of the smaller frame length
of the audio clip; the same conclusion follows from comparing x′_begin → t_spaces with
x″_begin → t_spaces. The reason is that, when generating an adversarial example with the same target
phrase, a shorter clip leaves fewer original characters available to be altered into all the characters of the
target phrase, which requires more effort. The proper minimum frame length can be calculated using
Equation 6.7. In addition, comparing x′_begin → t_and and x′_begin → t_spaces with x′_begin → t
shows that a word or a long space must be added to preserve the accuracy of adversarial examples. The
reason is that the last word of the target phrase is easily influenced by the remaining phrase of the original
audio due to the nature of an ASR model. When the target phrase is t_and, the average distortion is
smaller than when the target phrase is t_spaces; we infer that more modification is required to alter a
character into a space symbol than into another character. Meanwhile, with a long space between the
target phrase and the remaining phrase, the translation of the target phrase part is less influenced.

Proper Fine-tune Length for Adversarial Audio Clip

Having found the proper minimum frame length, we evaluate our method with respect to the third
condition. Apart from accuracy, it is important to achieve low distortion in the generated adversarial audio
to hide the attacker's intention. To fine-tune the length of the adversarial audio clip, we explore the
relationship between the allocated frame length of an audio clip |x^allo_begin| and the number of
characters in the target phrase |t| across different audios. Here, we introduce a variable λ to alter the
allocated frame length. Based on Equation 6.7, the allocated frame length can be calculated as

|x^allo_begin| = (|c| / |y|) × (|t| + λ) × s.

Since we start from the right-hand side of Equation 6.7, we only consider positive values of λ. Because of
the relationship stated in Equation 6.4, λ should satisfy |t| + λ ≤ |y|.
To clarify the relationship, we use ratio_frame to measure the ratio of the frame length allocated for
adversarial audio generation to the whole frame length of the audio. To find the best λ, we randomly
select one original audio and three different phrases, each appended with two spaces. To avoid the impact
of different target-phrase lengths, we use the set of target phrases with different words stated in Table 6.1.

Figure 6.4: Performance of the FAAG with increasing length of the selected audio clip for generation.
(a) Success rate; (b) distortion.

Fig. 6.4 shows the performance of our adversarial attack as the ratio of frames clipped for adversarial
audio generation increases. When the ratio of frames reaches 100%, the results are similar to our baseline.
In general, the frame ratio can influence both the success rate and the distortion of the generated
adversarial audio. The best accuracy outperforms the baseline by around 13% when targeting the phrase
‘call john smith’, while the best distortion is around 7dB smaller than the baseline's. However, based on
our observations, there are no specific rules for how λ impacts the adversarial example's success rate and
distortion. We infer that the reason may be the different effort required to transform various combinations
of original characters into the target phrase: the greater the phoneme gap between the selected combination
of original characters and the target phrase, the louder the noise needed to generate the adversarial audio.
In all, the best λ for fine-tuning the length of the selected audio clip depends on the specific original audio
and the target phrase.
Apart from the effectiveness, we also consider the efficiency of our FAAG method. From the results
shown in Fig. 6.5, we conclude that less time is spent when a shorter audio clip is used. To investigate
further, we generate adversarial examples using one original audio targeting three different phrases.

Figure 6.5: Duration of the FAAG generating each adversarial audio ten times. (a) One original audio
with three target phrases; (b) different original audios with one target phrase.

Herein, these phrases differ only in their words and have the same length. As shown in Fig. 6.5a, different
words in the target phrase do not affect the time spent on adversarial example generation. Additionally,
we generate adversarial examples using three different original audios targeting the same phrase. Although
different audios require different amounts of time for FAAG, their growth rates in time are similar as the
ratio of frames increases.

6.4.3 Effectiveness and Efficiency Analysis


We evaluate our generation method by comparing the generation performance of target phrases with
different words, target phrases with different lengths, and comparing the generation performance with the
baseline. Our experiments are conducted under a limited resource, as we only use CPUs and one GPU. In
the rest results, when the ratio of frames is equal to 100%, this experiment uses the baseline method to
generate the adversarial example.

Adversarial Generation of Target Phrases with Various Words

We generate audio adversarial examples for the 100 randomly selected TIMIT audio clips. To compare
the adversarial attack performance on target phrases with different words, we choose target phrases from
the set named target phrases with different words in Table 6.1. We record the average success rate and
distortion, while the duration is counted over the generation of all 100 adversarial examples. Assuming
that the attacker does not have enough time to generate the adversarial example, we run the
optimization-based method for only 1,000 iterations to generate the adversarial example
x′_begin = x_begin + δ. Compared with the baseline results, Table 6.3 lists the performance of FAAG in
an environment with limited computational resources and time. We analyze the results from the
effectiveness and efficiency aspects.
For the effectiveness analysis, we focus on the success rate and distortion results in Table 6.3. In
general, updating the noise and distortion within 1,000 iterations, our FAAG method shares similar
effectiveness with the baseline.
Table 6.3: Performance of adversarial generation on target phrases with different words. (Duration results
are in Hour:Minute:Second format for all 100 adversarial audio generations; the other results are averaged.
For each phrase, the first row uses the baseline [94] with a 100% ratio of frames and the second row uses
FAAG.)

Target Phrase       Success Rate   dBx(δ)   Duration   Ratio of Frames

call john smith     89.07%         29       01:10:45   100%
                    91.33%         35       00:26:09   45.65%
call david john     90.06%         27       01:11:08   100%
                    91.13%         32       00:26:29   45.65%
play music list     90.39%         26       01:11:18   100%
                    90.67%         33       00:26:21   45.65%

All success rates are around or above 90%, while the distortions are around 30dB. The impact of different
words in the target phrase can be ignored given the slight differences. As discussed in the previous section,
the audio clip's frame length affects the whole adversarial example's performance, but the impact depends
on the specific original audio and target phrase; attackers can fine-tune this factor using λ. In all, our
FAAG method does not negatively affect the effectiveness of the attack proposed in previous work, and
different words in the target phrase do not have a particularly noticeable impact.
For the efficiency analysis, we focus on the duration and ratio_frame. Generally, the ratio of frames used
to generate the adversarial examples positively correlates with the duration required for generation.
Specifically, when the target phrase is ‘call john smith’, almost half (45.65%) of the original audio is
clipped for FAAG, and about half of the time is spent compared to the roughly one hour used by the
baseline. When the target phrases differ in words but have the same length, the average ratio of frames is
the same; accordingly, the duration of FAAG targeting these three phrases is similar.

Adversarial Generation of Target Phrases with Different Lengths

Audio adversarial examples are also generated for the selected 100 TIMIT audio clips with the other
phrase set. To compare the adversarial attack performance on target phrases with different lengths, we
choose target phrases from the set named target phrases with different lengths in Table 6.1. Similar to the
above experiment, we evaluate each adversarial example's construction and record the average results in
Table 6.4. Assuming that the attacker wants to generate the adversarial example as quickly as possible,
the optimization-based method is executed for 1,000 iterations to generate the adversarial example
x′_begin = x_begin + δ. Table 6.4 lists the performance of FAAG and the baseline in an environment
with limited computational resources and time.

Table 6.4: Performance of adversarial generation on target phrases with different lengths. (Duration results
are in Hour:Minute:Second format for all 100 adversarial audio generations; the other results are averaged.
For each phrase, the first row uses the baseline [94] with a 100% ratio of frames and the second row uses
FAAG.)

Target Phrase                Success Rate   dBx(δ)   Duration   Ratio of Frames

call john smith              89.07%         29       01:10:45   100%
                             91.33%         35       00:26:09   45.65%
call john                    84.44%         27       01:10:25   100%
                             85.11%         33       00:17:42   29.54%
call john smith and david    95.47%         32       01:11:49   100%
                             95.52%         35       00:41:14   70.21%

Table 6.5: Comparing the performance of adversarial generation on the audio file SA1.wav with the
baseline result. The target phrase is “call john smith” for all experiments here. The Time here is counted
for this adversarial audio generation.

Generation Method   Ratio of Frames   Success Rate   dBx(δ)   Time

FAAG                51.96%            98.6%          28.12    26min
[94]                100%              82.67%         29.33    70min

For the effectiveness analysis, we examine the success rate and distortion results listed in Table 6.4.
Overall, updating the noise and distortion within 1,000 iterations, all averaged success rates of our method
surpass 85%, which is comparable to the baseline. The target phrase's length is not the deterministic factor
for either the attack's success rate or the distortion. As discussed in the previous section, the best distortion
can be found by tuning λ in FAAG for a specific target. In all, no matter how long or short the target phrase
is, the performance of our FAAG method is as high as that of the attack proposed in previous work.
For the efficiency analysis, we focus on the duration and ratio_frame in Table 6.4. Similar to the results
in Table 6.3, the ratio of frames used to generate the adversarial examples has a positive correlation with
the required generation time. When the target phrase is ‘call john smith and david’, even though around
70% of the whole audio is used for the attack, only around half of the baseline's time is spent. Moreover,
unlike the baseline, the shorter the target phrase is, the more time is saved using our FAAG method. In all,
the FAAG method is less time-consuming than the previous work, especially when computational resources
are limited.
Over a hundred phrases were collected as target phrases. These phrases are common voice commands
that users typically issue to surrounding voice assistants, including Google Assistant and Apple Siri
[230, 231, 232]. We randomly selected 100 benign audio clips from the TIMIT dataset and randomly
chose one phrase from the collected set as the target phrase for each audio clip. It took FAAG 20 minutes
and 59 seconds to complete all 100 adversarial generations. Compared with the baseline [94], which took
50 minutes and 44 seconds, FAAG was much faster at generating adversarial examples targeting common
command phrases. For the effectiveness analysis, FAAG showed a slight advantage in average success rate
(90.45% > 88.9%) and a slightly larger distortion (33.11dB > 28.38dB). In all, the FAAG method is more
efficient with limited computational resources.

Comparison with the Baseline using CPUs

FAAG is an improved method based on the iterative optimization-based method proposed in [94], which
is the baseline in this chapter. Empirical results show that our method is significantly more efficient at
adversarial audio generation with only one GPU and CPUs. We also compare the two methods' performance
using CPUs only, primarily focusing on their success rate, dBx(δ), and time, as listed in Table 6.5. Apart
from using the CPU only, we compare their performance over a small number of iterations (1,000) when
the target phrase is “call john smith”. Again, each adversarial example generation is repeated ten times
and the averaged results are recorded.
Choosing the best λ for the specific audio SA1.wav with one target phrase, we obtain the FAAG results
shown in Table 6.5. FAAG significantly exceeds the baseline [94] in both time and success rate when using
only CPUs with the best λ. The reason is that FAAG only needs to modify part of the original audio frames,
while the baseline modifies the whole audio. Without any GPUs, the time advantage of FAAG is even
more prominent. As for the success rate and distortion, by choosing the best λ, FAAG can reach or even
surpass the baseline.

Figure 6.6: Visualizations of three pairs of waveforms. Each column represents a pair of an original audio
and its adversarial example when the target phrase is “play music”. The images in the first row present the
original waveforms of SA1.wav, SI488.wav and SI667.wav; the images in the second row present
the waveforms of the corresponding adversarial audios.

Case Study

We present the detailed process of generating adversarial examples with three audio clips, SA1.wav,
SI488.wav, and SI667.wav, with the target phrase “play music” as a case study. We assume that the
attacker only uses CPUs to generate the adversarial example for the targeted attack against an ASR model
and must generate it within one hour. The target ASR model is a pre-trained DeepSpeech model. For the
original audio SA1.wav, the attacker aims to embed the target phrase “play music” at the beginning of
the original audio by adding noise that will be recognized by the target model but is inaudible to human
beings. Algorithm 1 is used to select a proper clip at the beginning of the original audio x, marked as
x_begin = x − x_rest. Then, following Algorithm 2, we minimize the problem using the Adam optimizer
and the CTC loss function with a learning rate of 100 and 1,000 iterations. Finally, one adversarial example
of the audio SA1.wav is generated. The averaged result is obtained by repeating the previous step ten
times.
We randomly select one adversarial example for each of SA1.wav, SI488.wav and SI667.wav
with the target phrase “play music”; their success rates reach 100%, 100%, and 90.9%, respectively. We
then plot the waveforms of these adversarial examples against the waveforms of their original audios, as
shown in Fig. 6.6. In general, it is challenging to differentiate the two waveforms when the distortion is
34dB, 36dB, and 39dB, respectively; the beginning part of each adversarial example's waveform is slightly
thicker than that of its original audio. Since we only modify the beginning part of the audio, generating
the adversarial examples is faster than modifying the whole waveform.

Table 6.6: Comparison of generation Time between the baseline method and FAAG. All results are
averaged values, while time is in the Hour:Minute:Second format.

Target Dataset           Target Phrase(s)                 Using GPU   Generation Time_Baseline   Generation Time_FAAG   Speedup

SA1.wav                  call john smith                  Yes         00:06:03                   00:02:25               60.1%
SA1.wav                  call david jone                  Yes         00:06:05                   00:02:25               60.3%
SA1.wav                  play music list                  Yes         00:06:01                   00:02:26               59.6%
SI2248.wav               call john smith                  Yes         00:05:34                   00:02:24               56.9%
SI667.wav                call john smith                  Yes         00:08:26                   00:02:28               70.6%
SA1.wav                  call john smith                  No          01:00:00                   00:26:00               56.7%
100 TIMIT audio clips    Phrases with different words     Yes         01:11:04                   00:26:20               63.4%
100 TIMIT audio clips    Phrases with different lengths   Yes         00:50:44                   00:20:59               60.0%

6.4.4 Summary in Speed Advantage


FAAG significantly outperforms the baseline method in terms of generation time. Table 6.6 summarizes
the generation time under different conditions using the baseline method [94] and FAAG. When adversarial
examples are generated with CPUs only, FAAG is around 56.7% faster than the baseline method; when
they are generated with one GPU and CPUs, FAAG is around 60% faster.

6.5 Discussion on Different Position of Adversarial Audio Clip


6.5.1 Different Position of Adversarial Audio Clip
Besides hiding the target phrase at the beginning of the audio, we also discuss other positions, including
the middle and the end of the audio. These two positions are also meaningful in some cases. For example,
when the target phrase contains a trigger word of an ASR system, the ASR system only listens to the
sentence after that trigger word. In this section, we present and analyze the results for these two hiding
positions. The original audio is SA1.wav, and the phrase is ‘call john smith’.
The selection of audio frames clipped from the middle or the end of the audio differs slightly from the
beginning case. For hiding the phrase at the end of the audio, we prepend two spaces to the phrase to form
the final target phrase. The frame length selection is similar to Algorithm 1; the differences are
index = |x| − |x_end| and x_end ← x[index :]. For hiding the phrase in the middle of the audio, we
append two spaces at both the beginning and the end of the phrase to form the final target phrase. The
frame length determination is similar to the beginning case, but the final indices computed in Algorithm 1
are different. Another index, index′, is introduced for the middle-position generation. Assuming the first
three characters of output logits are left unchanged, index′ = (|c| / |y|) × 3 × s and
index = index′ + (|c| / |y|) × |t_allocated| × s. The original audio is then separated into two untouched
clips and one target clip, marked as x_rest, x_middle, and x_rest′. Specifically, x_rest ← x[: index′],
x_middle ← x[index′ : index], and x_rest′ ← x[index :].
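
The index computation above can be sketched as follows; the helper split_for_position, the keep_chars default of three logit characters, and the toy numbers are illustrative assumptions rather than our exact implementation.

import numpy as np

def split_for_position(x: np.ndarray, y: str, t_allocated: str, num_logits: int,
                       step: int, position: str, keep_chars: int = 3):
    """Return the clip to perturb and the untouched remainder(s) for a given position."""
    frames_per_char = num_logits / len(y)                      # |c| / |y|
    clip_len = int(frames_per_char * len(t_allocated) * step)
    if position == "begin":
        return x[:clip_len], (x[clip_len:],)
    if position == "end":
        index = len(x) - clip_len                              # index = |x| - |x_end|
        return x[index:], (x[:index],)
    if position == "middle":
        index0 = int(frames_per_char * keep_chars * step)      # leave the first logits intact
        index = index0 + clip_len
        return x[index0:index], (x[:index0], x[index:])
    raise ValueError("position must be 'begin', 'end', or 'middle'")

# Toy usage with illustrative numbers only.
x = np.zeros(3 * 16000)
clip, rest = split_for_position(x, "she had your dark suit in greasy wash water",
                                "  call john smith", 150, 320, position="middle")
print(len(clip), [len(r) for r in rest])
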
Fig. 6.7 shows that hiding the phrase at the beginning of the audio reaches or surpasses the baseline result.
The success rates of hiding the phrase at the end of the audio are always lower than 25%. Specifically,
the first word ‘call’ cannot be recognized in most cases. Under this situation, the attack is meaningless
because the ASR system only listens to the sentences after the trigger word.
Figure 6.7: Averaged accuracy of the FAAG used in different hiding positions. The baseline is the result
using the entire audio sample to generate the adversarial audio [94].

As for hiding the phrase in the middle of the audio, the adversarial attack did not work regardless of how
long a frame was clipped for generation. We infer that this is due to the powerful language model within
the end-to-end ASR system: considering contextual information, especially the preceding text, the latter
part of the transcription is influenced.

6.5.2 Countermeasures
Because of the dramatic decrease in the attack's success rate, we identify an effective protection method
against the current targeted adversarial attack on end-to-end ASR systems: for any suspicious audio, we
can prepend benign audio to its beginning before playing it. Although the accuracy of the original
transcription is affected, the success rate of the targeted attack decreases.
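
A minimal sketch of this countermeasure follows, assuming the soundfile library for WAV I/O; the file names are hypothetical, and the combined recording would then be passed to the ASR model as usual.

import numpy as np
import soundfile as sf   # assumed available; any WAV I/O library would work

def prepend_benign(benign_path: str, suspicious_path: str, out_path: str) -> None:
    """Prepend a full benign recording to a suspicious recording before transcription."""
    benign, sr1 = sf.read(benign_path)
    suspicious, sr2 = sf.read(suspicious_path)
    assert sr1 == sr2, "both recordings must share the same sample rate"
    sf.write(out_path, np.concatenate([benign, suspicious]), sr1)

# Hypothetical usage:
# prepend_benign("SA1.wav", "suspicious.wav", "guarded.wav")
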
To evaluate the protection, we prepend benign audio to the generated adversarial examples and transcribe
them with the target ASR model. We select the clean audio SA1.wav and prepend its complete audio
frames to the beginning of each suspicious audio. If the suspicious audio is benign, we expect the
prepended audio not to degrade the benign audio's transcription; if the suspicious audio is an adversarial
example, we expect the combined audio's transcription not to include the attacker's target phrase. Assume
the attacker's target phrase is “call john smith”. The suspicious audios include benign audio, several
adversarial audios generated using the baseline method [94], and several adversarial audios generated
using our method FAAG. Each experiment is repeated ten times, and the averaged results are presented.
Table 6.7 shows the performance of the various audios, comparing their translations to the target phrase
and to their true translation text. In general, prepending a complete benign audio sample to any suspicious
audio is an effective countermeasure against the targeted adversarial attack. Specifically, comparing
SA1.wav with SA1.wav+SA1.wav, this protection causes only a slight decline in translation accuracy.
Comparing Baseline with SA1.wav+Baseline and FAAG with SA1.wav+FAAG, the target phrase
is no longer included in the final translation once the benign audio is prepended.
Table 6.7: Comparing the performance of various audio files. SA1.wav is the benign audio that we
selected to prepend to the suspicious audios. SA1.wav+SA1.wav represents the combination of two
benign audios. Phrase represents the target phrase, while Trans is the final translation text. Success Rate
is the attack's success rate, while Accuracy in Trans compares the predicted translation with the true
translation.

Audios Phrase in Trans? Success Rate Accuracy in Trans


SA1.wav No None 98.04%
SA1.wav+SA1.wav No None 96.12%
Baseline Yes 82.67% None
SA1.wav+Baseline No None 92.03%
FAAG Yes 94.45% None
SA1.wav+FAAG No None 92.52%

Table 6.8: Performance of FAAG and Yakura and Sakuma’s method [97]. The target phrase is “call nine
one one” selected from common voice commands used in the previous experiments. Ten benign audio
clips were randomly selected from the TIMIT dataset. All the reported results are averaged.

Generation Method Ratio Frames Success Rate dBx (δ) Generation Time
[97] 100% 72.06% 84.56 30min
[97]+FAAG 63.69% 70.89% 79.94 21min

Meanwhile, by prepending benign audio, the accuracy of the original audio's translation surpasses
90%.

6.5.3 Transferable FAAG


FAAG aims to achieve fast adversarial audio generation. According to Section 6.3, the proper length
selection depends on the target model's structure, the benign audio's true transcription, and the target
phrase. This chapter investigates a recurrent-network-based ASR model targeted by an iterative
optimization-based method. To investigate FAAG's transferability, we adopt another, similar adversarial
attack against the DeepSpeech model. We also discuss how to find the proper frame length when targeting
generic ASR models, such as a conventional ASR model.
Yakura and Sakuma [97] proposed a robust adversarial audio generation method evaluated in an
over-the-air setup. The success rate reached 100% at the cost of more than 18 hours of computational time
for each audio. Various techniques like band-pass filters, impulse response, and white Gaussian noise
were applied in [97] to achieve the best outcomes for very few audio clips. To compare with Yakura and
Sakuma’s method, we chose a small set of audio clips with one target phrase. Thus, FAAG and Yakura
and Sakuma’s method became comparable.
We selected one phrase from the common voice commands used in the previous experiment, i.e.,
“call nine one one”. Ten benign audio clips were randomly selected from the TIMIT dataset. The
target ASR model is deepspeech-0.4.1, different from [97]. Yakura and Sakuma's method requires
adjusting perturbation magnitudes whenever either the input sample or the target phrase changes; thus, we
adjusted perturbation magnitudes for the audio clips chosen for FAAG. This experiment was designed to
measure the speed of the generation method, so the over-the-air attack was not evaluated. To ensure fair
and reproducible comparisons and to avoid the influence of environmental noise, we used the direct
translating method, which differs from the over-the-air evaluation in [97]. Moreover, the same number of
training iterations was applied for generating each adversarial audio example. Twenty adversarial audio
examples were generated for each audio file using the method in [97].
examples directly into the ASR model, the best example was selected by comparing their transcriptions
with the target phrase. Then, FAAG was used with the same setup. Table 6.8 shows the averaged results of
ten audio clips. The results demonstrated that FAAG was faster than the method in [97] while generating
adversarial examples of similar quality.
When the target model’s structure is different, the frame length selection will differ. Kaldi [167] is a
conventional ASR model based on hidden Markov models (HMMs). There are no logit outputs to the
CTC decoder. Thus, Equation 6.7 no longer holds for Kaldi. A different scheme is required to find an
explicit relationship among phoneme, HMM-state, the true transcription, the benign audio, and the target
phrase. We leave this investigation as our future work.

6.6 Conclusion and Future Work


We propose the FAAG method to generate adversarial examples for audio clips under white-box access.
A novel algorithm determines the appropriate length of the audio frame for adversarial attacks, according
to the target phrase, the original audio, and the logit outputs of the target ASR model. The adversarial
example reaches a high success rate by adding negligible noise, and the generation process can be
completed within a short period using CPUs with one GPU or even without GPUs. Our empirical studies
show that adding noise to part of the audio can effectively generate an adversarial example. The distortion
of the generated adversarial examples is similar to the baseline method's, implying little quality loss, and
FAAG maintains its success rate regardless of the choice of words in the target phrase. More importantly,
different positions for hiding the phrase are discussed: hiding the phrase at the beginning of the audio is a
plausible attack, while prepending benign audio to the beginning of adversarial audio can effectively
protect a service from targeted adversarial audio attacks. We verify the generated adversarial examples
over-the-line to ensure the correctness of the results. Conducting the attack over-the-air and under
black-box access is left to future work.

Chapter 7

Research Challenges and Future Work

In Chapter 2, the recent publications about ML-based stealing attacks against controlled information and the corresponding defense methods are reviewed. Some attacks can steal the information, but they make strong assumptions about the attacker's prior knowledge. For instance, the attacker is assumed to know the ML algorithm as a necessary condition prior to stealing the model or training samples. However, this prior knowledge is not always publicly known in real-world cases. Additionally, the attack methods are not mature technologies and leave great room for improvement. Chapter 2 outlines the target and accessible data for each paper, and Table 2.11 summarizes the core research papers from the perspectives of attack, protection, related ML techniques, and evaluation. The following sections discuss the future directions of the ML-based stealing attack and feasible countermeasures, as shown in Figure 7.1.

Figure 7.1: The Challenges of ML-based Stealing Attack and Its Defenses

7.1 Attack
During the battle between attackers and defenders, it is crucial for defenders to anticipate the directions of the attackers' future actions. To discuss these future directions, the challenges of the ML-based stealing attack are analyzed. The analysis results and possible solutions can be summarized and regarded as the future directions of the ML-based stealing attack. In this section, challenges and future directions are discussed along the five phases of the MLBSA methodology, namely reconnaissance, data collection, feature engineering, attacking the objective, and evaluation.

7.1.1 Reconnaissance
As illustrated in Chapter 2.1.1, the reconnaissance phase consists of two main tasks — defining the target and analyzing the valuable accessible data. The definition of the target determines which kinds of accessible resources are valuable. The subsequent attack mechanism is designed according to the analysis of the accessible data during the reconnaissance phase. For a stealing attack to succeed, it is essential that the information accessible to legitimate users contains valuable information.
One challenge during the reconnaissance phase is the lack of effective information in the accessible data. As stated in Chapter 2, the first category of attack — stealing user activity information — primarily relies on accessible data sources including the kernel data and the sensor data. The attacker captures this information without special permissions and utilizes representations of different user activities, as explained in Chapter 2.1. Setting appropriate permission requirements can protect the accessible data from being exploited by the attacker. For example, Android version 8 restricts access to kernel resources, including interrupt timing log files [36]. Because of the insufficient amount of information collected under the black-box setting, the model/training data obtained by the stealing attacks is not enough to reconstruct an ML model as good as the original model [18, 12, 29]. As for the third category of attack, the majority of the stealing methods are based on a large amount of PII, effective sensor data, or coarse-grained cache data. However, such information, especially PII, is sensitive enough to raise privacy concerns and may be better protected in the future [233, 234]. All in all, restricting access to the current attack vectors can block part of the ML-based stealing attacks.
To deal with the lack of information, the attacker's future work is to find new exploitable data sources as replacements. Some plausible solutions are proposed by [36] and [49]. [36] defines all targets of interest in a list and automatically triggers the activities of interest, followed by searching for exploitable log files in the newest versions of the targeted system (Android versions 7 and 8); and it is shown to be worthwhile to simply monitor the changes in accessible data such as the sensor data, because detecting these changes gives clues for stealing information. Consequently, future directions include searching for exploitable sources in iOS and monitoring possible changes of sensor data in order to perform a potential stealing attack. For the attack stealing authentication information, a solution is to use the Corpus of Contemporary American English (COCA) instead of PII. Using the COCA corpus, a successful password guessing attack was performed by [33]. Additionally, analyzing the password structure with anthropological analysis [33] may reduce the attacker's reliance on PII. Attackers are expected to search for new sources or explore new characteristics for further attacks.

7.1.2 Data Collection


Determining the valuable accessible data is only one part of an ML-based stealing attack. To take advantage of the ML mechanism, the dataset collected in this phase should guarantee its representativeness, reliability, and comprehensiveness. If any one of the three is unsatisfactory, the results of the stealing attack will be inaccurate.
The first challenge is collecting data that is representative of the targeted information across all systems and devices. Especially when the valuable data is kernel data or sensor data, the forms of data recording may vary greatly across systems and devices. Regarding this problem, the data was collected by [10] and [47] from eight different mobile devices and from different machines. Hence, a future work is collecting data from heterogeneous sources and aggregating the representative data. The variety of forms of representative information affects the attack's probability of success.
The second challenge appears while collecting a reliable dataset. The quality of the training dataset is critical to the attack performance. Most of the explored stealing attacks utilized the model's query outputs — confidence values associated with the attacker's query inputs. The preciseness of these values affects the success of the attack. Specifically, the confidence information was leveraged by [12], [29] and [18] to imitate the objective ML model through techniques such as equation solving, path searching, and inversion attacks, as summarized in Table 2.2 and Table 2.3. Furthermore, the performance of an important attack, the membership inference attack [38], depends on the training dataset of the shadow models. This training dataset can be generated based on the target model's confidence information, so that it is distributed similarly to the targeted training set. Under these circumstances, the above attacks will not succeed if the target model's API outputs only the class label or polluted confidence information. This inconvenience was scrutinized by [12], and a method was proposed to extract the model when only class labels are provided. Accordingly, these findings were further explored in the context of several other ML algorithms, together with the monetary advantage over using the ML APIs as an honest user. The poor quality of a collected dataset hinders the success of the ML-based stealing attack.
The third challenge, comprehensive dataset collection, involves determining the size and distribution of the training dataset and the testing dataset. The size of the training inputs often dictates whether the attacker can easily obtain all possible classes of the targeted controlled information, especially when the predictive model outputs only one class per query. In [38], a comprehensive training dataset was collected by generating data with a distribution similar to the targets. The size of the testing dataset indirectly indicates the amount of controlled information that attackers can learn. For instance, the testing set size of a membership inference model depends on how many training members might be included and distinguished [38]. A future work may investigate the impact of the size and distribution of the training and testing datasets on the success of ML-based stealing attacks. A partial or imbalanced distribution reduces the success rates of stealing attacks.
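To make the data-generation idea in [38] more concrete, the sketch below hill-climbs random records until the target model assigns them high confidence for a chosen class, so that the accepted records approximate the distribution of the targeted training set. The query function, the feature range, and all thresholds are illustrative assumptions, not the original implementation.

import numpy as np

def synthesize_record(query, target_class, n_features,
                      conf_threshold=0.9, max_iters=1000, k_perturb=3):
    """Search for a record that the target model classifies into target_class
    with high confidence (a simplified version of the synthesis idea in [38])."""
    rng = np.random.default_rng()
    x = rng.random(n_features)                    # random starting record in [0, 1)
    best_conf = query(x)[target_class]
    for _ in range(max_iters):
        candidate = x.copy()
        idx = rng.choice(n_features, size=k_perturb, replace=False)
        candidate[idx] = rng.random(k_perturb)    # perturb a few randomly chosen features
        conf = query(candidate)[target_class]
        if conf > best_conf:                      # keep the move only if confidence improves
            x, best_conf = candidate, conf
        if best_conf >= conf_threshold:
            return x                              # accept as a synthetic training record
    return None                                   # synthesis failed for this class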

7.1.3 Feature Engineering


Feature engineering in the MLBSA methodology aims to refine the collected data for an effective and efficient training process. It is critical to the performance of ML-based attacks because it eliminates noise from the collected data. However, among the current research, the techniques used in feature engineering remain underdeveloped.
As shown in Table 2.6, Table 2.7, Table 2.8 and Table 2.10, many existing works select features manually. Manual feature selection, relying on the attacker's domain-specific knowledge and human intelligence, usually produces a small number of features. That is, manual feature selection is inefficient because of the nature of human involvement, and it may ignore useful features with low discriminative power. To improve the attack's effectiveness, automating feature selection has great research potential. For example, [39] used a CNN to learn features based on the correlations among data for optimal classification.
A future trend of the attack is searching for or developing other automatic methods to outperform manual feature selection [235]. [236] developed a regression-based feature learning algorithm to select and generate features without requiring domain-specific knowledge. Automating feature selection with such generic algorithms would improve the efficiency and effectiveness of ML-based attacks.
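As a minimal sketch of such automated selection, assuming a tabular feature matrix X_train and activity labels y_train have already been extracted from the collected traces (with X_test and y_test held out for evaluation), a generic selector can replace manual picking; the choices of scoring function, k, and classifier below are illustrative assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline

# Keep the 20 features with the highest mutual information with the label,
# then train the attack classifier on the reduced representation.
attack_model = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=20),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
attack_model.fit(X_train, y_train)
print(attack_model.score(X_test, y_test))   # accuracy on unseen traces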

7.1.4 Attacking the Objective


In the phase of attacking the objective with ML techniques, the main tasks include training and testing the ML model used to steal the controlled information. There are a few challenges for stealing attacks with respect to training and testing ML models, including unknown model algorithms, unknown hyperparameters of the ML model, and a limited number of testing attempts.
For ML-based attacks stealing the controlled model/training data, the first challenge is that most of the research considered the model algorithm as prior knowledge. However, the model algorithms of many MLaaS offerings are unknown to the end-user. Most of the attacks would not succeed without specifying the correct model algorithm. This concern was discussed in [29]; afterwards, a conclusion was drawn that attacks without knowledge of the model algorithm can be impossible in some circumstances. It is worth investigating the possibility of success for an attack in the context of an unknown model algorithm. In 2019, [43] considered a membership inference attack against a black-box model whose algorithm is unknown by choosing a threshold. However, whether this method is applicable to other attacks under black-box access, such as the parameter stealing attack [12], remains unknown.
The second challenge involves the unknown hyperparameters of the ML model learned by the stealing attack compared to the targeted model. The more precisely the model is learned, the more accurate the model's functionality; and the more precisely the model is learned, the more detailed the training records that will be revealed. Stealing attacks predominantly steal the model by estimating the parameters of matching objective functions. However, another critical element — the hyperparameter — has been ignored, even though its values influence the accuracy of stealing attacks. In [29], a method was proposed that enables an attacker to calculate the hyperparameters of some ML algorithms, covering a set of linear regression algorithms and three-layer neural networks. A future direction toward solving this difficulty is to enable hyperparameter stealing attacks against other popular ML algorithms such as k-NN and RNN. With unknown hyperparameters, only the parameters can be calculated while extracting the ML model.
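To make the stationarity idea behind such hyperparameter recovery concrete, the sketch below considers ridge regression as the simplest case: the learned parameters w minimise ||Xw - y||^2 + lambda*||w||^2, so the gradient 2*X^T(Xw - y) + 2*lambda*w vanishes and lambda can be recovered by least squares. This is a hedged illustration of the general principle rather than a reproduction of the method in [29]; the synthetic data and names are assumptions.

import numpy as np

def steal_ridge_lambda(X, y, w):
    """Estimate the regularization hyperparameter of a ridge regression model
    from its training data (X, y) and learned parameters w, using the
    stationarity condition X^T (Xw - y) + lambda * w = 0."""
    grad_loss = X.T @ (X @ w - y)   # gradient of ||Xw - y||^2 at w (up to a factor of 2)
    grad_reg = w                    # gradient of ||w||^2 at w (the same factor of 2 cancels)
    return float(-(grad_reg @ grad_loss) / (grad_reg @ grad_reg))

# Illustrative check with synthetic data and a known lambda.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
lam_true = 0.5
w = np.linalg.solve(X.T @ X + lam_true * np.eye(5), X.T @ y)  # closed-form ridge fit
print(steal_ridge_lambda(X, y, w))  # recovers a value very close to 0.5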
The third challenge relates to password guessing attacks, the third category of stealing attacks. This type of attack generally assumes unlimited login attempts for each account. One exception is [31], where each login password attack was performed fewer than 100 times. To crack passwords effectively, researchers applied ML algorithms to analyze password generation based on personal information, website-related information and/or previously leaked public passwords. The future work for this stealing attack can be a successful attack mechanism designed for the targeted authentication system with fewer than 100 login attempts. With limited login attempts, a guessing attack may fail within the first few guesses.

7.1.5 Evaluation
To effectively infer the controlled information, most of the investigated research applied the ML mechanisms mentioned in Chapter 2. Predicting unknown testing samples is a challenge for ML-based stealing attacks, as supervised learning algorithms dominate the attack methods. That is, if the true label of a testing sample has not been learned by the model during the training phase, this sample will be assigned to an incorrect class. Testing samples that are unknown to the training dataset affect the evaluation results and subsequently reduce the stealing attack's accuracy. To improve the performance of such attacks, the attacker needs to achieve breakthroughs in predicting unknown data.
For the attack stealing user activities, when an attacker wants to know the foreground app running on a user's mobile device, some distinctive features of the accessible dataset which represent the status of running apps are extracted and learned by ML algorithms [10, 37, 48]. After the attack model is trained, the accessible data recording a new foreground app running on the mobile device is the testing sample, and this new app is unknown to the attack model [10, 37, 48, 50]. For attacks stealing authentication information, it is difficult to be effective when users change their passwords frequently or adopt new keyboard layouts [34]. This is owing to the uncertainty of users' password generation behaviors and the variety of users' input keyboards. Evaluating the prediction of unknown classes is a challenging task for stealing attacks against user activities and authentication information.

7.2 Defense
Targeting diverse controlled information, the countermeasures that protect the information from ML-based stealing attacks are summarized. In general, the countermeasures fall into three groups: 1) detection, which identifies critical indications related to an attack; 2) disruption, which perturbs the accessible data at a tolerable cost to the service's utility; and 3) isolation, which limits the access to some valuable data sources. As depicted in Figure 7.1, the countermeasures are mainly applied in the first two phases. Specifically, isolation restricts the attacker's access and makes the attack fail in the first phase, while disruption can confuse the attacker in the second phase and hinder the attacker from building a successful attack model. The detection techniques can detect the attacker's actions and then protect the information from being stolen. These issues are explained as follows.

7.2.1 Detection
To detect potential stealing attacks in advance, the relevant critical indications are identified by analyzing the functionality related to the controlled information. Defenders should notice the attackers' actions as soon as the attackers start the reconnaissance or the data collection processes. Based on the attackers' future directions, detection is proposed accordingly in order to prevent the attack at an early stage and minimize the loss from stealing the controlled information.
In the presence of any malicious activities in the reconnaissance and data collection stages, the changes in and the usage of the relevant critical information should be analyzed and checked. For example, when the attacker steals information based on the accessible sensor data obtained from certain APIs, the calling rate and the API usage can be deemed two critical indications for detection [37]. Since attackers may intend to exploit unknown critical indications, a defender can trade off monitoring the access frequency of all related information against the service's utility. Thereafter, the defender detects unusual access frequencies to counter the stealing attack.
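A minimal sketch of such rate-based detection is shown below; the window size, the threshold, and the notion of a sensor-API call are illustrative assumptions rather than values taken from the surveyed systems.

from collections import deque
import time

class SensorApiRateMonitor:
    """Flag an app whose sensor-API calling rate within a sliding window
    exceeds a threshold, one of the critical indications discussed above."""

    def __init__(self, max_calls=200, window_seconds=10.0):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = deque()                 # timestamps of recent calls

    def record_call(self, now=None):
        """Record one API call; return True if the current rate looks suspicious."""
        now = time.monotonic() if now is None else now
        self.calls.append(now)
        while self.calls and now - self.calls[0] > self.window:
            self.calls.popleft()             # drop calls outside the sliding window
        return len(self.calls) > self.max_calls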
Another detection method is assessing the service's ability to secure the controlled information. This protection promises that the ML-based attack for stealing information is less powerful than current attacks. The memory page's cacheability was managed by [40] to protect secret keys within users' memory activities, while password guessability was checked by [32], after which users were alerted to weak passwords. If a defender can assess the ability to hide the ML model and training set from unauthorized access, the controlled information is protected to some extent. The detector alerts the user when the assessed ability falls below a certain threshold.

7.2.2 Disruption
Disruption can protect the controlled information by obstructing the information used in each phase of the MLBSA methodology. Disrupting the accessible data currently involves two methods: adding noise to the data sources and degrading the quality/precision of the service's outputs. For more advanced countermeasures, further research is needed to better understand the attacker's future directions.
By disrupting the accessible data, attackers cannot find valuable accessible sources in the reconnaissance phase, obtain a reliable dataset in the collection phase, or use feature engineering effectively. Therefore, disruption minimizes the success rate of the ML-based stealing attack. The major technique for adding noise is differential privacy, as applied in [48, 37, 88, 138, 50]. As for the latter method, the specific techniques include rounding or coarsening the output values (i.e., predictions/confidence scores) [37, 12, 18, 38], and regularizing/obfuscating the accessible data sources [10, 49, 88, 77, 34].
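The two disruption techniques listed above can be illustrated with simple output filters. This is a hedged sketch only: the rounding precision and noise scale are arbitrary examples, and the Laplace noise here is not calibrated to a formal differential-privacy budget.

import numpy as np

def coarsen_confidences(probs, decimals=1):
    """Round the confidence vector before returning it to the querying user,
    e.g. [0.6231, 0.2714, 0.1055] -> [0.6, 0.3, 0.1]."""
    return np.round(np.asarray(probs, dtype=float), decimals)

def laplace_perturb(values, scale=0.05, rng=None):
    """Add Laplace noise to an accessible data source, such as a counter or sensor reading."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)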
Advanced disruption methods against the attacker's further attacks should also be considered. Since attackers may search for and collect a class of information from several devices in the future, advanced disruption should consider adding different noise to similar information on various devices. To counter the advanced feature engineering techniques and ML-based analysis that attackers might apply, more sophisticated disruption methods can be applied to defend the controlled information, such as an adversarial training algorithm acting as a strong regularizer [77].

7.2.3 Isolation
Isolation can help the system get rid of the information stealing threat by hindering the attacker from the reconnaissance phase onwards. No matter how attackers improve their strategies and techniques, isolation can protect the controlled information by restricting access to the data. Specifically, it is effective to control the accessible data by restricting access or managing dynamic permissions [10, 36, 38, 35]. Since attackers may improve their stealing attacks, defenders can apply ML techniques to automatically control all accesses related to the targeted controlled information. However, this protection should be applied cautiously with regard to the utility of the service. On the one hand, specialists can remove some information channels which may reveal valuable information to the adversary [37, 30]. On the other hand, if attackers find new exploitable accessible sources in the future, it is challenging to isolate all the relevant data while ensuring the service's utility. Isolation protects the information by restricting access to the data.

7.3 Research Problem


Recently, machine learning techniques have been widely applied in various areas. Additionally, MLaaS platforms have emerged to aid users with limited computing power and/or limited ML expertise in using such techniques, such as Google ML [55] and Amazon ML [53]. The ML models they build are applied in cyber attack prediction [25], insider threat detection [209], network traffic classification [237, 238, 239, 240], spam detection [241], software vulnerability detection [242], and so on. Therefore, it is essential to maintain the confidentiality of the ML model and its training set.

However, as we surveyed in Chapter 2, plenty of ML-based stealing attacks can reconstruct the ML model and/or its training samples, and such attacks are difficult to detect. Our research problem is: for each ML-based service, we aim to check its confidentiality against the newest and strongest ML-based stealing attacks following the MLBSA methodology, and to develop a protection system to defeat the attack and maintain its confidentiality.
To launch a new advanced ML-based stealing attack, we follow the attack methodology — MLBSA — demonstrated in Chapter 2.1 and overcome the attack challenges listed in Chapter 7.1. The cyclical processes in the MLBSA methodology include reconnaissance, data collection, feature engineering, attacking the objective, and evaluation. The challenges we intend to overcome for new attacks are: 1) finding new effective information beyond the confidence information; 2) collecting a balanced and high-quality dataset in various formats, such as outputs with and without confidence probabilities; and 3) attacking the model and/or its training set without knowing its ML algorithm. Hence, a new advanced ML-based stealing attack with high performance can be developed.
The protection system can be designed with the help of the defense suggestions illustrated in Chapter 7.2. For instance, we can isolate user access to some valuable sources while disrupting some accessible information that has little effect on the service's utility. Additionally, we can develop a detector to reveal the stealing attack before the attacker succeeds. Finally, we can evaluate this protection system against advanced ML-based stealing attacks.
We list a few research questions, of which the first two have been addressed in this thesis and the last one is left for future research.

1. Discuss the protection method against user-level information leakage from membership inference attacks targeting a deep learning model under black-box access. Chapter 3, Chapter 4, and Chapter 5 limit the available resources gained from the black-box access. The user-level membership inference is conducted with no label knowledge against the ASR system.

2. The impact of stealing the ML model's information on a security domain, including adversarial machine learning, can be investigated. If the ML model's information can be inferred by the attackers, they can launch attacks against this ML model similar to attacks under white-box access. Chapter 6 explored an efficient advanced adversarial machine learning attack under white-box access with limited computational resources. Future work can conduct the adversarial attack under black-box access with the help of the stealing attack.

3. Protect the ML model from the previously proposed stealing attacks by rounding the confidence values and applying differential privacy to the model parameters.

Chapter 8

Conclusion

In this thesis, the ML-based stealing attacks against controlled information and the defense mechanisms published in the past five years are reviewed. The generalized MLBSA methodology, compatible with the published work, is outlined. Specifically, the MLBSA methodology uncovers how adversaries steal the controlled information in five phases, i.e., reconnaissance, data collection, feature engineering, attacking the objective, and evaluation. Based on the different types of controlled information, the literature was reviewed in three categories consisting of the controlled user activities information, the controlled ML model related information, and the controlled authentication information. The attacker is assumed to use the system without any administrative privilege. This assumption implies that user activities information is stolen by leveraging the kernel data and the sensor data, both of which are beyond the protection of the application. The attack against the controlled ML model related information is demonstrated with stealing the model description and/or stealing the training data. Similarly, keystroke data, secret keys, and password data are examples of the controlled authentication information that can be stolen.
Three related technical works on membership inference against an ASR model are investigated. Specifically, these three works propose three user-level membership inference methods to build audio auditors against an ASR model. The three auditors audit whether any user has unwillingly contributed their audio to train the target ASR model. The first auditor is conducted under black-box access to the ASR model whose output contains the confidence score. The second auditor is conducted under black-box access where the target ASR model only outputs the label — the translated text. The third auditor is conducted under no-label black-box access. Specifically, the target ASR model displays neither the confidence score nor the translated text explicitly. Instead, the ASR model passes the translated text to the system, and the system reacts directly based on the translated text's content.
Another technical work can be considered follow-up research, which investigates audio adversarial example generation under white-box access. By stealing the ML information, we know the details of the target ASR model. We proposed the Fast Adversarial Audio Generation (FAAG) method under white-box access to an ASR model. Instead of adding perturbations to the whole frame of the benign audio, FAAG adds noise to the beginning part of the benign audio, which speeds up the generation process significantly.
Additionally, future directions matching the various limitations of ML-based stealing attacks are suggested. Compared to explicit breaking/destroying attacks, the controlled information leaked by such stealing attacks is much more difficult to detect, so the estimated loss should be extended accordingly. This thesis, therefore, can help researchers familiarize themselves with these stealing attacks, their future trends, and the potential defense methods.
Chapters 3, 4 and 5 show that it is possible to infer the model's training set information even when limiting the available resources gained from black-box access. Specifically, the user-level membership inference is conducted with only label knowledge against the ASR system. If the ML model's information can be inferred by the attackers, an attack against this ML model under black-box access is similar to the attack under white-box access. It is much more convenient and effective for the attacker to launch any attack under the latter accessibility than the former one. Chapter 6 explored an efficient advanced adversarial machine learning attack under white-box access with limited computational resources.

Bibliography

[1] Sultan Alneyadi, Elankayer Sithirasenan, and Vallipuram Muthukkumarasamy. A survey on data
leakage prevention systems. Journal of Network and Computer Applications, 62(Feb):137–152,
2016.

[2] Mohammad Ahmadian and Dan Cristian Marinescu. Information leakage in cloud data warehouses.
IEEE Transactions on Sustainable Computing, pages 1–12, 2019.

[3] Long Cheng, Fang Liu, and Danfeng Yao. Enterprise data breach: causes, challenges, prevention,
and future directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery,
7(5):e1211, 2017.

[4] InfoWatch Analytics Center. Global data leakage report, 2017, 2018.

[5] Ponemon from IBM. 2018 cost of a data breach study: Global overview, 2018.

[6] Sam Smith from Juniper Research. Cybercrime will cost businesses over $2 trillion by 2019, 2015.

[7] Carlos Flavián and Miguel Guinalíu. Consumer trust, perceived security and privacy policy: three
basic elements of loyalty to a web site. Industrial Management & Data Systems, 106(5):601–620,
2006.

[8] Richard Kissel. Glossary of key information security terms. National Institute of Standards and
Technology (NIST) - Computer Security Resource Center, Gaithersburg, MD, US, 2013.

[9] Anthony Califano, Ersin Dincelli, and Sanjay Goel. Using features of cloud computing to defend
smart grid against ddos attacks. In Proceedings of the 10th Annual Symposium on Information
Assurance (Asia 15), pages 44–50, Albany, New York, 2015. NYS.

[10] Wenrui Diao, Xiangyu Liu, Zhou Li, and Kehuan Zhang. No pardon for the interruption: New
inference attacks on android through interrupt timing analysis. In Proceedings of the 2016 IEEE
Symposium on Security and Privacy (SP), pages 414–432, San Jose, CA, USA, 2016. IEEE.

[11] Wale Ogunwale. Lockdown am.getrunningappprocesses api with permission.real_get_tasks, 2016.

[12] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine
learning models via prediction apis. In Proceedings of the 25th USENIX Security Symposium
(USENIX Security 16), pages 601–618, Washington, D.C., USA, 2016. USENIX Association.

[13] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov,
Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In
Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pages 387–402, Prague, Czech Republic, 2013. Springer.

[14] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial
machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence,
pages 43–58, Chicago, Illinois, USA, 2011. ACM.

[15] Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the 11th ACM
SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647,
Chicago, Illinois, USA, 2005. ACM.

[16] Nedim Srndic and Pavel Laskov. Practical evasion of a learning-based classifier: A case study. In
Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP), pages 197–211, San Jose,
CA, USA, 2014. IEEE.

[17] Mauro Ribeiro, Katarina Grolinger, and Miriam AM Capretz. Mlaas: Machine learning as a
service. In Proceedings of the 2015 IEEE 14th International Conference on Machine Learning and
Applications (ICMLA), pages 896–902, Miami, FL, USA, 2015. IEEE.

[18] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit
confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Con-
ference on Computer and Communications Security (CCS), pages 1322–1333, Denver, Colorado,
USA, 2015. ACM.

[19] Mohamed Amine Ferrag, Leandros Maglaras, and Ahmed Ahmim. Privacy-preserving schemes for
ad hoc social networks: A survey. IEEE Communications Surveys & Tutorials, 19(4):3015–3045,
2017.

[20] R Barona and EA Mary Anita. A survey on data breach challenges in cloud computing security:
Issues and threats. In Proceedings of the 2017 International Conference on Circuit, Power and
Computing Technologies (ICCPCT), pages 1–8, Kollam, India, 2017. IEEE.

[21] Mordechai Guri and Yuval Elovici. Bridgeware: The air-gap malware. Communications of the
ACM, 61(4):74–82, 2018.

[22] Yong Zeng and Rui Zhang. Active eavesdropping via spoofing relay attack. In Proceedings of the
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
2159–2163, Shanghai, China, 2016. IEEE.

[23] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. Darpa
timit acoustic-phonetic continuous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon
Technical Report, 93, 1993.

[24] Muhammad Salman Khan, Sana Siddiqui, and Ken Ferens. A cognitive and concurrent cyber kill
chain model. Springer, Cham, 2018.

[25] Nan Sun, Jun Zhang, Paul Rimba, Shang Gao, Yang Xiang, and Leo Yu Zhang. Data-driven cyberse-
curity incident prediction: A survey. IEEE Communications Surveys & Tutorials, 21(2):1744–1772,
2019.

[26] Tarun Yadav and Arvind Mallari Rao. Technical aspects of cyber kill chain. In Proceedings of
the International Symposium on Security in Computing and Communication, pages 438–452, New
York, NY, 2015. Springer.

[27] Dennis Kiwia, Ali Dehghantanha, Kim-Kwang Raymond Choo, and Jim Slaughter. A cyber kill
chain based taxonomy of banking trojans for evolutionary computational intelligence. Journal of
computational science, 27:394–409, 2018.

[28] CW Dukes. Committee on national security systems (cnss) glossary. Technical report, Committee
on National Security Systems Instructions (CNSSI), 2015.

[29] B. Wang and N. Z. Gong. Stealing hyperparameters in machine learning. In Proceedings of the
2018 IEEE Symposium on Security and Privacy (SP), pages 36–52, San Francisco, CA, USA, 2018.
IEEE.

[30] Ben Gras, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. Translation leak-aside buffer:
Defeating cache side-channel protections with TLB attacks. In Proceedings of the 27th USENIX
Security Symposium (USENIX Security 18), pages 955–972, Baltimore, MD, USA, 2018. USENIX
Association.

[31] Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, and Xinyi Huang. Targeted online password
guessing: An underestimated threat. In Proceedings of the 2016 ACM SIGSAC Conference on
Computer and Communications Security (CCS), pages 1242–1254, Vienna, Austria, 2016. ACM.

[32] William Melicher, Blase Ur, Sean M Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin,
and Lorrie Faith Cranor. Fast, lean, and accurate: Modeling password guessability using neural
networks. In Proceedings of the 25th USENIX Security Symposium (USENIX Security 16), pages
175–191, Washington, D.C., USA, 2016. USENIX Association.

[33] Rafael Veras, Christopher Collins, and Julie Thorpe. On semantic patterns of passwords and their
security impact. In Proceedings of the 21st Annual Network and Distributed System Security
Symposium (NDSS), pages 1–16, San Diego, CA, USA, 2014. IEEE.

[34] Jingchao Sun, Xiaocong Jin, Yimin Chen, Jinxue Zhang, Yanchao Zhang, and Rui Zhang. Visible:
Video-assisted keystroke inference from tablet backside motion. In Proceedings of the 23rd Annual
Network and Distributed System Security Symposium (NDSS), pages 1–15, San Diego, CA, USA,
2016. IEEE.

[35] Xiangyu Liu, Zhe Zhou, Wenrui Diao, Zhou Li, and Kehuan Zhang. When good becomes evil:
Keystroke inference with smartwatch. In Proceedings of the 22nd ACM SIGSAC Conference on
Computer and Communications Security (CCS), pages 1273–1285, Denver, Colorado, USA, 2015.
ACM.

[36] Raphael Spreitzer, Felix Kirchengast, Daniel Gruss, and Stefan Mangard. Procharvester: Fully
automated analysis of procfs side-channel leaks on android. In Proceedings of the 2018 on Asia
Conference on Computer and Communications Security (AsiaCCS), pages 749–763, Incheon,
Republic of Korea, 2018. ACM.

[37] Xiaokuan Zhang, Xueqiang Wang, Xiaolong Bai, Yinqian Zhang, and XiaoFeng Wang. Os-level
side channels without procfs: Exploring cross-app information leakage on ios. In Proceedings of
the 25th Annual Network and Distributed System Security Symposium (NDSS), pages 1–15, San
Diego, CA, USA, 2018. IEEE.

[38] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference
attacks against machine learning models. In Proceedings of the 2017 IEEE Symposium on Security
and Privacy (SP), pages 3–18, San Jose, CA, USA, 2017. IEEE.

[39] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep models under the gan: in-
formation leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pages 603–618, Dallas, Texas,
USA, 2017. ACM.

[40] Ziqiao Zhou, Michael K Reiter, and Yinqian Zhang. A software approach to defeating side channels
in last-level caches. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and
Communications Security (CCS), pages 871–882, Vienna, Austria, 2016. ACM.

[41] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram
Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM
on Asia Conference on Computer and Communications Security (AsiaCCS), pages 506–519, Abu
Dhabi, United Arab Emirates, 2017. ACM.

[42] Seong Joon Oh, Max Augustin, Bernt Schiele, and Mario Fritz. Towards reverse-engineering
black-box neural networks. In Proceedings of the 6th International Conference on Learning
Representations (ICLR 2018), pages 1–20, Vancouver, BC, Canada, 2018. OpenReview.net.

[43] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes.
Ml-leaks: Model and data independent membership inference attacks and defenses on machine
learning models. In Proceedings of the 26th Annual Network and Distributed System Security
Symposium (NDSS), pages 1–15, San Diego, California, USA, 2019. IEEE.

[44] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unin-
tended feature leakage in collaborative learning. In Proceedings of the 2019 IEEE Symposium on
Security and Privacy (SP), pages 1–16, San Fransisco, CA, US, 2019. IEEE.

[45] Pranav Patel, Eamonn Keogh, Jessica Lin, and Stefano Lonardi. Mining motifs in massive time
series databases. In Proceedings of the 2002 IEEE International Conference on Data Mining
(ICDM), pages 370–377, Maebashi City, Japan, 2002. IEEE.

[46] Jessica Lin and Yuan Li. Finding structural similarity in time series data using bag-of-patterns rep-
resentation. In Proceedings of the International Conference on Scientific and Statistical Database
Management, pages 461–477, New Orleans, LA, USA, 2009. Springer.

[47] Avesta Hojjati, Anku Adhikari, Katarina Struckmann, Edward Chou, Thi Ngoc Tho Nguyen,
Kushagra Madan, Marianne S Winslett, Carl A Gunter, and William P King. Leave your phone at
the door: Side channels that reveal factory floor secrets. In Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pages 883–894, Vienna, Austria,
2016. ACM.

[48] Qiuyu Xiao, Michael K Reiter, and Yinqian Zhang. Mitigating storage side channels using statistical
privacy mechanisms. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and
Communications Security (CCS), pages 1582–1594, Denver, Colorado, USA, 2015. ACM.

[49] Amit Kumar Sikder, Hidayet Aksu, and A Selcuk Uluagac. 6thsense: A context-aware sensor-based
attack detector for smart devices. In Proceedings of the 26th USENIX Security Symposium (USENIX
Security 17), pages 397–414, Vancouver, BC, Canada, 2017. USENIX Association.

[50] Ninghui Li, Wahbeh Qardaji, Dong Su, Yi Wu, and Weining Yang. Membership privacy: a unifying
framework for privacy definitions. In Proceedings of the 2013 ACM SIGSAC Conference on
Computer and Communications Security (CCS), pages 889–900, Berlin, Germany, 2013. ACM.

[51] Karan Ganju, Qi Wang, Wei Yang, Carl A Gunter, and Nikita Borisov. Property inference attacks
on fully connected neural networks using permutation invariant representations. In Proceedings
of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages
619–633, Toronto, ON, Canada, 2018. ACM.

[52] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar,
and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security (CCS), pages 308–318, Vienna, Austria,
2016. ACM.

[53] AMAZON ML SERVICES. Amazon aws machine learning, 2019.

[54] Microsoft. Azure machine learning studio, 2019.

[55] Google. Predictive analytics - cloud machine learning engine, 2019.

[56] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial
examples. In Proceedings of the 3rd International Conference on Learning Representations (ICLR
2015), pages 1–11, San Diego, CA, USA, 2015. OpenReview.net.

[57] UCIdataset. Uci machine learning repository, 2018.

[58] Tom W Smith, Peter Marsden, Michael Hout, and Jibum Kim. The general social surveys. Technical
report, National Opinion Research Center at the University of Chicago, 2012.

[59] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier
Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn:
Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.

[60] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The mnist database of handwritten
digits, 2011.

[61] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Bench-
marking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332,
2012.

[62] BigML. Machine learning made beautifully simple for everyone, 2019.

[63] Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart.
Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In
Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14), pages 17–32, San
Diego, CA, USA, 2014. USENIX Association.

[64] Ferdinando S Samaria and Andy C Harter. Parameterisation of a stochastic model for human face
identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision,
pages 138–142, Sarasota, FL, USA, 1994. IEEE.

[65] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.
Technical report, Citeseer, 2009.

[66] Kaggle Inc. Acquire valued shoppers challenge, 2014.

[67] Dingqi Yang, Daqing Zhang, and Bingqing Qu. Participatory cultural mapping based on collective
behavior data in location-based social networks. ACM Transactions on Intelligent Systems and
Technology (TIST), 7(3):30:1–30:23, 2016.

[68] Texas Health and Human Service. Hospital discharge data public use data file, 2018.

[69] Li Deng. The mnist database of handwritten digit images for machine learning research [best of
the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[70] Kaggle Inc. 20 newsgroups, 2017.

[71] Erik Learned-Miller, Gary B Huang, Aruni RoyChowdhury, Haoxiang Li, and Gang Hua. Labeled
faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis, pages
189–248. Springer, New York, NY, 2016.

[72] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild.
In Proceedings of the International Conference on Computer Vision (ICCV), pages 3730–3738,
New York, NY, December 2015. IEEE.

[73] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In
Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), pages 343–
347, New York, NY, 2014. IEEE.

[74] Ning Zhang, Manohar Paluri, Yaniv Taigman, Rob Fergus, and Lubomir Bourdev. Beyond frontal
faces: Improving person recognition using multiple cues. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages 4804–4813, New York, NY, 2015.
IEEE.

[75] Yelp. Yelp open dataset, 2014.

[76] Ben Verhoeven and Walter Daelemans. Clips stylometry investigation (csi) corpus: a dutch corpus
for the detection of age, gender, personality, sentiment and deception in text. In Proceedings of the
9th International Conference on Language Resources and Evaluation (LREC), pages 3081–3085,
Reykjavik, Iceland, 2014. European Languages Resources Association (ELRA).

[77] Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine learning with membership privacy using
adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and
Communications Security (CCS), pages 634–646, Toronto, Canada, 2018. ACM.

[78] Peter Muennig, Gretchen Johnson, Jibum Kim, Tom W Smith, and Zohn Rosen. The general social
survey-national death index: an innovative new dataset for the social sciences. BMC research notes,
4(1):1–6, 2011.

[79] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd
ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1310–1321,
Denver, Colorado, USA, 2015. ACM.

[80] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine
Learning Research, 15(1):1929–1958, 2014.

[81] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical
risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

[82] Prateek Jain, Vivek Kulkarni, Abhradeep Thakurta, and Oliver Williams. To drop or not to
drop: Robustness, consistency and differential privacy properties of dropout. arXiv preprint
arXiv:1503.02031, 2015.

[83] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.

[84] Giuseppe Ateniese, Luigi V. Mancini, Angelo Spognardi, Antonio Villani, Domenico Vitali, and
Giovanni Felici. Hacking smart machines with smarter ones: How to extract meaningful data from
machine learning classifiers. Int. J. Secur. Netw., 10(3):137–150, 2015.

[85] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov,
and Alexander J Smola. Deep sets. In Proceedings of the Advances in Neural Information
Processing Systems, pages 3391–3401, Long Beach, CA, USA, 2017. Curran Associates, Inc.

[86] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient
learning of deep networks from decentralized data. In Proceedings of the 20th International
Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, Fort Lauderdale,
FL, USA, 2017. PMLR.

[87] Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-
supervised knowledge transfer for deep learning from private training data. In Proceedings of
the 5th International Conference on Learning Representations (ICLR 2017), pages 1–16, Toulon,
France, 2017. OpenReview.net.

[88] Mathias Lecuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang, and Siddhartha Sen. Pyramid:
Enhancing selectivity in big data protection with count featurization. In Proccedings of the 2017
IEEE Symposium on Security and Privacy (SP), pages 78–95, San Jose, CA, USA, 2017. IEEE.

[89] Yang Tang, Phillip Ames, Sravan Bhamidipati, Ashish Bijlani, Roxana Geambasu, and Nikhil
Sarda. Cleanos: Limiting mobile data exposure with idle eviction. In Proceedings of the USENIX
Symposium on Operating Systems Design and Implementation (OSDI), pages 77–91, Hollywood,
CA, USA, 2012. USENIX.

[90] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity
in private data analysis. In Proceedings of the Theory of Cryptography Conference, pages 265–284,
New York, NY, USA, 2006. Springer.

[91] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P Wellman. Sok: Security and
privacy in machine learning. In Proceedings of the 2018 IEEE European Symposium on Security
and Privacy (EuroS&P), pages 399–414, London, UK, 2018. IEEE.

[92] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields,
David Wagner, and Wenchao Zhou. Hidden voice commands. In Proceedings of the 25th USENIX
Security Symposium (USENIX Security’16), pages 513–530, 2016.

[93] Paul Lamere, Philip Kwok, William Walker, Evandro Gouvea, Rita Singh, Bhiksha Raj, and
Peter Wolf. Design of the cmu sphinx-4 decoder. In Eighth European Conference on Speech
Communication and Technology, 2003.

[94] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-
to-text. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW’18), pages 1–7.
IEEE, 2018.

[95] Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. Dol-
phinattack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on
Computer and Communications Security (CCS’17), pages 103–117, New York, NY, USA, 2017.

[96] Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, and Colin Raffel. Imperceptible,
robust, and targeted adversarial examples for automatic speech recognition. In Proceedings of the
International Conference on Machine Learning (ICML’19), pages 5231–5240, 2019.

[97] Hiromu Yakura and Jun Sakuma. Robust audio adversarial example for a physical attack. In
Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19), pages
5334–5341. ijcai.org, 2019.

[98] Lea Schönherr, Thorsten Eisenhofer, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Imperio:
Robust over-the-air adversarial examples for automatic speech recognition systems. In Annual
Computer Security Applications Conference, pages 843–855, 2020.

[99] Maximilian Christ, Andreas W Kempa-Liehr, and Michael Feindt. Distributed and parallel time
series feature extraction for industrial big data applications, 2016.

[100] G David Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.

[101] Nicholas D Lane, Emiliano Miluzzo, Hong Lu, Daniel Peebles, Tanzeem Choudhury, and Andrew T
Campbell. A survey of mobile phone sensing. IEEE Communications magazine, 48(9):140–150,
2010.

[102] Nicholas D Lane, Ye Xu, Hong Lu, Shaohan Hu, Tanzeem Choudhury, Andrew T Campbell,
and Feng Zhao. Enabling large-scale human activity inference on smartphones using community
similarity networks (csn). In Proceedings of the 13th International Conference on Ubiquitous
Computing, pages 355–364, Beijing, China, 2011. ACM.

[103] Bong-Won Park and Kun Chang Lee. The effect of users’ characteristics and experiential factors
on the compulsive usage of the smartphone. In Proceedings of the International Conference
on Ubiquitous Computing and Multimedia Applications, pages 438–446, Daejeon, Korea, 2011.
Springer.

[104] Yan Yu, Jianhua Wang, and Guohui Zhou. The exploration in the education of professionals in
applied internet of things engineering. In Proceedings of the 4th International Conference on
Distance Learning and Education (ICDLE), pages 74–77, San Juan, PR, USA, 2010. IEEE.

[105] Elsa Macias, Alvaro Suarez, and Jaime Lloret. Mobile sensing systems. Sensors, 13(12):17292–
17321, 2013.

[106] Nan Zhang, Kan Yuan, Muhammad Naveed, Xiaoyong Zhou, and XiaoFeng Wang. Leave me
alone: App-level protection against runtime information gathering on android. In Proceedings of
the 2015 IEEE Symposium on Security and Privacy (SP), pages 915–930, San Jose, CA, USA,
2015. IEEE.

[107] Orcan Alpar. Frequency spectrograms for biometric keystroke authentication using neural network
based classifier. Knowledge-Based Systems, 116(Jan):163–171, 2017.

[108] Sowndarya Krishnamoorthy, Luis Rueda, Sherif Saad, and Haytham Elmiligi. Identification of
user behavioral biometrics for authentication using keystroke dynamics and machine learning. In
Proceedings of the 2018 2nd International Conference on Biometric Engineering and Applications,
pages 50–57, Amsterdam, The Netherlands, 2018. ACM.

[109] Pei-Yuan Wu, Chi-Chen Fang, Jien Morris Chang, and Sun-Yuan Kung. Cost-effective kernel ridge
regression implementation for keystroke-based active authentication system. IEEE transactions on
cybernetics, 47(11):3916–3927, 2017.

[110] Adam Goodkind, David Guy Brizan, and Andrew Rosenberg. Utilizing overt and latent linguistic
structure to improve keystroke-based authentication. Image and Vision Computing, 58(Feb):230–
238, 2017.

[111] Liang Cai and Hao Chen. Touchlogger: Inferring keystrokes on touch screen from smartphone
motion. In Proceedings of the 6th USENIX Workshop on Hot Topics in Security (HotSec’11), pages
9–15, San Francisco, CA, USA, 2011. USENIX Association.

[112] Zhi Xu, Kun Bai, and Sencun Zhu. Taplogger: Inferring user inputs on smartphone touchscreens
using on-board motion sensors. In Proceedings of the 5th ACM Conference on Security and Privacy
in Wireless and Mobile Networks, pages 113–124, Tucson, AZ, USA, 2012. ACM.

[113] Emiliano Miluzzo, Alexander Varshavsky, Suhrid Balakrishnan, and Romit Roy Choudhury. Tap-
prints: your finger taps have fingerprints. In Proceedings of the 10th International Conference on
Mobile Systems, Applications, and Services, pages 323–336, Ambleside, UK, 2012. ACM.

[114] Yigael Berger, Avishai Wool, and Arie Yeredor. Dictionary attacks using keyboard acoustic emana-
tions. In Proceedings of the 13th ACM SIGSAC Conference on Computer and Communications
Security (CCS), pages 245–254, Alexandria, Virginia, USA, 2006. ACM.

[115] Michael Backes, Markus Dürmuth, and Dominique Unruh. Compromising reflections-or-how to
read lcd monitors around the corner. In Proceedings of the 2008 IEEE Symposium on Security and
Privacy (SP), pages 158–169, Oakland, CA, USA, 2008. IEEE.

[116] Rongmao Chen, Yi Mu, Guomin Yang, Fuchun Guo, and Xiaofen Wang. Dual-server public-key
encryption with keyword search for secure cloud storage. IEEE Transactions on Information
Forensics and Security, 11(4):789–798, 2016.

[117] Venkata Koppula, Omkant Pandey, Yannis Rouselakis, and Brent Waters. Deterministic public-key
encryption under continual leakage. In Proceedings of the International Conference on Applied
Cryptography and Network Security, pages 304–323, Guildford, UK, 2016. Springer.

[118] Zheng Yan and Mingjun Wang. Protect pervasive social networking based on two-dimensional
trust levels. IEEE Systems Journal, 11(1):207–218, 2017.

[119] L Yu Paul, Gunjan Verma, and Brian M Sadler. Wireless physical layer authentication via fingerprint
embedding. IEEE Communications Magazine, 53(6):48–53, 2015.

[120] Debiao He, Sherali Zeadally, Neeraj Kumar, and Jong-Hyouk Lee. Anonymous authentication
for wireless body area networks with provable security. IEEE Systems Journal, 11(4):2590–2601,
2017.

[121] Qi Jiang, Sherali Zeadally, Jianfeng Ma, and Debiao He. Lightweight three-factor authentication and
key agreement protocol for internet-integrated wireless sensor networks. IEEE Access, 5(Mar):3376–
3392, 2017.

[122] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, and Ruby B Lee.
Catalyst: Defeating last-level cache side channel attacks in cloud computing. In Proceedings of the
2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages
406–418, Barcelona, Spain, 2016. IEEE.

[123] Daniel Gruss, Julian Lettner, Felix Schuster, Olya Ohrimenko, Istvan Haller, and Manuel Costa.
Strong and efficient cache side-channel protection using hardware transactional memory. In
Proccedings of the 26th USENIX Security Symposium (USENIX Security 17), pages 217–233,
Vancouver, BC, Canada, 2017. USENIX Association.

[124] Matt Weir, Sudhir Aggarwal, Breno De Medeiros, and Bill Glodek. Password cracking using
probabilistic context-free grammars. In Proceedings of the 2009 IEEE Symposium on Security and
Privacy (SP), pages 391–405, Berkeley, CA, USA, 2009. IEEE.

[125] Jerry Ma, Weining Yang, Min Luo, and Ninghui Li. A study of probabilistic password models. In
Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP), pages 689–704, San Jose,
CA, USA, 2014. IEEE.

[126] Blase Ur, Sean M Segreti, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, Saranga Komanduri,
Darya Kurilova, Michelle L Mazurek, William Melicher, and Richard Shay. Measuring real-
world accuracies and biases in modeling password guessability. In Proceedings of the 24th
USENIX Security Symposium (USENIX Security 15), pages 463–481, Washington, D.C., USA,
2015. USENIX Association.

[127] Patrick Gage Kelley, Saranga Komanduri, Michelle L Mazurek, Richard Shay, Timothy Vidas, Lujo
Bauer, Nicolas Christin, Lorrie Faith Cranor, and Julio Lopez. Guess again (and again and again):
Measuring password strength by simulating password-cracking algorithms. In Proceedings of the
2012 IEEE Symposium on Security and Privacy (SP), pages 523–537, San Francisco, CA, USA,
2012. IEEE.

[128] Richard Shay, Saranga Komanduri, Adam L Durity, Phillip Seyoung Huh, Michelle L Mazurek,
Sean M Segreti, Blase Ur, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. Can long
passwords be secure and usable? In Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems, pages 2927–2936, Toronto, ON, Canada, 2014. ACM.

[129] Michelle L Mazurek, Saranga Komanduri, Timothy Vidas, Lujo Bauer, Nicolas Christin, Lor-
rie Faith Cranor, Patrick Gage Kelley, Richard Shay, and Blase Ur. Measuring password guessability
for an entire university. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and
Communications Security (CCS), pages 173–186, Berlin, Germany, 2013. ACM.

[130] Thomas Brewster. 13 million passwords appear to have leaked from this free web host, 2015.

[131] Saranga Komanduri. Modeling the adversary to evaluate password strength with limited samples.
PhD thesis, School of Computer Science, Carnegie Mellon University, 2016.

[132] Yue Li, Haining Wang, and Kun Sun. A study of personal information in human-chosen passwords
and its security implications. In Proceedings of the 35th Annual IEEE International Conference on
Computer Communications (INFOCOM), pages 1–9, San Francisco, CA, USA, 2016. IEEE.

[133] Joseph Bonneau. The science of guessing: Analyzing an anonymized corpus of 70 million
passwords. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (SP), pages
538–552, San Francisco, CA, USA, 2012. IEEE.

[134] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and XiaoFeng Wang. The tangled
web of password reuse. In Proceedings of the 21st Annual Network and Distributed System Security
Symposium (NDSS), pages 1–15, San Diego, CA, USA, 2014. IEEE.

[135] Himanshu Raj, Ripal Nathuji, Abhishek Singh, and Paul England. Resource management for
isolation enhanced cloud services. In Proceedings of the 2009 ACM workshop on Cloud Computing
Security, pages 77–84, Chicago, Illinois, USA, 2009. ACM.

[136] Christopher D Manning and Hinrich Schütze. Foundations of statistical natural language processing.
MIT Press, London, UK, 1999.

[137] Nan Zhang, System Security Lab, Indiana University. App Guardian: An app-level protection against
RIG attacks, 2015.

[138] Cynthia Dwork. Differential privacy: A survey of results. In Proceedings of the International
Conference on Theory and Applications of Models of Computation, pages 1–19, Xi’an, China, 2008.
Springer.

[139] Geetha Jagannathan, Krishnan Pillaipakkamnatt, and Rebecca N Wright. A practical differentially
private random decision tree classifier. In Proceedings of the IEEE International Conference on
Data Mining Workshops (ICDMW’09), pages 114–121, Miami, Florida, USA, 2009. IEEE.

[140] Staal A Vinterbo. Differentially private projected histograms: Construction and use for prediction.
In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery
in Databases, pages 19–34, Bristol, UK, 2012. Springer.

[141] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as
a defense to adversarial perturbations against deep neural networks. In Proceedings of the 2016
IEEE Symposium on Security and Privacy (SP), pages 582–597, San Jose, CA, USA, 2016. IEEE.

[142] European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the
European Parliament and of the Council of 27 April 2016 on the protection of natural persons with
regard to the processing of personal data and on the free movement of such data and repealing
Directive 95/46/EC (general data protection regulation). Official Journal of the European Union,
119:1–88, 2016.

[143] E. McReynolds, S. Hubbard, T. Lau, A. Saraf, M. Cakmak, and F. Roesner. Toys that listen: A
study of parents, children, and Internet-connected toys. In Proceedings of the 2017 CHI Conference
on Human Factors in Computing Systems, pages 5197–5207. ACM, 2017.

[144] S. Lokesh, P. K. Malarvizhi, M. D. Ramya, P. Parthasarathy, and C. Gokulnath. An automatic Tamil
speech recognition system by using bidirectional recurrent neural network with self-organizing
map. Neural Computing and Applications, pages 1–11, 2018.

[145] M. Mehrabani, S. Bangalore, and B. Stern. Personalized speech recognition for Internet of Things.
In Proceedings of the 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), pages 369–374.
IEEE, 2015.

[146] S. Nick. Amazon may give app developers access to Alexa audio recordings, 2017.

[147] Minhui Xue, Gabriel Magno, Evandro Cunha, Virgilio Almeida, and Keith W Ross. The right to
be forgotten in the media: A data-driven study. Proceedings on Privacy Enhancing Technologies,
2016(4):389–402, 2016.

[148] BBC. HMRC forced to delete five million voice files, 2019.

[149] W. Kyle. How Amazon, Apple, Google, Microsoft, and Samsung treat your voice data, 2019.

[150] P. Sarah. 41% of voice assistant users have concerns about trust and privacy, report finds, 2019.

[151] M. Sapna. Hey, Alexa, what can you hear? and what will you do with it?, 2018.

[152] CCTV. Beware of WeChat voice scams: “cloning” users after WeChat voice, 2018.

[153] C. Song and V. Shmatikov. The natural auditor: How to tell if someone used your words to train
their model. arXiv preprint arXiv:1811.00513, 2018.

[154] M. Shokoohi-Yekta, Y. Chen, B. Campana, B. Hu, J. Zakaria, and E. Keogh. Discovery of
meaningful rules in time series. In Proceedings of the 21st ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (KDD), pages 1085–1094. ACM, 2015.

[155] BBC. Plan to secure Internet of Things with new law, 2019.

[156] BBC. Smart device security guidelines “need more teeth”, 2018.

[157] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, Jonathan Le Roux, John R Hershey, and Björn
Schuller. Speech enhancement with LSTM recurrent neural networks and its application to noise-
robust ASR. In Proceedings of the International Conference on Latent Variable Analysis and
Signal Separation, pages 91–99. Springer, 2015.

[158] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa. Adversarial attacks against automatic
speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665, 2018.

[159] M. Ravanelli, T. Parcollet, and Y. Bengio. The PyTorch-Kaldi speech recognition toolkit. In
Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 6465–6469. IEEE, 2019.

[160] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-dependent deep
neural networks. In Proceedings of the 12th Annual Conference of the International Speech
Communication Association, 2011.

[161] M. J. F. Gales, K. M. Knill, A. Ragni, and S. P. Rath. Speech recognition and keyword spotting for
low-resource languages: Babel project research at CUED. In Proceedings of the 4th International
Workshop on Spoken Language Technologies for Under-Resourced Languages, 2014.

[162] M. Dutta, C. Patgiri, M. Sarma, and K. K. Sarma. Closed-set text-independent speaker identification
system using multiple ANN classifiers. In Proceedings of the 3rd International Conference on
Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014, pages 377–385.
Springer, 2015.

[163] M. Sodanil, S. Nitsuwat, and C. Haruechaiyasak. Thai word recognition using hybrid MLP-HMM.
International Journal of Computer Science and Network Security, 10(3):103–110, 2010.

[164] C. Canevari, L. Badino, L. Fadiga, and G. Metta. Cross-corpus and cross-linguistic evaluation of a
speaker-dependent DNN-HMM ASR system using EMA data. In Proceedings of the Speech Production
in Automatic Speech Recognition Conference, 2013.

[165] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh,
S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv
preprint arXiv:1412.5567, 2014.

[166] H. Zhang, L. Xiao, W. Chen, Y. Wang, and Y. Jin. Multi-task label embedding for text classification.
arXiv preprint arXiv:1710.07210, 2017.

[167] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek,
Y. Qian, P. Schwarz, et al. The Kaldi speech recognition toolkit. Technical report, IEEE Signal
Processing Society, 2011.

[168] Definitions under CCPA. California Consumer Privacy Act (CCPA) website policy, 2020.

[169] A. Hern. Apple contractors ’regularly hear confidential details’ on Siri recordings, 2019.

[170] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. LOGAN: Membership
inference attacks against generative models. Proceedings on Privacy Enhancing Technologies,
2019(1):133–152, 2019.

[171] Congzheng Song and Vitaly Shmatikov. Auditing data provenance in text-generation models. In
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining (KDD), pages 196–206, 2019.

[172] Yuxuan Chen, Xuejing Yuan, Jiangshan Zhang, Yue Zhao, Shengzhi Zhang, Kai Chen, and
XiaoFeng Wang. Devil’s whisper: A general approach for physical adversarial attacks against
commercial black-box speech recognition devices. In Proceedings of the 29th USENIX Security
Symposium (USENIX Security 20), 2020.

[173] Tianyu Du, Shouling Ji, Jinfeng Li, Qinchen Gu, Ting Wang, and Raheem Beyah. SirenAttack:
Generating adversarial audio for end-to-end acoustic systems. arXiv preprint arXiv:1901.07846,
2019.

[174] Juan M Perero-Codosero, Javier Antón-Martín, Daniel Tapias Merino, Eduardo López Gonzalo,
and Luis A Hernández-Gómez. Exploring open-source deep learning ASR for speech-to-text TV
program transcription. In Proceedings of the IberSPEECH, pages 262–266, 2018.

[175] Alexander Liu, Hung-yi Lee, and Lin-shan Lee. Adversarial training of end-to-end speech recogni-
tion using a criticizing language model. In Proceedings of the 2019 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[176] Gaoyang Liu, Chen Wang, Kai Peng, Haojun Huang, Yutong Li, and Wenqing Cheng. SocInf: Mem-
bership inference attacks on social media health data with machine learning. IEEE Transactions on
Computational Social Systems, 6(5):907–921, 2019.

[177] Liwei Song, Reza Shokri, and Prateek Mittal. Privacy risks of securing machine learning models
against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer
and Communications Security (CCS), pages 241–257, 2019.

[178] Farhad Farokhi and Mohamed Ali Kaafar. Modelling and quantifying membership information
leakage in machine learning. arXiv preprint arXiv:2001.10648, 2020.

[179] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word
representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543, 2014.

[180] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR
corpus based on public domain audio books. In Proceedings of the 2015 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP’15), pages 5206–5210. IEEE,
2015.

[181] Anthony Rousseau, Paul Deléglise, and Yannick Esteve. TED-LIUM: An automatic speech recognition
dedicated corpus. In Proceedings of the International Conference on Language Resources and
Evaluation (LREC), pages 125–129, 2012.

[182] Jianwei Qian, Haohua Du, Jiahui Hou, Linlin Chen, Taeho Jung, and Xiangyang Li. Speech sani-
tizer: Speech content desensitization and voice anonymization. IEEE Transactions on Dependable
and Secure Computing, 2019.

[183] David Sundermann and Hermann Ney. VTLN-based voice conversion. In Proceedings of the 3rd
IEEE International Symposium on Signal Processing and Information Technology (IEEE Cat. No.
03EX795), pages 556–559. IEEE, 2003.

[184] Brij Mohan Lal Srivastava, Aurélien Bellet, Marc Tommasi, and Emmanuel Vincent. Privacy-
preserving adversarial representation learning in ASR: Reality or illusion? arXiv preprint
arXiv:1911.04913, 2019.

[185] Andreas Nautsch, Abelino Jiménez, Amos Treiber, Jascha Kolberg, Catherine Jasserand, Els Kindt,
Héctor Delgado, Massimiliano Todisco, Mohamed Amine Hmani, Aymen Mtibaa, et al. Preserving
privacy in speaker and speech characterisation. Computer Speech & Language, 58:441–480, 2019.

[186] Yunhui Long, Vincent Bindschaedler, Lei Wang, Diyue Bu, Xiaofeng Wang, Haixu Tang, Carl A
Gunter, and Kai Chen. Understanding membership inferences on well-generalized learning models.
arXiv preprint arXiv:1802.04889, 2018.

[187] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine
learning: Analyzing the connection to overfitting. In Proceedings of the 2018 IEEE 31st Computer
Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018.

[188] Faysal Hossain Shezan, Hang Hu, Jiamin Wang, Gang Wang, and Yuan Tian. Read between the
lines: An empirical measurement of sensitive applications of voice personal assistant systems. In
Proceedings of the Web Conference, WWW ’20. ACM, 2020.

[189] Yu-Chih Tung and Kang G Shin. Exploiting sound masking for audio privacy in smartphones. In
Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, pages
257–268, 2019.

[190] Hafiz Malik. Securing voice-driven interfaces against fake (cloned) audio attacks. In Proceedings
of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages
512–517. IEEE, 2019.

[191] Francis Tom, Mohit Jain, and Prasenjit Dey. End-to-end audio replay attack detection using
deep convolutional networks with attention. In Proceedings of the Interspeech Conference, pages
681–685, 2018.

[192] Nan Zhang, Xianghang Mi, Xuan Feng, XiaoFeng Wang, Yuan Tian, and Feng Qian. Dangerous
skills: Understanding and mitigating security risks of voice-controlled third-party functions on
virtual personal assistant systems. In Proceedings of the 40th IEEE Symposium on Security and
Privacy (S&P’19), pages 1381–1396. IEEE, 2019.

[193] Pedro Saleiro, Benedict Kuester, Loren Hinkson, Jesse London, Abby Stevens, Ari Anisfeld,
Kit T Rodolfa, and Rayid Ghani. Aequitas: A bias and fairness audit toolkit. arXiv preprint
arXiv:1811.05577, 2018.

[194] Peter Schulam and Suchi Saria. Can you trust this prediction? Auditing pointwise reliability after
learning. arXiv preprint arXiv:1901.00403, 2019.

[195] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions.
In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
1885–1894. JMLR. org, 2017.

[196] Philip Adler, Casey Falk, Sorelle A Friedler, Tionney Nix, Gabriel Rybeck, Carlos Scheidegger,
Brandon Smith, and Suresh Venkatasubramanian. Auditing black-box models for indirect influence.
Knowledge and Information Systems, 54(1):95–122, 2018.

[197] Yuantian Miao, Benjamin Zi Hao Zhao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Dali Kaafar,
and Yang Xiang. The audio auditor: Participant-level membership inference in voice-based IoT. In
CCS Workshop on Privacy Preserving Machine Learning (PPML), 2019.

[198] S. Wildstrom. Nuance exec on iPhone 4S, Siri, and the future of speech, 2011.

[199] Deepak Kumar, Riccardo Paccagnella, Paul Murley, Eric Hennenfent, Joshua Mason, Adam Bates,
and Michael Bailey. Skill squatting attacks on Amazon Alexa. In Proceedings of the 27th USENIX
Security Symposium (USENIX Security 18), pages 33–47, 2018.

[200] Nathan Malkin, Joe Deatrick, Allen Tong, Primal Wijesekera, Serge Egelman, and David Wag-
ner. Privacy attitudes of smart speaker users. Proceedings on Privacy Enhancing Technologies,
2019(4):250–271, 2019.

[201] G. Benjamin. Amazon echo’s privacy issues go way beyond voice recordings, 2020.

[202] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Héctor Delgado, Andreas
Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee. ASVspoof 2019:
Future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441, 2019.

[203] Christopher A Choquette Choo, Florian Tramer, Nicholas Carlini, and Nicolas Papernot. Label-only
membership inference attacks. arXiv preprint arXiv:2007.14321, 2020.

[204] Zheng Li and Yang Zhang. Label-leaks: Membership inference attack with label. arXiv preprint
arXiv:2007.15528, 2020.

[205] SeatGeek. FuzzyWuzzy: Fuzzy string matching in Python, 2020.

[206] Chuck Martin. Voice assistant usage seen growing to 8.4 billion devices, April 2020.

[207] Hadi Abdullah, Kevin Warren, Vincent Bindschaedler, Nicolas Papernot, and Patrick Traynor. SoK:
The faults in our ASRs: An overview of attacks against automatic speech recognition and speaker
identification systems. In Proceedings of the 42nd IEEE Symposium on Security and Privacy
(S&P’21). IEEE, 2021.

[208] Minghao Wang, Tianqing Zhu, Tao Zhang, Jun Zhang, Shui Yu, and Wanlei Zhou. Security and
privacy in 6g networks: New areas and new challenges. Digital Communications and Networks,
2020, DOI: 10.1016/j.dcan.2020.07.003.

[209] Liu Liu, Olivier De Vel, Qing-Long Han, Jun Zhang, and Yang Xiang. Detecting and preventing
cyber insider threats: A survey. IEEE Communications Surveys & Tutorials, 20(2):1397–1417,
2018.

[210] Yuantian Miao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Benjamin Zi Hao Zhao, Dali Kaafar,
and Yang Xiang. The audio auditor: User-level membership inference in Internet of Things voice
services. Proceedings on Privacy Enhancing Technologies, 2021:209–228, 2021.

[211] Guanjun Lin, Sheng Wen, QingLong Han, Jun Zhang, and Yang Xiang. Software vulnerability
detection using deep neural networks: A survey. Proceedings of the IEEE, 108(10):1825–1848,
2020.

[212] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. Intriguing properties of neural networks. In Proceedings of the International
Conference on Learning Representations (ICLR’14), 2014.

[213] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep
structured prediction models. In Proceedings of the 31st Annual Conference on Neural Information
Processing Systems (NeurIPS’17), Long Beach, CA, USA, 2017.

[214] Rory Coulter, Qing-Long Han, Lei Pan, Jun Zhang, and Yang Xiang. Data-driven cyber security in
perspective–intelligent traffic analysis. IEEE Transactions on Cybernetics, 50(7):3081–3093, 2020.

[215] Nan Sun, Jun Zhang, Paul Rimba, Shang Gao, Leo Yu Zhang, and Yang Xiang. Data-driven cyberse-
curity incident prediction: A survey. IEEE Communications Surveys & Tutorials, 21(2):1744–1772,
2019.

[216] Xiao Chen, Chaoran Li, Derui Wang, Sheng Wen, Jun Zhang, Surya Nepal, Yang Xiang, and Kui
Ren. Android HIV: A study of repackaging malware for evading machine-learning detection. IEEE
Transactions on Information Forensics and Security, 15:987–1001, 2020.

[217] Junyang Qiu, Jun Zhang, Lei Pan, Wei Luo, Surya Nepal, and Yang Xiang. A survey of android
malware detection with deep neural models. ACM Computing Surveys, 53(6), article no. 126, 2020.

[218] Yuantian Miao, Chao Chen, Lei Pan, Qing-Long Han, Jun Zhang, and Yang Xiang. Machine
learning based cyber attacks targeting on controlled information: A survey. ACM Computing
Surveys, accepted, 23/04/2021.

[219] Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Logan Blue, Kevin Warren,
Anurag Swarnim Yadav, Tom Shrimpton, and Patrick Traynor. Hear “no evil”, see “Kenansville”:
Efficient and transferable black-box attacks on speech recognition and voice identification systems.
In Proceedings of the 42nd IEEE Symposium on Security and Privacy (S&P’21). IEEE, 2021.

[220] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg,
Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech
2: End-to-end speech recognition in English and Mandarin. In Proceedings of the International
Conference on Machine Learning (ICML’16), pages 173–182, 2016.

[221] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist
temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In
Proceedings of the 23rd International Conference on Machine Learning (ICML’06), pages 369–376,
2006.

[222] Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang,
Heqing Huang, XiaoFeng Wang, and Carl A Gunter. CommanderSong: A systematic approach for
practical adversarial voice recognition. In Proceedings of the 27th USENIX Security Symposium
(USENIX Security’18), pages 49–64, 2018.

[223] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating
adversarial examples with adversarial networks. In Proceedings of the Twenty-Seventh International
Joint Conference on Artificial Intelligence (IJCAI’18), pages 3905–3911, 2018.

[224] Yuan Gong and Christian Poellabauer. Crafting adversarial examples for speech paralinguistics
applications. In Proceedings of the 2018 DYnamic and Novel Advances in Machine Learning and
Intelligent Cyber Security Workshop (DYNAMICS’18), 2018.

[225] Felix Kreuk, Yossi Adi, Moustapha Cisse, and Joseph Keshet. Fooling end-to-end speaker verifi-
cation with adversarial examples. In Proceedings of the 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP’18), pages 1962–1966. IEEE, 2018.

[226] Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. Targeted adversarial examples
for black box audio systems. In Proceedings of the 2019 IEEE Security and Privacy Workshops
(SPW’19), pages 15–20. IEEE, 2019.

[227] Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. Did you hear that? Adversarial examples
against automatic speech recognition. arXiv preprint arXiv:1801.00554, 2018.

[228] Healthwise. Harmful noise levels, 2019.

[229] Mozilla. DeepSpeech 0.4.1, January 2019.

[230] Jason Cipriani. The complete list of ’OK, Google’ commands, July 2016.

[231] Jason Cipriani and J. Sarah Purewal. The complete list of Siri commands, November 2017.

[232] Melanie Weir. A comprehensive list of Siri voice commands you can use on an iPhone, November
2020.

[233] Snehkumar Shahani, Jibi Abraham, and R Venkateswaran. Distributed data aggregation with privacy
preservation at endpoint. In Proceedings of the IEEE International Conference on Management of
Data, pages 1–9, Chennai, India, 2017. IEEE.

[234] Farah Chanchary, Yomna Abdelaziz, and Sonia Chiasson. Privacy concerns amidst OBA and the
need for alternative models. IEEE Internet Computing, 22(Apr):52–61, 2018.

[235] Rory Coulter, Qing-Long Han, Lei Pan, Jun Zhang, and Yang Xiang. Data driven cyber security in
perspective — intelligent traffic analysis. IEEE Transactions on Cybernetics, accepted, to appear
in 2020.

[236] Ambika Kaul, Saket Maheshwary, and Vikram Pudi. Autolearn—automated feature generation and
selection. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM),
pages 217–226, New Orleans, LA, USA, 2017. IEEE.

[237] Jun Zhang, Xiao Chen, Yang Xiang, Wanlei Zhou, and Jie Wu. Robust network traffic classification.
IEEE/ACM Transactions on Networking (TON), 23(4):1257–1270, 2015.

[238] Jun Zhang, Yang Xiang, Yu Wang, Wanlei Zhou, Yong Xiang, and Yong Guan. Network traffic
classification using correlation information. IEEE Transactions on Parallel and Distributed Systems,
24(1):104–117, 2013.

[239] Jun Zhang, Chao Chen, Yang Xiang, Wanlei Zhou, and Yong Xiang. Internet traffic classification
by aggregating correlated naive Bayes predictions. IEEE Transactions on Information Forensics
and Security, 8(1):5–15, 2013.

[240] Shigang Liu, Jun Zhang, Yang Xiang, and Wanlei Zhou. Fuzzy-based information decomposition
for incomplete and imbalanced data learning. IEEE Transactions on Fuzzy Systems, 25(6):1476–
1490, 2017.

[241] Chao Chen, Yu Wang, Jun Zhang, Yang Xiang, Wanlei Zhou, and Geyong Min. Statistical features-
based real-time detection of drifted Twitter spam. IEEE Transactions on Information Forensics and
Security, 12(4):914–925, 2017.

[242] Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang, Olivier De Vel, and Paul Montague.
Cross-project transfer representation learning for vulnerable function discovery. IEEE Transactions
on Industrial Informatics, 14(7):3289–3297, 2018.

Swinburne Research

Appendix A. Authorship Indication Form


For HDR students

NOTE
This Authorship Indication form is a statement detailing the percentage of the contribution of each author
in each published ‘paper’. This form must be signed by each co-author and the Principal Supervisor.
This form must be added to the publication of your final thesis as an appendix. Please fill out a separate
form for each published paper to be included in your thesis.

DECLARATION
We hereby declare our contribution to the publication of the ‘paper’ entitled:

The Audio Auditor: User-Level Membership Inference in Internet of Things Voice Services

First Author

Name: Yuantian Miao    Signature:

Percentage of contribution: 50 %    Date: 29/04/2021

Brief description of contribution to the ‘paper’ and your central responsibilities/role on project:

Conception and design
System design and implementation
Writing the manuscript

Second Author

Name: Minhui Xue    Signature:

Percentage of contribution: 10 %    Date: 29/04/2021

Brief description of your contribution to the ‘paper’:

Conception and design
Revising the manuscript
Third Author

Name: Chao Chen    Signature:

Percentage of contribution: 10 %    Date: 29/04/2021

Brief description of your contribution to the ‘paper’:

Conception and design
Revising the manuscript
Fourth Author

Name: Lei Pan    Signature:

Percentage of contribution: 10 %    Date: 05/04/2021

Brief description of your contribution to the ‘paper’:

Conception and design
Revising the manuscript

Fifth Author

Name: Jun Zhang    Signature:

Percentage of contribution: 5 %    Date: 30/04/2021

Brief description of your contribution to the ‘paper’:

Conception and design
Proofreading

Sixth Author

Name: Benjamin Zi Hao Zhao    Signature:

Percentage of contribution: 5 %    Date: 30/04/2021

Brief description of your contribution to the ‘paper’:

Conception and design
Proofreading

Seventh Author

Name: Dali Kaafar    Signature:

Percentage of contribution: 5 %    Date: 03/05/2021

Brief description of your contribution to the ‘paper’:

Conception and design
Proofreading

Eighth Author

Name: Yang Xiang    Signature:

Percentage of contribution: 5 %    Date: 04/05/2021

Brief description of your contribution to the ‘paper’:

Conception and design
Proofreading
Principal Supervisor:

Name: Yang Xiang    Signature:

Date: 04/05/2021

In the case of more than four authors please attach another sheet with the names, signatures and
contribution of the authors.

Authorship Indication Form
