Privacy against Real-Time Speech Emotion Detection via Acoustic Adversarial Evasion of Machine Learning

BRIAN TESTA, YI XIAO, HARSHIT SHARMA, AVERY GUMP, and ASIF SALEKIN
Motivation
• Emotions can be stolen via smart speakers!
• Emotion has a profound impact on decision making.
• Large technology firms may exploit emotion data for profit (surveillance capitalism).
• About 52% of the UK population are not interested in using voice assistants (VAs) due to privacy concerns.
Objectives

• Protect smart speaker users' private emotion information.
• Deceive a black-box SER classifier without compromising speech-to-text transcription on unheard utterances.
• Compare the proposed approach with state-of-the-art audio evasion techniques.
• Assess whether a knowledgeable SER operator can defend against the proposed technique.
• Evaluate the feasibility of implementing privacy protection in real-world smart speaker scenarios.
Proposed Solution

• DARE-GP is designed to be non-invasive: it requires no privileged access to the smart speaker and makes no assumptions about the black-box SER classifier.
• The solution is also robust against various state-of-the-art audio evasion defenses employed by a knowledgeable adversary.
• DARE-GP's evaluations demonstrate its superior performance compared to existing SER evasion techniques and its effectiveness in a real-world, over-the-air deployment against commercial smart speakers.
• Future work includes evaluating DARE-GP in a broader range of in-the-wild situations to determine its effectiveness and any long-term invasiveness concerns with EONs.
Methodology

• EON (Emotion Obfuscating Noise):
– A universal spectral perturbation.
– Generated by combining 𝑁 different tones, each with a different frequency, amplitude, and temporal variation.
– Masks the spectral attributes of speech that convey emotional information.
– EONs are non-invasive and can be played simultaneously with users' speech, ensuring real-time protection.
• EONs are generated using a high-level genetic programming (GP) approach.
– Constraints:
• The surrogate SER classifier should misclassify the perturbed speech.
• The transcription service should still extract the correct text.
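The tone-combination idea above can be sketched in Python. The parameter layout (a frequency, amplitude, modulation-rate triple per tone) and the sinusoidal amplitude envelope used for "temporal variation" are illustrative assumptions, not the paper's exact EON encoding:

```python
import numpy as np

def synthesize_eon(params, sr=16000, duration=1.0):
    """Build an EON by summing N tones, each defined by a
    (frequency, amplitude, modulation-rate) triple.

    The sinusoidal amplitude envelope is an illustrative choice for
    the tones' temporal variation; the paper's encoding may differ.
    """
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    eon = np.zeros_like(t)
    for freq, amp, mod_rate in params:
        # Temporal variation: slow sinusoidal amplitude envelope.
        envelope = amp * (0.5 + 0.5 * np.sin(2 * np.pi * mod_rate * t))
        eon += envelope * np.sin(2 * np.pi * freq * t)
    # Normalize so the perturbation stays within a fixed loudness budget.
    peak = np.max(np.abs(eon))
    return eon / peak if peak > 0 else eon

# Three tones with different frequencies, amplitudes, and modulation rates.
noise = synthesize_eon([(300.0, 0.8, 2.0), (950.0, 0.5, 0.5), (2200.0, 0.3, 4.0)])
speech = np.zeros(16000)        # placeholder for a user's utterance
mixed = speech + 0.1 * noise    # EON played quietly under the speech
```

Because the EON does not depend on the utterance, it can be precomputed once and simply played alongside whatever the user says.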
Methodology

• Fitness:
– Ranks each individual in the population.
– Based upon an EON's ability to mislead the surrogate SER classifier C*, and its ability to do so without compromising the underlying audio's transcription (see Equations 1-3).
• Selection:
– Selects a subset of individuals from the population to carry forward into the next generation before crossover and mutation. Selection is performed using a tournament selection method [54] with a tournament size of 𝑛𝑆𝑒𝑙.
– This guarantees that at least the 𝑛𝑆𝑒𝑙 − 1 weakest individuals are eliminated from each generation.
• Crossover:
– A method to generate new individuals (i.e., offspring) from previously selected ones (i.e., parents), thus generating new EONs: the parameters of two existing individuals are exchanged to create two new individuals.
• Mutation:
– Used to prevent population stagnation.
– New individuals are generated by randomly modifying select individuals' EON-generation parameters. Mutation introduces the greatest amount of variability in the population; a given individual can undergo any number of changes, leading to significant improvement or degradation. It is performed by randomly shuffling EON parameters (scaled to [0,1]) with probability 𝑝𝑀𝑈𝑋.
• Final EON Selection:
– After iterating for 𝐾 generations, the EON with the highest evasion success rate (ESR) on the validation dataset is selected.
• Hyperparameters:
– 𝑛𝑆𝑒𝑙, 𝑝𝐶𝑋, and 𝑝𝑀𝑈𝑋 are hyperparameters, identified empirically through grid search.
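The selection/crossover/mutation loop above can be sketched as a generic GP driver. The one-point crossover, uniform re-draw mutation, and fixed tournament size are illustrative choices, and a toy sum-of-genes objective stands in for the EON fitness:

```python
import random

def tournament_select(pop, fitness, n_sel, t_size):
    """Pick n_sel survivors; each is the fittest of a random tournament."""
    return [max(random.sample(pop, t_size), key=fitness) for _ in range(n_sel)]

def crossover(p1, p2):
    """One-point crossover over the EON parameter vectors."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(ind, p_mux):
    """Re-draw each parameter (scaled to [0,1]) with probability p_mux."""
    return [random.random() if random.random() < p_mux else g for g in ind]

def evolve(pop, fitness, k, n_sel, p_cx, p_mux, t_size=3):
    """Run k generations of select -> crossover -> mutate."""
    for _ in range(k):
        parents = tournament_select(pop, fitness, n_sel, t_size)
        offspring = []
        while len(offspring) < len(pop):
            p1, p2 = random.sample(parents, 2)
            c1, c2 = crossover(p1, p2) if random.random() < p_cx else (p1[:], p2[:])
            offspring += [mutate(c1, p_mux), mutate(c2, p_mux)]
        pop = offspring[:len(pop)]
    return max(pop, key=fitness)

# Toy demo: maximize the sum of a 6-parameter genome.
random.seed(0)
population = [[random.random() for _ in range(6)] for _ in range(20)]
best = evolve(population, fitness=sum, k=25, n_sel=8, p_cx=0.7, p_mux=0.1)
```

In the actual system the fitness call is expensive (it runs the surrogate classifier and a transcription check over a training set), which is why the population and generation counts matter.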
Results & Findings
Calculations
𝑓𝑖𝑡𝑛𝑒𝑠𝑠(𝑖𝑛𝑑) = 𝑑𝑒𝑐𝑒𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑) × 𝑡𝑟𝑎𝑛𝑠𝑐𝑟𝑖𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑)
where 𝑑𝑒𝑐𝑒𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑) aggregates, over the training utterances 𝑥, how often the surrogate classifier C* mislabels the EON-mixed audio (plus a per-sample 𝑏𝑜𝑛𝑢𝑠(𝑖𝑛𝑑, 𝑥) term), and 𝑡𝑟𝑎𝑛𝑠𝑐𝑟𝑖𝑝𝑡𝑖𝑜𝑛(𝑖𝑛𝑑) measures how well the underlying audio's transcription is preserved (Equations 1-3 of the paper).
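The multiplicative structure of the fitness can be sketched as follows. Here 𝑑𝑒𝑐𝑒𝑝𝑡𝑖𝑜𝑛 is simplified to a 0/1 misclassification rate without the paper's bonus term, 𝑡𝑟𝑎𝑛𝑠𝑐𝑟𝑖𝑝𝑡𝑖𝑜𝑛 is an exact-match rate rather than a similarity score, and all function arguments are hypothetical stand-ins for the surrogate classifier and transcription service:

```python
def fitness(ind, clips, labels, classifier, transcribe, clean_transcripts):
    """Multiplicative fitness: an EON scores well only if it both
    flips the surrogate SER prediction AND preserves the transcript.
    (Simplified: no bonus term, exact-match transcription check.)"""
    deceived = 0
    kept = 0
    for clip, label, ref in zip(clips, labels, clean_transcripts):
        perturbed = clip + ind                  # mix the EON with the utterance
        if classifier(perturbed) != label:      # surrogate SER fooled?
            deceived += 1
        if transcribe(perturbed) == ref:        # transcript preserved?
            kept += 1
    deception = deceived / len(clips)
    transcription = kept / len(clips)
    return deception * transcription

# Toy scalar "audio": the classifier thresholds at 0.5, the transcriber
# is unaffected by the perturbation.
score = fitness(0.4, clips=[0.2, 0.8], labels=[0, 1],
                classifier=lambda x: int(x > 0.5),
                transcribe=lambda x: "hello",
                clean_transcripts=["hello", "hello"])
```

The product (rather than a sum) means an EON that destroys transcription scores zero no matter how well it fools the classifier, which directly encodes the constraint from the Methodology slide.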
Datasets
• RAVDESS Dataset:
– 1,440 samples from 24 speakers, with an equal split between male and female actors.
– The spoken part of the dataset includes two separate utterances by each speaker, demonstrating 8 different emotions.
– Serves as a valuable resource for training and evaluating Emotion Obfuscating Noises (EONs) in the DARE-GP methodology.
• TESS Dataset:
– 1,800 audio samples generated by two actresses, covering five emotions: neutral, angry, happy, sad, and fearful.
– A subset of the TESS data is used for training and evaluating EONs, providing diverse emotional content for developing and assessing the methodology.
• EON Training and Evaluation:
– The RAVDESS data is used to train "factory default" EONs without tailoring for specific target users or environments.
– A portion of the TESS data is used for additional training on any black-box classifiers that underperformed on the original TESS data, contributing to the adaptability and robustness of the EONs.
• Evaluation Metrics:
– DARE-GP's success is evaluated using metrics such as Evasion Success Rate (ESR) and False Label Independence, ensuring the protection of emotional privacy and the utility of the modified audio samples.
– The black-box evaluation involves training EONs on specific data subsets and evaluating their performance on dedicated evaluation datasets.
• Role in Methodology:
– Both datasets play a crucial role in training, tailoring, and evaluating the effectiveness of EONs in obfuscating emotional content and evading Speech Emotion Recognition (SER) classifiers.
– They provide diverse emotional speech samples, enabling the development and validation of EONs for real-world deployment scenarios.
Discussions
• Research Questions:
– The experiments address whether the approach can deceive unseen black-box SER classifiers without compromising speech-to-text transcription, how it compares with state-of-the-art audio evasion techniques, and whether a knowledgeable SER operator can defend against it.
– Real-world deployment scenarios involving off-the-shelf smart speakers, variable user locations, and SWaP (Size, Weight, and Power) constraints are also considered.
• EON Generation Process:
– "Factory default" generic EONs are digitally fine-tuned to users' speech through iterations/generations of the Genetic Programming (GP) approach.
– Final EON selection evaluates the fitness of EONs at different loudness levels and selects the most suitable EON for deployment in the target environment.
• Acoustic Evaluation Recordings:
– Fine-tuning of EONs involves pre-training on a "canonical" dataset (RAVDESS) and subsequent fine-tuning with a target household's user data (TESS) to limit in-home training time.
– Collecting user recordings from smart speakers is challenging because the necessary APIs are unavailable.
Discussions
• Evasion Success Metrics:
– DARE-GP's success is evaluated using metrics such as Evasion Success Rate (ESR) and False Label Independence, ensuring the protection of emotional privacy and the utility of the modified audio samples.
• Role of Datasets:
– The RAVDESS and TESS datasets play a crucial role in training, tailoring, and evaluating the effectiveness of EONs in obfuscating emotional content and evading Speech Emotion Recognition (SER) classifiers.
• Real-World Deployment Considerations:
– The discussion extends to deploying EONs in acoustic, real-world scenarios involving off-the-shelf smart speakers, variable user locations, and SWaP constraints, highlighting the practical implications of the methodology.
Strong Points
Limitations
Conclusion
Question & Answer
Components of a Sound Waveform
• 1. Frequency: This refers to how many times the wave repeats itself
per unit time, measured in Hertz (Hz). Higher frequencies
correspond to higher-pitched sounds, while lower frequencies
correspond to lower-pitched sounds. For instance, middle C on a
piano vibrates at approximately 261.6 Hz.
• 2. Amplitude: This refers to the height of the wave peaks, indicating
the sound's loudness or intensity. Larger amplitudes represent
louder sounds, while smaller amplitudes represent quieter sounds.
• 3. Timbre: This refers to the quality or "color" of the sound, which
distinguishes it from other sounds even at the same pitch and
loudness. Timbre is determined by the presence and relative
strengths of harmonics, which are additional frequencies related to
the fundamental frequency.
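These three components can be demonstrated by synthesizing two tones of identical pitch and loudness that differ only in harmonic content, and therefore in timbre (a minimal NumPy sketch; the harmonic amplitude values are arbitrary illustrative choices):

```python
import numpy as np

SR = 16000  # sample rate in Hz

def tone_with_harmonics(f0, harmonic_amps, duration=0.5, sr=SR):
    """Sum a fundamental frequency f0 and its integer harmonics.
    The relative harmonic amplitudes determine the timbre; the peak
    is normalized so both tones have the same loudness."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    wave = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harmonic_amps))
    return wave / np.max(np.abs(wave))

# Same pitch (middle C, ~261.6 Hz) and loudness, different timbres:
pure = tone_with_harmonics(261.6, [1.0])             # sine: no overtones
rich = tone_with_harmonics(261.6, [1.0, 0.5, 0.25])  # added harmonics
```

Played back, both tones register as the same note at the same volume, but `rich` sounds "brighter"; only the harmonic mix (timbre) distinguishes them.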
About EON
• A universal spectral perturbation.
• Generated by combining 𝑁 different tones, each with a different frequency, amplitude, and temporal variation.
• Masks the spectral attributes of speech that convey emotional information.
• EONs are non-invasive and can be played simultaneously with users' speech, ensuring real-time protection.
Kernel SHAP (SHapley Additive exPlanations)

• Kernel SHAP is a method used for explaining the output of machine learning models.
• It approximates the conditional expectations of Shapley values, which
measure the importance of input attributes in a model's output.
• In the context of the DARE-GP methodology, Kernel SHAP is used to
analyze the impact of spectral perturbations on the output of the
Speech Emotion Recognition (SER) classifier.
• By masking specific frequency bands in the input audio, Kernel SHAP
helps measure the significance of these frequency bands in influencing
the classifier's predictions.
• This analysis aids in understanding how the Emotion Obfuscating
Noises (EONs) cause misclassification by perturbing specific spectral
attributes related to emotional information in speech.
• Overall, Kernel SHAP provides insights into the importance of different
spectral attributes in the SER classifier's decision-making process,
helping to explain how EONs effectively mask emotional information in
users' speech.
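The band-masking idea can be illustrated with a simplified occlusion analysis: zero out one frequency band at a time and record how much the model's score drops. Note this is only the masking building block; full Kernel SHAP instead evaluates the model on *coalitions* of masked bands and fits a weighted linear model to recover Shapley values. The toy model below is a hypothetical stand-in for an SER classifier:

```python
import numpy as np

def band_importance(spectrum, model, n_bands=4):
    """Occlusion-style importance over frequency bands: zero out each
    band in turn and record the drop in the model's score."""
    base = model(spectrum)
    edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
    drops = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = spectrum.copy()
        masked[lo:hi] = 0.0          # remove this band's energy
        drops.append(base - model(masked))
    return np.array(drops)

# Toy "classifier" whose score depends only on the lowest band,
# so masking that band should account for all of the importance.
toy_model = lambda s: float(s[:16].sum())
scores = band_importance(np.ones(64), toy_model, n_bands=4)
```

Applied to an SER classifier, a large drop for a band indicates that the classifier leans on that band's energy, which is exactly where an EON's perturbation is most disruptive.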