Abstract—Identifying the model of the cell-phone with which an audio recording was made serves an important forensic purpose. In this paper, we propose a channel attention mechanism based on subband awareness to focus on the most relevant parts of the frequency bands and produce more efficient feature representations for the cell-phone recognition task. A multi-stream network is introduced to fully exploit the differences among frequency bands, which provide critical clues to the fingerprints left by the built-in microphones, and is thus able to recognize cell-phones from different manufacturers and even different models from the same manufacturer. The effectiveness of the proposed design is validated on a collection of speaker-independent audio recordings from 20 models of cell-phones made by 5 major manufacturers. In particular, the fusion of data augmentation and the attention strategy greatly increases the robustness of the scheme when additive noise is present in the recorded audio. Finally, the salient regions reveal the critical subbands for the recognition of different cell-phone models.

Index Terms—Cell-phone recognition, convolutional neural networks, frequency subbands, attention mechanism.

I. INTRODUCTION

Device attribution provides an important clue to tracking down the source of multimedia data. For example, it can be used to verify ownership or to assist in authenticating the multimedia data. Many efforts have been made to reliably link an image to the specific model of its source camera by exploiting the unique sensor pattern noise [1]–[3]. For the same purpose, microphone identification has been proposed to associate an audio recording with its acquisition device. In recent years, the focus of audio source attribution has shifted from microphone identification to cell-phone identification due to the massive availability of handheld mobile devices.

Cell-phone identification from audio recordings boils down to the identification of the built-in microphones. In general, the major clues for cell-phone identification using audio recordings can be cast as the estimation of the channel response of the microphone; the footprints left by various microphones are therefore reflected in the distortion of the recorded audio. Most existing works make use of features from the audio signals directly. A pioneering work by Kraetzer et al. utilized a combination of features from the time domain, frequency domain and cepstral domain to identify the traces left by microphones and their environments [4]. In [5], Fourier coefficients proved to be effective in microphone classification. Aware that passing the audio spectrum through a bank of filters may provide more oriented feature representations, Hanilci et al. leveraged Mel frequency cepstrum features for cell-phone identification [6]. As silent regions might be less affected by speakers and speech content, Hanilci et al. extended their work by utilizing cepstral features extracted from the speech-free regions [7]. In [8], blind channel estimation was introduced to model the frequency response of the microphone, and was further used for microphone classification. To characterize the frequency response of different models of cell-phones, Luo et al. proposed a feature descriptor based on band energy difference [9]. Separating the device fingerprint from the recorded audio signal can be formulated as an underdetermined problem that is typically solved by exploiting source sparsity [10], [11]. Motivated by this, kernel-based projection has fostered a number of cell-phone verification and clustering systems, in which high-dimensional supervectors derived from Gaussian mixture models are involved [12]–[15]. In recent years, driven by the success of deep learning in a variety of recognition tasks, automatic feature learning from labeled audio samples has become an emerging trend for cell-phone recognition. An attempt to identify microphones was made by training a CNN model [16], whose effectiveness was validated on a set of single tones at 1000 Hz. A similar CNN architecture was presented in [17] to recognize cell-phone models, but with voiced signals composing the data set and with speaker independence taken into account.

A common observation from the aforementioned studies is that better performance can be achieved for the identification of cell-phone models of different brands, while performance degrades when it comes to recognizing different models by the same manufacturer. This is understandable, since different manufacturers implement different circuit designs, leading to more distinct frequency responses across brands than across models of the same brand. In addition, the vulnerability of cell-phone recognition systems can easily be exposed by introducing external noise. To address these challenges, we introduce a novel multi-stream CNN model with channel attention in each stream to compose more relevant feature representations. Further, by incorporating a data augmentation strategy into the training process, the recognition rate for noisy audio can be significantly improved.

Manuscript received December 8, 2019; revised February 25, 2020; accepted March 26, 2020. Date of publication April 6, 2020; date of current version April 30, 2020. This work was supported by the NSFC under Grant 61976098 and by the research fund of Huaqiao University. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Dezhong Peng. (Corresponding author: Xiaodan Lin.)

Xiaodan Lin and Donghua Chen are with the School of Information Science and Engineering, Huaqiao University, Xiamen 361011, China (e-mail: xd_lin@hqu.edu.cn; dhchen@hqu.edu.cn).

Jianqing Zhu is with the School of Engineering, Huaqiao University, Quanzhou 362021, China (e-mail: jqzhu@hqu.edu.cn).

Digital Object Identifier 10.1109/LSP.2020.2985594

II. PROPOSED ATTENTION AUGMENTED NETWORK

To capture the subband-relevant information for the task of cell-phone recognition, we propose a multi-stream network
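As a minimal sketch of the multi-stream front end, the frequency axis of a spectrogram can be split into contiguous subbands, one per stream. The equal-width split, the number of streams, and the array shapes below are illustrative assumptions, not parameters taken from the paper:

```python
import numpy as np

def split_subbands(spec, n_streams=3):
    """Split the frequency axis (axis 0) of a spectrogram into
    n_streams contiguous subbands, one per network stream."""
    # np.array_split tolerates sizes that do not divide evenly
    return np.array_split(spec, n_streams, axis=0)

rng = np.random.default_rng(1)
spec = rng.random((257, 61))   # (frequency bins, time frames), illustrative
bands = split_subbands(spec)   # three subband slices covering all bins
```

Each stream of the network would then learn features from its own band, so that band-specific fingerprints are not averaged away by a single shared front end.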
1070-9908 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
606 IEEE SIGNAL PROCESSING LETTERS, VOL. 27, 2020
A. Audio Spectrogram

Audio acquisition by a cell-phone normally follows the pipeline of converting the mechanical vibration of sound into an electrical signal, A/D processing, and encoding. The major difference among cell-phone models lies in the first two stages, where various forms of circuitry for the transducer, A/D converter and pre-amplifier are employed. Therefore, the artifacts introduced in these stages are regarded as intrinsic fingerprints of cell-phones and are coupled into the audio itself.

As the audio spectrogram depicts the energy variation of the audio signal across both frequency and time, it should be able to visualize some of the distinctions caused by different mobile phones. Fig. 1 presents the audio spectrograms of an audio clip recorded by different cell-phones. It is evident that the artifacts are reflected in different regions of the spectrograms. For example, a more intense energy distribution around 0–1 kHz is observed in the spectrogram produced by the Huawei device, while differences in energy distribution can be observed around 0–1 kHz and 7–8 kHz for the iPhone X and iPhone 6s. This implies that the footprint left by different models of cell-phones can be traced through the energy distribution across different frequency subbands, which motivates us to learn features separately from different parts of the spectrogram.

B. CNN Model With Subband Attention

The attention mechanism is introduced to focus on the most relevant features among the abundance of features in a deep learning framework, and has proved successful in computer vision and natural language processing tasks [18]–[20]. For the cell-phone recognition task, since the frequency bands weigh differently, a CNN model that incorporates the attention mechanism is expected to capture the device fingerprints more effectively. Therefore, we propose a multi-stream CNN design with channel attention in each stream, as shown in Fig. 2. We use three basic blocks, each containing two convolutional layers and a pooling layer, whose parameters are listed in Table I. A rectified linear unit provides the non-linearity of each convolutional layer. An SGDM optimizer with a mini-batch size of 80 is used.

The attention module is essentially a channel-attention layer inspired by [21] that weights the significance among the feature maps. Given the features A_k from the previous layer, a set of weights is imposed on the given features to yield the output of the attention layer as shown in (4). The weights are computed according to (1)–(3), where f(·) denotes an average pooling over the given feature map as shown in (3), assuming that the feature map A_k has a size of H × W. σ indicates non-linear operations and makes it possible to exploit dependencies among the feature maps. In practice, σ is composed of a dimension-reduction layer with the reduction ratio set to 8, a non-linear ReLU unit, and then a dimension-increasing layer, where the dimension reduction and increase
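Since equations (1)–(4) are not reproduced here, the following is a hedged sketch of a squeeze-and-excitation-style channel attention layer in the spirit of [21]: each feature map is average-pooled to a scalar, passed through a bottleneck with reduction ratio 8, and squashed to a weight in (0, 1) that rescales its map. The weight matrices, shapes, and random initialization are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

def channel_attention(A, w_reduce, w_expand):
    """SE-style channel attention over K feature maps of size H x W:
    squeeze (average pool per map), excite (reduce -> ReLU -> expand
    -> sigmoid), then rescale each map by its weight."""
    z = A.mean(axis=(1, 2))                      # squeeze: (K,) descriptors
    h = np.maximum(w_reduce @ z, 0.0)            # ReLU after reduction: (K//r,)
    s = 1.0 / (1.0 + np.exp(-(w_expand @ h)))    # sigmoid weights in (0, 1): (K,)
    return s[:, None, None] * A                  # reweighted feature maps

rng = np.random.default_rng(0)
K, H, W, r = 64, 12, 12, 8                       # r = reduction ratio from the text
A = rng.standard_normal((K, H, W))               # feature maps from the previous layer
w_reduce = 0.1 * rng.standard_normal((K // r, K))
w_expand = 0.1 * rng.standard_normal((K, K // r))
out = channel_attention(A, w_reduce, w_expand)
```

Because the sigmoid weights lie in (0, 1), the layer can only attenuate channels, letting training push the weights of uninformative subband features toward zero.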
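The subband analysis above builds on the audio spectrogram of Section II-A. For concreteness, a minimal sketch of computing such a time–frequency representation follows; the window length, hop size, and sampling rate are assumptions for illustration, not parameters from the paper:

```python
import numpy as np

def spectrogram(x, n_fft=512, hop=256, fs=16000):
    """Magnitude spectrogram: Hann-windowed frames + real FFT per frame.
    Returns (n_fft//2 + 1, n_frames) magnitudes and the bin frequencies."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # frequency on axis 0
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return spec, freqs

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)                   # one second of a 1 kHz tone
spec, freqs = spectrogram(x)
peak_hz = freqs[spec.mean(axis=1).argmax()]        # strongest frequency bin
```

On the synthetic tone, the energy concentrates in the bin nearest 1 kHz, mirroring how band-limited device artifacts show up as localized energy in the spectrogram.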
LIN et al.: SUBBAND AWARE CNN FOR CELL-PHONE RECOGNITION 607
TABLE II: RECORDING DEVICES USED
TABLE IV: RECOGNITION ACCURACY OF CELL-PHONES FROM THE SAME MANUFACTURER UNDER NOISY CONDITIONS