
IEEE SIGNAL PROCESSING LETTERS, VOL. 27, 2020

Subband Aware CNN for Cell-Phone Recognition


Xiaodan Lin, Jianqing Zhu, and Donghua Chen

Abstract—Identifying the model of the cell-phone with which an audio recording is made serves an important forensic purpose. In this paper, we propose a channel attention mechanism based on subband awareness that focuses on the most relevant parts of the frequency bands to produce more efficient feature representations for the cell-phone recognition task. A multi-stream network is introduced to fully exploit the differences among frequency bands, which provide critical clues to the fingerprints of the built-in microphones, and is thus able to recognize cell-phones from different manufacturers and even different models from the same manufacturer. The effectiveness of the proposed design is validated on a collection of speaker-independent audio recordings from 20 models of cell-phones made by 5 major manufacturers. In particular, the fusion of data augmentation and the attention strategy greatly increases the robustness of the scheme when additive noise is present in the recorded audio. Finally, the salient regions reveal the critical subbands for the recognition of different cell-phone models.

Index Terms—Cell-phone recognition, convolutional neural networks, frequency subbands, attention mechanism.

Manuscript received December 8, 2019; revised February 25, 2020; accepted March 26, 2020. Date of publication April 6, 2020; date of current version April 30, 2020. This work was supported by the NSFC under Grant 61976098 and by the research fund of Huaqiao University. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Dezhong Peng. (Corresponding author: Xiaodan Lin.)

Xiaodan Lin and Donghua Chen are with the School of Information Science and Engineering, Huaqiao University, Xiamen 361011, China (e-mail: xd_lin@hqu.edu.cn; dhchen@hqu.edu.cn). Jianqing Zhu is with the School of Engineering, Huaqiao University, Quanzhou 362021, China (e-mail: jqzhu@hqu.edu.cn).

Digital Object Identifier 10.1109/LSP.2020.2985594

I. INTRODUCTION

DEVICE attribution provides an important clue for tracking down the source of multimedia data. For example, it can be used to verify ownership or to assist in authenticating the multimedia data. Many efforts have been made to reliably link an image to the specific model of its source camera by exploiting the unique sensor pattern noise [1]–[3]. For the same purpose, microphone identification has been proposed to associate an audio recording with its acquisition device. In recent years, the focus of audio source attribution has shifted from microphone identification to cell-phone identification due to the massive availability of handheld mobile devices.

Cell-phone identification from audio recordings boils down to the identification of the built-in microphones. In general, the major clues for cell-phone identification using audio recordings can be cast as estimates of the channel response of the microphone, so the footprints left by various microphones are reflected in the distortion of the recorded audio. Most existing works make use of features from the audio signals directly. A pioneering work by Kraetzer et al. utilized a combination of features from the time, frequency and cepstral domains to identify the traces left by microphones and recording environments [4]. In [5], Fourier coefficients proved to be effective for microphone classification. With the awareness that passing the audio spectrum through a bank of filters may provide more oriented feature representations, Hanilçi et al. leveraged Mel frequency cepstrum features for the task of cell-phone identification [6]. As silence regions may be less affected by speakers and speech content, Hanilçi et al. extended their work by utilizing cepstral features extracted from speech-free regions [7]. In [8], blind channel estimation was introduced to model the frequency response of the microphone and was further used for microphone classification. To characterize the frequency responses of different models of cell-phones, Luo et al. proposed a feature descriptor based on band energy difference [9]. Separating the device fingerprint from the recorded audio signal can be formulated as an underdetermined problem that is typically solved by exploiting source sparsity [10], [11]. Motivated by this, kernel-based projection has fostered a number of cell-phone verification and clustering systems in which high-dimensional supervectors derived from Gaussian mixture models are involved [12]–[15]. In recent years, driven by the success of deep learning in a variety of recognition tasks, automatic feature learning from labeled audio samples has become an emerging trend for cell-phone recognition. An attempt to identify microphones by training a CNN model was made in [16], with effectiveness validated on a set of single tones at 1000 Hz. A similar CNN architecture was presented in [17] to recognize cell-phone models, but with voiced signals composing the data set and with speaker independence taken into account.

A common observation from the aforementioned studies is that better performance can be achieved when identifying cell-phone models of different brands, while performance degrades when it comes to recognizing different models from the same manufacturer. This is understandable, since different circuit designs are implemented by different manufacturers, leading to more distinct frequency responses across brands. In addition, the vulnerability of cell-phone recognition systems is easily exposed by introducing external noise. To address these challenges, we introduce a novel multi-stream CNN model with channel attention in each stream to compose more relevant feature representations. Further, by incorporating a data augmentation strategy into the training process, the recognition rate for noisy audio can be significantly improved.


Fig. 1. Spectrograms of audios recorded by different models of cell-phones (Top: Huawei Mate 9; middle: iPhone X; bottom: iPhone 6S).

II. PROPOSED ATTENTION AUGMENTED NETWORK

To capture the subband-relevant information for the task of cell-phone recognition, we propose a multi-stream network architecture with channel attention. The audio spectrogram is spliced into eight segments along the frequency axis, each representing the energy variation within one subband, to form the input to each stream. In addition, an attention module is applied in each stream to capture the most relevant device-related information by reweighting the feature channels.

Fig. 2. The proposed network architecture.

A. Audio Spectrogram

Audio acquisition by a cell-phone normally follows this pipeline: converting the mechanical vibration of the sound into an electrical signal, A/D processing, and encoding. The major difference among cell-phone models lies in the first two stages, where various forms of circuitry for the transducer, A/D converter and pre-amplifier are employed. Therefore, the artifacts introduced in these stages are regarded as intrinsic fingerprints of cell-phones and are coupled into the audio.

As the audio spectrogram depicts the energy variation of the audio signal across both frequency and time, it should be able to visualize some of the distinctions caused by different mobile phones. Fig. 1 presents the spectrograms of an audio clip recorded by different cell-phones. It is evident that the artifact is reflected in different regions of the spectrograms. For example, a more intense energy distribution around 0–1 kHz is observed in the spectrogram produced by the Huawei device, while energy distribution differences can be observed around 0–1 kHz and 7–8 kHz for the iPhone X and iPhone 6S. This implies that the footprint left by different models of cell-phones can be traced through the energy distribution across different frequency subbands, which motivates us to learn features separately from different parts of the spectrogram.
B. CNN Model With Subband Attention

The attention mechanism is introduced to focus on the most relevant features among the abundance of features in a deep learning framework, and it has proved successful in computer vision and natural language processing tasks [18]–[20]. For the cell-phone recognition task, since different frequency bands carry different importance, a CNN model that incorporates the attention mechanism is expected to capture the device fingerprints more effectively. Therefore, we propose a multi-stream CNN design with channel attention in each stream, as shown in Fig. 2. We use three basic blocks, each containing two convolutional layers and a pooling layer, whose parameters are listed in Table I. A rectified linear unit provides the non-linearity after each convolutional layer. A stochastic gradient descent with momentum (SGDM) optimizer with a mini-batch size of 80 is used.

TABLE I: PARAMETERS OF THE PROPOSED CNN MODEL

The attention module is essentially a channel-attention layer inspired by [21] that weights the significance among the feature maps. Given the feature map A_k from the previous layer, a set of weights is imposed on the given features to yield the output of the attention layer, as shown in (4). The weights are computed according to (1)–(3), where f(·) denotes average pooling over the given feature map as shown in (3), assuming that the feature map A_k has a size of H × W. σ denotes a non-linear mapping that exploits dependencies among the feature maps. In practice, σ is composed of a dimension-reduction layer with the reduction ratio set to 8, a non-linear ReLU unit, and then a dimension-increasing layer, where the dimension reduction and increase are implemented by 1 × 1 convolution layers. The softmax mapping in (2) guarantees that the weights sum to 1, with c denoting the number of channels. Finally, the outputs from all streams are concatenated and fed to the fully connected layer to form the combined representation. A dropout layer and a fully connected layer follow before the final output is passed to the softmax classifier.

$$v_k = \sigma(f(A_k)) \qquad (1)$$

$$w_k = \frac{\exp(v_k)}{\sum_{i=1}^{c} \exp(v_i)} \qquad (2)$$

$$f(A_k) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} A_k(i, j) \qquad (3)$$

$$A_k^{*} = w_k \cdot A_k \qquad (4)$$
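To make the design concrete, the following PyTorch sketch (our reconstruction, not the authors' released code; the mention of an SGDM optimizer suggests the original implementation may have been in MATLAB) wires up the channel-attention layer of (1)–(4) and the multi-stream assembly of Fig. 2. The reduction ratio of 8, the 1 × 1 convolutions, the softmax over channels, the eight 64 × 71 subband inputs, and the dropout/fully-connected head follow the text; the filter widths, kernel sizes, hidden size, and the exact position of the attention module inside each stream are placeholders for the values in Table I.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubbandChannelAttention(nn.Module):
    """Channel attention of (1)-(4): softmax-normalized SE-style weights."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # sigma in (1): 1x1-conv dimension reduction (ratio 8) -> ReLU ->
        # 1x1-conv dimension increase, as described in the text.
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # a: (batch, c, H, W) feature maps A_k from the previous layer.
        pooled = F.adaptive_avg_pool2d(a, 1)            # (3): f(A_k)
        v = self.excite(pooled)                          # (1): v_k
        w = torch.softmax(v.flatten(1), dim=1)           # (2): weights sum to 1
        return w.view_as(v) * a                          # (4): A*_k = w_k . A_k


def conv_block(cin: int, cout: int) -> nn.Sequential:
    # One basic block: two convolutional layers (ReLU after each, per the
    # text) followed by a pooling layer. Kernel sizes are placeholders.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, ceil_mode=True),
    )


class SubbandAwareCNN(nn.Module):
    """Multi-stream network of Fig. 2: one stream per frequency subband."""

    def __init__(self, num_classes: int = 5, num_streams: int = 8,
                 widths=(16, 32, 64), hidden: int = 256):
        super().__init__()
        # widths and hidden are hypothetical; Table I holds the real values.
        self.streams = nn.ModuleList(
            nn.Sequential(
                conv_block(1, widths[0]),
                conv_block(widths[0], widths[1]),
                conv_block(widths[1], widths[2]),
                SubbandChannelAttention(widths[2]),
                nn.AdaptiveAvgPool2d(1),   # simplification: pool before flatten
                nn.Flatten(),
            )
            for _ in range(num_streams)
        )
        # Concatenation -> fully connected -> dropout -> fully connected,
        # with the softmax applied by the cross-entropy loss during training.
        self.head = nn.Sequential(
            nn.Linear(num_streams * widths[2], hidden), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, subbands: torch.Tensor) -> torch.Tensor:
        # subbands: (batch, 8, 64, 71), one channel per frequency subband.
        feats = [s(subbands[:, k:k + 1]) for k, s in enumerate(self.streams)]
        return self.head(torch.cat(feats, dim=1))
```

As a quick check, SubbandAwareCNN(num_classes=5)(torch.randn(4, 8, 64, 71)) returns a 4 × 5 tensor of class logits.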

TABLE II: RECORDING DEVICES USED

Fig. 3. Confusion matrix for the recognition of 5 brands of cell-phones.
III. EXPERIMENTS

A. Experimental Settings

The audio files are captured with the cell-phones whose models are listed in Table II. For each cell-phone model, 60 minutes of audio recordings are collected. Each audio file is sampled at 16 kHz mono and cut into clips of 2 seconds. In preprocessing, an amplitude spectrogram of size 513 × 71 is extracted from each audio clip, and the input to each stream of the CNN is composed by splitting the spectrogram into eight different subbands, each of size 64 × 71. Note that the DC component is discarded, as it represents only the average energy level of the audio signal. The window size employed for audio framing is 30 ms, with an overlap of 2 ms between adjacent windows. In all experiments, the speakers and audio contents in the training set and the test set do not overlap. For data augmentation, audio samples polluted by 20 dB white Gaussian noise are generated to complement the training set, so that the number of audio clips in the training set is doubled.
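These numbers pin down the transform: 513 frequency bins imply a one-sided 1024-point FFT, and a 30 ms (480-sample) window advanced with a 2 ms (32-sample) overlap yields 71 frames per 2-second clip. The sketch below reproduces this preprocessing and the 20 dB white-noise augmentation with NumPy/SciPy; the tooling, function names, and the Hann analysis window (SciPy's default) are our assumptions, as the paper does not publish code.

```python
import numpy as np
from scipy.signal import stft

FS = 16_000
WIN = int(0.030 * FS)        # 480-sample (30 ms) analysis window
OVERLAP = int(0.002 * FS)    # 32-sample (2 ms) overlap between windows
NFFT = 1024                  # one-sided 1024-point FFT -> 513 frequency bins

def subband_inputs(clip: np.ndarray) -> np.ndarray:
    """Turn a 2 s, 16 kHz clip into eight 64 x 71 subband spectrograms."""
    # Window type (Hann, SciPy's default) is an assumption.
    _, _, z = stft(clip, fs=FS, nperseg=WIN, noverlap=OVERLAP,
                   nfft=NFFT, boundary=None, padded=False)
    spec = np.abs(z)          # amplitude spectrogram, shape (513, 71)
    spec = spec[1:, :]        # discard the DC bin (average energy level)
    return spec.reshape(8, 64, spec.shape[1])   # eight 64 x 71 subbands

def add_white_noise(clip: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Augmentation: add white Gaussian noise at a target SNR (20 dB here)."""
    noise = np.random.randn(len(clip))
    # Scale the noise so that 10*log10(P_signal / P_noise) = snr_db.
    p_sig, p_noise = np.mean(clip ** 2), np.mean(noise ** 2)
    noise *= np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return clip + noise
```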
B. Classification on Cell-Phones of Different Brands

In this test, the numbers of audio clips used for training and testing are 2400 and 1200 per class, respectively. Due to the unbalanced number of audio clips for each brand, we randomly choose an equal number of audio clips from each model of the same brand. For example, 480 audio clips are selected from each iPhone model to compose its training set, and 240 audio clips per model to compose its test set. The confusion matrix is displayed in Fig. 3. It is observed that the recognition accuracies for all five brands of cell-phones exceed 95%, demonstrating the effectiveness of the proposed scheme.

TABLE III: RECOGNITION ACCURACY OF CELL-PHONES FROM THE SAME MANUFACTURER

C. Classification on Cell-Phones of the Same Brand

In this subsection, we transfer the network architecture in Fig. 2 to the task of recognizing different models of cell-phones made by the same manufacturer. For this experiment, the network architecture remains the same, except that the number of neurons in the softmax layer is set to 5, 5, 3, 4 and 3 for the intra-brand classification of Huawei, iPhone, Samsung, Mi and Oppo, respectively, based on the available models of each brand. A balanced training set is employed for each brand, i.e., 1200 audio clips per model are selected to compose the training set, while 600 audio clips per model are used for testing. To avoid bias from random splitting, three-fold cross-validation is applied, and the average recognition rate for each cell-phone brand is reported in Table III. It is seen from Table III that the accuracy for recognizing different models of the same brand remains satisfactory. However, we also find that mis-classification is likely to occur between similar models such as iPhone 6 and iPhone 6S. Further, a performance comparison with two existing systems is conducted using the same training and test sets. The average recognition rates over three folds are given in Table III. As the results show, our proposed method consistently outperforms the other two for each cell-phone brand.
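Since this transfer changes only the size of the softmax layer, it amounts to rebuilding the classification head. A hypothetical helper on top of the SubbandAwareCNN sketch from Section II (the per-brand class counts follow the text; the helper itself is ours):

```python
# Hypothetical helper, assuming the SubbandAwareCNN sketch from Section II.
# Per-brand model counts (5, 5, 3, 4, 3) follow the text.
MODELS_PER_BRAND = {"Huawei": 5, "iPhone": 5, "Samsung": 3, "Mi": 4, "Oppo": 3}

def network_for_brand(brand: str) -> SubbandAwareCNN:
    return SubbandAwareCNN(num_classes=MODELS_PER_BRAND[brand])
```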
D. Evaluations Under Noisy Conditions

In real-life applications, audio recordings are usually contaminated by noise of varying levels. In order to evaluate the robustness of the proposed method under noisy conditions, white Gaussian noise with an SNR of 15 dB or 25 dB is added to the audio recordings.


For data augmentation in the training phase, 20 dB white Gaussian noise is imposed on the audio clips. In addition, we also compare the performance of a variant network obtained by removing the attention layer. Therefore, three scenarios are considered for recognizing cell-phones of the five brands (a sketch of the noise injection used here follows the list):
1) the network in Fig. 2, with data augmentation applied;
2) a variant of the network in Fig. 2 without the attention layer, with data augmentation applied;
3) the network in Fig. 2, without data augmentation.
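The noisy material for these scenarios can be generated with the same SNR-controlled injection used for augmentation. A minimal illustrative sketch, reusing add_white_noise from the preprocessing sketch and assuming test_clips is a list of 2-second waveform arrays (both names are ours):

```python
# Illustrative only: noisy test conditions at 15 dB and 25 dB SNR, reusing
# add_white_noise from the preprocessing sketch; test_clips is assumed to
# be a list of 2-second waveform arrays.
noisy_test_sets = {
    snr_db: [add_white_noise(clip, snr_db=snr_db) for clip in test_clips]
    for snr_db in (15.0, 25.0)
}
```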
We repeat the experiments for inter-brand recognition three times by randomly selecting different samples for training and testing, where the numbers of training and test instances remain the same as in Subsection B. The average recognition rates over the 5 cell-phone brands for the above scenarios are displayed in Fig. 4. As the figure shows, cell-phone recognition under noisy conditions is far more difficult, with a significant drop in recognition rate even when the added noise is mild, e.g., at an SNR of 25 dB. This suggests that cell-phone recognition can be negatively affected when disturbances are added to the recorded audio samples. However, the attention mechanism and data augmentation both contribute to improved performance under noisy conditions. Specifically, data augmentation boosts the recognition rate dramatically, indicating that the proposed network benefits greatly from training with noisy samples. In addition, the incorporated attention model also lends robustness to the given task, e.g., raising the accuracy by about 5% under the two noisy conditions, as the channel attention mechanism helps to retain the most relevant feature representation while ignoring interference such as background noise and speaker characteristics. Moreover, it should be noted that for clean audio signals, data augmentation degrades the performance slightly. A reasonable explanation is that the noisy training samples interfere with feature extraction in the shallow layers, as the noise spectrum partly overlaps with the device fingerprint.

Fig. 4. Recognition rates of 5 cell-phone brands at different noise levels.

In the following test, only scenario 1 is considered, i.e., both strategies are employed, for intra-brand classification under noisy conditions. Audio signals are corrupted by Gaussian noise at two SNRs, 15 dB and 25 dB. Table IV presents the average recognition rate for cell-phones of the same brand over three folds. Comparing Table III and Table IV, the following conclusions can be drawn. Firstly, intra-brand cell-phone recognition is hampered by noise, just like the distinction of different cell-phone brands. Secondly, recognizing cell-phones of the same make is more challenging than recognizing phones of different manufacturers. Nevertheless, the results in Table IV again validate the effectiveness of data augmentation with noisy samples and of channel attention, which enhances the feature representation by weighting the channels of the extracted features.

TABLE IV: RECOGNITION ACCURACY OF CELL-PHONES FROM THE SAME MANUFACTURER UNDER NOISY CONDITIONS

To examine the features learned from different frequency subbands, Fig. 5 visualizes the attended regions in each stream when audio clips captured by Huawei and iPhone devices are provided as inputs. As the figure shows, for the same audio recorded by different cell-phone brands, the attended regions differ. For example, the activations in the 3rd, 4th and 6th streams indicate the most discriminative frequency bands for distinguishing the two cell-phone brands. Besides, we also find that for the intra-brand recognition task, e.g., iPhone X versus iPhone 6S, the lowest and highest subbands tell the difference. These results suggest that for the identification of different cell-phone models, the proposed model can adjust its focus to different frequency bands.

Fig. 5. Visual comparisons of feature activation in each stream (Top: iPhone X; middle: Huawei Mate 9; bottom: iPhone 6S).
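The attended maps visualized in Fig. 5 can be collected, for example, with forward hooks on the attention layers of the SubbandAwareCNN sketch from Section II; the harvesting code below is our illustration, not the authors' visualization pipeline.

```python
import torch

def collect_attended_maps(model, subbands: torch.Tensor) -> dict:
    """Return {stream index: attended feature maps A*_k} for one batch."""
    # Assumes the SubbandChannelAttention / SubbandAwareCNN sketches above.
    maps, hooks = {}, []
    for k, stream in enumerate(model.streams):
        for layer in stream:
            if isinstance(layer, SubbandChannelAttention):
                # Record the attention layer's output (4) for stream k.
                hooks.append(layer.register_forward_hook(
                    lambda mod, inp, out, k=k: maps.__setitem__(k, out.detach())))
    with torch.no_grad():
        model(subbands)
    for h in hooks:
        h.remove()
    return maps
```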
IV. CONCLUSION

In this paper, we propose a generic framework for cell-phone recognition based on the built-in microphones. The proposed approach fully exploits the significance of different subbands for the identification of cell-phones by learning feature representations from multiple streams. Introducing the attention model to filter out the more relevant features and incorporating noisy audio samples for data augmentation significantly enhance the robustness of cell-phone recognition under noisy conditions. The proposed scheme works favorably not only for recognizing cell-phones of different brands, but also for different models of the same brand. Finally, an insight shed by the visualization of the attended regions is that low-to-middle frequency subbands play an important role in identifying cell-phones of different brands, while the low and high frequency subbands are of greater significance for recognizing different models of cell-phones from the same manufacturer.
REFERENCES

[1] J. Lukáš, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Trans. Inf. Forensics Secur., vol. 1, no. 2, pp. 205–214, Jun. 2006.
[2] C.-T. Li, "Source camera identification using enhanced sensor pattern noise," IEEE Trans. Inf. Forensics Secur., vol. 5, no. 2, pp. 280–287, Jun. 2010.
[3] B. Luca et al., "Improving PRNU compression through preprocessing, quantization and coding," IEEE Trans. Inf. Forensics Secur., vol. 14, no. 3, pp. 608–620, 2019.
[4] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, "Digital audio forensics: A first practical evaluation on microphone and environment classification," in Proc. Workshop Multimedia Secur., 2007, pp. 63–74.
[5] R. Buchholz, C. Kraetzer, and J. Dittmann, "Microphone classification using Fourier coefficients," in Proc. Int. Workshop Inf. Hiding, 2009, pp. 235–246.
[6] C. Hanilçi, F. Ertaş, T. Ertaş, and Ö. Eskidere, "Recognition of brand and models of cell-phones from recorded speech signals," IEEE Trans. Inf. Forensics Secur., vol. 7, no. 2, pp. 625–634, 2012.
[7] C. Hanilçi and T. Kinnunen, "Source cell-phone recognition from recorded speech using non-speech segments," Digit. Signal Process., vol. 35, no. 12, pp. 75–85, 2014.
[8] L. Cuccovillo et al., "Audio tampering detection via microphone classification," in Proc. IEEE 15th Int. Workshop Multimedia Signal Process., 2013, pp. 177–182.
[9] D. Luo, P. Korus, and J. Huang, "Band energy difference for source attribution in audio forensics," IEEE Trans. Inf. Forensics Secur., vol. 13, no. 9, pp. 2179–2189, 2018.
[10] P. Bofill and M. Zibulevsky, "Underdetermined blind source separation using sparse representations," Signal Process., vol. 81, no. 11, pp. 2353–2362, 2001.
[11] L. Zhen et al., "Underdetermined mixing matrix estimation by exploiting sparsity of sources," Measurement, vol. 152, no. 2, pp. 1–6, 2020.
[12] D. Garcia and C. Espy, "Automatic acquisition device identification from speech recordings," J. Acoust. Soc. Amer., vol. 125, no. 4, 2009, Art. no. 2530.
[13] C. Kotropoulos and S. Samaras, "Mobile phone identification using recorded speech signals," in Proc. IEEE Int. Conf. Digit. Signal Process., 2014, pp. 586–591.
[14] L. Zou, Q. He, and X. Feng, "Cell phone verification from speech recordings using sparse representation," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 1787–1791.
[15] Y. Li, X. Zhang, X. Li, and J. Yang, "Mobile phone clustering from speech recordings using deep representation and spectral clustering," IEEE Trans. Inf. Forensics Secur., vol. 13, no. 4, pp. 965–977, 2018.
[16] G. Baldini, I. Amerini, and C. Gentile, "Microphone identification using convolutional neural networks," IEEE Sensors Lett., vol. 3, no. 7, Jul. 2019, Art. no. 6001504.
[17] V. Vinay, K. Preet, and K. Nitin, "CNN-based system for speaker independent cell-phone identification from recorded audio," in Proc. CVPR Workshops, 2019, pp. 53–61.
[18] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, "Attention to scale: Scale-aware semantic image segmentation," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2016, pp. 3640–3649.
[19] A. Vaswani et al., "Attention is all you need," in Proc. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[20] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Mar. 2016, pp. 20–25.
[21] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2018, pp. 7132–7141.