
Session: Video & Audio Steganography IH&MMSec ’19, July 3–5, 2019, Paris, France

Recurrent Convolutional Neural Networks for AMR Steganalysis


Based on Pulse Position
Chen Gong
State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 100093
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China 100093
gongchen@iie.ac.cn

Xiaowei Yi*
State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 100093
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China 100093
yixiaowei@iie.ac.cn

Xianfeng Zhao
State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 100093
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China 100093
zhaoxianfeng@iie.ac.cn

Yi Ma
Beijing Information Technology Institute, Beijing, China 100094
mayi5501@126.com
ABSTRACT
With the rapid development of streaming multimedia, adaptive multi-rate (AMR) audio steganography has been emerging recently. However, traditional steganalysis methods face great challenges in detecting short-duration speech at low embedding rates. To address this problem, we propose a steganalytic scheme, SRCNet, which combines a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN). AMR fixed codebook (FCB) steganography embeds messages by modifying the pulse positions, which destroys the FCB correlation. Firstly, we analyzed the FCB correlations at different distances and summarized these correlations into four categories. Furthermore, we utilize an RNN to extract higher-level contextual representations of FCBs and a CNN to fuse spatial-temporal features for the steganalysis. The proposed approach was evaluated on a public data-set. The experimental results validate that the proposed framework greatly outperforms the existing state-of-the-art methods. The correct detection rate of SRCNet is improved by at least 10% when the sample is as short as 100ms at the 20% embedding rate. In particular, the network achieves significant improvements in detecting the STC-based adaptive AMR steganography.

CCS CONCEPTS
• Computing methodologies → Learning latent representations; • Security and privacy → Authentication.

KEYWORDS
steganalysis, adaptive multi-rate, fixed codebook, pulse position, recurrent neural network, convolutional neural network

ACM Reference Format:
Chen Gong, Xiaowei Yi, Xianfeng Zhao, and Yi Ma. 2019. Recurrent Convolutional Neural Networks for AMR Steganalysis Based on Pulse Position. In ACM Information Hiding and Multimedia Security Workshop (IH&MMSec '19), July 3–5, 2019, Paris, France. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3335203.3335708

* Corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
IH&MMSec '19, July 3–5, 2019, Paris, France
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6821-6/19/06. . . $15.00
https://doi.org/10.1145/3335203.3335708

1 INTRODUCTION
Steganography is the technique of hiding secret messages in innocent-looking multimedia such as image, audio, and video. Steganalysis, as the counter-technology of steganography, aims to expose the presence of secret messages hidden in multimedia.

Recently, emerging audio steganographic algorithms [8, 20, 21, 28] in various audio domains have been proposed. Specifically, low-bit-rate compressed speech steganography can be classified into two categories: compression-independent methods and compression-dependent methods. The former directly modify the compressed speech stream after speech encoding [12, 14, 26]. The latter [7, 9, 16, 18, 20, 21, 27] integrate information hiding into the process of speech encoding and attract more interest for their high quality and strong security. Among the many compressed speech codecs, AMR has become a novel carrier for steganography owing to its widespread usage in mobile communication. There are three feasible embedding domains in the AMR codec, including linear predictive coding (LPC) [15, 27], fixed codebook


(FCB) [7, 16, 21] and pitch delay (PD) [9, 18, 20]. Among these, the FCB attracts more attention on account of its large embedding capacity and good imperceptibility. At the 12.2 kb/s mode, the AMR-NB encoder processes each 20 ms speech frame and produces a 244-bit code stream. The FCB takes up 140 bits, a significant proportion (140/244 = 57.38%) of the total frame bits.

The FCB is implemented by interleaved single-pulse permutation (ISPP) and adopts a non-exhaustive depth-first tree search; that is, there is still enough redundancy in the FCB. So the FCB is a most suitable match for information hiding: we can change the current FCB into another alternative FCB without degrading synthetic speech quality. The existing FCB-based steganography [7, 16, 21] comes from this idea. Geiser et al. [7] proposed a method for hiding digital data at a comparatively high rate in the AMR-NB speech codec; the hidden bits can be directly extracted from the bitstream. Miao et al. [16] further introduced an AMR-WB steganography realized by searching for a sub-optimal code-vector whose pulse combination meets an adaptive suboptimal-pulse-combination constraint (ASOPCC). Ren et al. [21] proposed an AMR FCB adaptive steganographic algorithm (AFA) based on minimizing embedding distortion; a content-aware distortion function is designed to effectively resist detection by existing steganalysis.
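The ISPP layout underlying all of these schemes can be made concrete with a short sketch (helper names are ours, not from the AMR reference implementation): each of the 40 sample positions in a 5 ms sub-frame belongs to the track given by its remainder modulo 5, and its 3-bit index within that track is the quotient.

```python
# Hypothetical helpers illustrating the ISPP track structure at AMR-NB 12.2 kb/s;
# the function names are ours, not taken from the AMR reference code.
def track_of(position: int) -> int:
    """Track (0-4) that a sample position 0-39 belongs to."""
    return position % 5

def position_index(position: int) -> int:
    """3-bit index (0-7) of the position within its track."""
    return position // 5

# Track 2 holds the interleaved positions 2, 7, 12, ..., 37.
print([p for p in range(40) if track_of(p) == 2])  # [2, 7, 12, 17, 22, 27, 32, 37]
```

Each track thus offers eight admissible positions per pulse, which is the redundancy the FCB-based schemes exploit.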
To detect FCB-based embedding schemes, several steganalysis studies combining well-designed hand-crafted features with a Support Vector Machine (SVM) have been proposed in recent years. Existing FCB steganography [7, 16, 21] destroys the distribution statistics of pulse positions. Ding et al. [4, 5] first presented histogram-based steganalysis features, including histogram flatness, the center of mass of the histogram characteristic function, and histogram variance. Miao et al. [17] introduced Markov transition probabilities and entropy to evaluate the interrelationships between adjacent pulses; specifically, the entropy-based features consist of joint entropy and conditional entropy. Ren et al. [19] designed a set of features based on the probability of pulse positions being the same in the same track. Tian et al. [24] expanded previous research further by employing the probability distributions of paired pulses, the Markov transition probabilities of paired pulses, and the joint probability matrices of pulse pairs, with feature dimension reduction by AdaBoost prior to training.

However, existing steganalysis methods are limited in real scenarios. Challenges remain in detecting live streaming audio and strongly secure steganography, especially for short-duration audio at low embedding rates. The high user-concurrency and fast-throughput nature of streaming media require that streaming-media steganalysis respond instantly with high accuracy. Moreover, steganographers may remain inactive for a long time and embed a message within a short time. A typical scenario of AMR stream steganalysis and steganography is represented in Figure 1. To address the limitations of current FCB steganalysis, we propose a data-driven recurrent convolutional steganalysis neural network (SRCNet).

Figure 1: A typical scenario of AMR stream steganalysis and steganography.

Our goal is to design a data-driven method that makes the steganalytic approach more generally applicable, independently of the specific embedding domain, such as pitch delay (PD) [8, 20, 22], since PD-based steganography and FCB-based steganography share the same idea of changing speech coding parameters. Therefore, pre-processing and prior feature selection are not adopted in our proposed method. For example, the effectiveness of high-pass filters has already been analyzed and proven in [3, 25]; however, from the perspective of deep learning, all parameters of the network should be learned automatically rather than constructed artificially. Based on this observation, the original decoded pulse positions are adopted as the input data of the network.

Additionally, a detailed analysis of FCB correlation at various frame levels was conducted by summarizing the correlations into four categories. The experimental results are of directive significance to the design of the network. We evaluated the different correlations using co-occurrence matrices, which can measure FCB correlation at different distances. Existing FCB steganalysis only considers the distributions of pulse pairs within or between sub-frames [4, 5, 17, 19]; obviously, current methods are thereby not complete enough to characterize the FCB.

We first propose a unified RNN-CNN framework for FCB steganalysis, which enables the model to automatically capture features not only in the time dimension but also


in the space dimension, inspired by previous research achievements [3, 13, 25]. The RNN and CNN play the roles of temporal analyst and feature extractor respectively. The RNN is exploited to model the sequential inputs of pulse positions. The CNN is used to construct the representation of the FCB, also accelerating convergence and decreasing the danger of overfitting. Specifically, the decoded pulse positions are fed into an RNN, which extracts temporal context features of the FCB. The outputs of the RNN are used as the input data of a CNN. The goal of the CNN is to extract the key FCB features from the low-level features to the greatest extent possible and to complete the speech classification.

The rest of the paper is organized as follows. In Section 2, related background is briefly introduced. Section 3 introduces the proposed end-to-end approach to FCB steganalysis. Several experimental evaluations and analyses are discussed in Section 4. Finally, Section 5 concludes with a summary and directions for future work.

2 BACKGROUND

2.1 AMR codec
AMR-WB and AMR-NB [1] work on the same principle of Algebraic Code Excited Linear Prediction (ACELP) [23], aiming to model the production of the human voice. The encoding diagram is shown in Figure 2. The encoder processes speech frames of 20ms as the basic unit, and each frame is segmented into four sub-frames of 5ms each. In each sub-frame, a linear prediction synthesis filter is used to synthesize the output signal by filtering the sum of the appropriate adaptive and fixed codebook contributions. The weighted error between the synthesized speech and the original speech is minimized using a so-called analysis-by-synthesis approach to determine the optimal excitation signal. Finally, the encoded parameters are obtained and transmitted through a public channel.

Figure 2: AMR optimal codebook search diagram.

The FCB search is the key part of AMR encoding; only a small subset of the optimal non-zero pulse positions is chosen and encoded. The FCB structure is based on the interleaved single pulse permutation (ISPP) design, and different coding modes have different codebook distributions. With an 8 kHz sampling rate, the 40 positions of each 5ms sub-frame are divided into five tracks of interleaved positions, with 8 positions in each track.

Table 1: Searching positions in each track at mode 12.2 kb/s. 𝑖𝑡 and 𝑖𝑡+5 represent the two pulses in the same track 𝑡 (0 ≤ 𝑡 ≤ 4).

Track   Pulses      Positions
0       𝑖0, 𝑖5      0, 5, 10, 15, 20, 25, 30, 35
1       𝑖1, 𝑖6      1, 6, 11, 16, 21, 26, 31, 36
2       𝑖2, 𝑖7      2, 7, 12, 17, 22, 27, 32, 37
3       𝑖3, 𝑖8      3, 8, 13, 18, 23, 28, 33, 38
4       𝑖4, 𝑖9      4, 9, 14, 19, 24, 29, 34, 39

The FCB is constructed by placing 2 non-zero-amplitude pulses in each track; each pulse can be placed on eight different positions, so in total 10 non-zero pulse positions are encoded per sub-frame. Generally speaking, the FCB search finds the optimal positions of the 10 pulses among the 40 candidate positions. The available searching positions in each track at the AMR-NB 12.2 kbit/s mode are given in Table 1.

The FCB search determines the optimal combination of the 10 pulses by the depth-first tree search method. Each pulse has 8 potential positions; if every position of all pulses were searched, (8 × 8)⁴ = 16777216 trials would be required to get the optimal one. In practice, sub-optimal algorithms are often adopted in consideration of the real-time requirement of voice communications. The FCB search algorithm of AMR adopts the depth-first tree search instead of the nested-loop focused search, reducing the number of loops to 4 × (4 × 8 × 8) = 1024. Different modes differ in the number of non-zero pulses and the corresponding search levels. We take the 12.2 kbit/s mode as an example to illustrate the search process, in combination with Figure 3. Briefly, there are two loop layers: the outer loop for pulse 𝑖1 iterates over the local maxima in each track, and the inner loop covers the 4 groups of paired pulses. Given a fixed 𝑖1 in the outer loop, the body of the outer loop is shown in Figure 3. There are six levels in the search tree. In addition, 𝑖0 and 𝑖1 do not need a search: at level 1, the position of pulse 𝑖0 is determined based on the global maximum of the reference signals in all tracks. Then the four outermost iterations are carried out to determine the position of pulse 𝑖1 at level 2; 𝑖1 is tentatively assigned to the local maximum within each of the four tracks other than the track occupied by the first pulse. Next, the paired pulses are searched in the last four levels: (𝑖2, 𝑖3), (𝑖4, 𝑖5), (𝑖6, 𝑖7), (𝑖8, 𝑖9). At level 3, pulses 𝑖2 and 𝑖3 are searched in their respective tracks: during the 8 × 8 nested-loop search, pulse 𝑖2 with all 8 of its admissible positions is tested together with pulse 𝑖3 with all 8 of its admissible positions, and the positions of 𝑖2 and 𝑖3 are determined once the current target signal reaches its maximum value. The number of tested combinations is 8 × 8 = 64. In the subsequent three levels, the search is the same as at level 3.

Figure 3: The overview of the AMR FCB search process for a fixed pulse 𝑖1 at 12.2 kbit/s mode. Assuming that 𝑖1, 𝑖2, 𝑖3, 𝑖4, 𝑖5, 𝑖6, 𝑖7, 𝑖8, 𝑖9 are in track 1, track 2, track 3, track 4, track 0, track 1, track 2, track 3, track 4 respectively. Paired pulses 𝑖𝑡, 𝑖𝑡+1 are searched together in an 8 × 8 loop. The color of each circle corresponds to a searchable position of one pulse.

Though the depth-first tree search reduces the computational complexity, the resulting fixed codebook vector is only locally optimal because the search range is restricted. But this is great for steganography: there are still redundancies in the fixed codebook search space, that is, minor modifications of the FCB cause no significant loss of speech quality.

2.2 FCB Steganography
This section first reviews the FCB structure for a better understanding of how FCB-based steganography is designed, taking the AMR 12.2 kb/s mode as an example.

The detailed allocation of the FCB bits is shown in Table 2. In each sub-frame, the 40 potential positions are divided into 5 tracks, and each pulse has 8 candidate positions. Once the pulse positions are determined, all pulse positions are encoded with a Gray code to enhance robustness. The two pulse positions in one track are encoded with 6 bits (a total of 30 bits, 3 bits for the position of each pulse). For the two pulses in the same track, one additional sign bit is needed. This sign bit indicates the sign of the first pulse, and the sign of the second pulse depends on the first: if the position of the second pulse is smaller than that of the first pulse, it has the opposite sign, otherwise it has the same sign. So the FCB takes up 35 bits per sub-frame. Typically, FCB-based steganography takes the two pulses in the same track as the basic processing unit.

Table 2: FCB bits allocation in the first sub-frame

Bits allocation   Content description
S52               sign information for 𝑖0 and 𝑖5 pulses
S53–S55           position of 𝑖0 pulse
S56               sign information for 𝑖1 and 𝑖6 pulses
S57–S59           position of 𝑖1 pulse
S60               sign information for 𝑖2 and 𝑖7 pulses
S61–S63           position of 𝑖2 pulse
S64               sign information for 𝑖3 and 𝑖8 pulses
S65–S67           position of 𝑖3 pulse
S68               sign information for 𝑖4 and 𝑖9 pulses
S69–S71           position of 𝑖4 pulse
S72–S74           position of 𝑖5 pulse
S75–S77           position of 𝑖6 pulse
S78–S80           position of 𝑖7 pulse
S81–S83           position of 𝑖8 pulse
S84–S86           position of 𝑖9 pulse
S87–S91           fixed codebook gain

The FCBs are searched by the depth-first tree; that is, only a small subset of the available positions is examined, so the final FCBs may not be optimal. Based on this redundancy of the FCB, it is possible to embed messages by modifying a pulse position without degrading speech quality.

The existing methods for AMR audio steganography [7, 16, 21] share the same idea: they embed secret messages in the last pulse position of each track by restricting the search space for the pulse position.

Geiser [7] first proposed a steganographic FCB strategy for the AMR-NB 12.2 kbit/s mode. For track 𝑡 (0 ≤ 𝑡 ≤ 4),


assume 𝑖𝑡 is the first pulse position and 𝑖𝑡+5 is the second pulse position in this track. (𝑚)2𝑘,2𝑘+1 denotes the binary 2-bit message to be embedded in track 𝑡. The two possible positions for 𝑖𝑡+5 depend on 𝑖𝑡 and the message (𝑚)2𝑘,2𝑘+1, and are calculated as shown in Equation 1:

𝑖𝑡+5 ∈ { 𝑔𝑟𝑎𝑦⁻¹(𝑔𝑟𝑎𝑦(⌊𝑖𝑡/5⌋) ⊕ (𝑚)0,1) · 5 + 𝑡,  𝑔𝑟𝑎𝑦⁻¹(𝑔𝑟𝑎𝑦(⌊𝑖𝑡/5⌋) ⊕ (𝑚)0,1 + 4) · 5 + 𝑡 }    (1)

where 𝑔𝑟𝑎𝑦 and 𝑔𝑟𝑎𝑦⁻¹ are respectively the Gray encoding and the Gray decoding by table lookup, ⊕ is the bitwise exclusive-or of two binary strings, and ⌊𝑥⌋ = max{𝑛 ∈ ℤ | 𝑛 ≤ 𝑥} rounds 𝑥 down. Additionally, the 3-bit pulse position index in track 𝑡 is obtained by ⌊𝑖𝑡/5⌋.

After obtaining the pulse positions of each sub-frame, the extraction of the 2-bit hidden message (𝑚)2𝑘,2𝑘+1 (𝑘 ∈ [0, 4]) [7] is performed by Equation 2:

(𝑚)2𝑘,2𝑘+1 = (𝑔𝑟𝑎𝑦(⌊𝑖𝑡/5⌋) ⊕ 𝑔𝑟𝑎𝑦(⌊𝑖𝑡+5/5⌋)) % 4    (2)

Miao et al. [16] further introduced an adaptive suboptimal pulse combination constrained (ASOPCC) method in the AMR-WB speech codec. Similar in principle to [7], it incorporates message embedding during the FCB search by controlling the pulse positions in the same track. Different from [7], Miao introduces an embedding factor 𝜂 to achieve a better trade-off between speech quality and embedding capacity. For the second pulse position 𝑖𝑡+5 in track 𝑡, the search space should satisfy Equation 3:

𝑚𝑡 = (∑_{𝑖=0}^{𝑃𝑡} 𝑔𝑟𝑎𝑦(⌊𝑃𝑡𝑖/𝑁⌋)) ⊕ 𝜂    (3)

where 𝑔𝑟𝑎𝑦 is the encoding operation, 𝑁 is the total number of tracks, 𝜂 is the dynamic factor that controls the embedded bits, 𝑡 is the track index, 𝑃𝑡𝑖 is the 𝑖-th pulse position in track 𝑡, 𝑃𝑡 is the number of pulses in track 𝑡, and 𝑚𝑡 denotes the secret data to be embedded. The embedding principle is to control the second pulse position in track 𝑡. From Equation 3, ⌊log₂ 𝜂⌋ + 1 bits of data can be embedded per track. According to [16], 𝜂 is usually set to 1, 2, 3 or 4 at the AMR 12.2 kb/s mode. The extraction of the message is obtained by Equation 3 as well.

Ren et al. [21] proposed an adaptive steganographic scheme (AFA) implemented with [6]. The key contribution of the scheme is the design of the cost function and the additive distortion function in the FCB embedding domain. The optimal probability of a pulse and the pulse correlation in the same track are incorporated into the cost function to resist detection by existing steganalysis. The optimal probability of the pulse is calculated as Equation 4:

𝑃𝑓,𝑘 = (∑_{𝑓=1}^{𝑁𝑓} 𝑝𝑓,𝑘) / 𝑁𝑓    (4)

where 𝑁𝑓 is the total number of sub-frames, 𝑓 is the index of a sub-frame within a frame, and 𝑘 is the index of the pulse in a sub-frame (0 ≤ 𝑘 ≤ 9). If the current pulse 𝑖𝑘 is optimal with the global maximum of the reference signals in all tracks, 𝑝𝑓,𝑘 = 1, otherwise 𝑝𝑓,𝑘 = 0. The pulse correlation in the same track of each sub-frame is calculated by Equations 5 and 6:

𝑀𝜇,𝜈(𝑡) = 𝑃(𝑖𝑡 = 𝜇, 𝑖𝑡+5 = 𝜈 ‖ 𝑖𝑡 = 𝜈, 𝑖𝑡+5 = 𝜇)    (5)

𝑀𝑡(𝜇, 𝜈) = (∑_{𝑓=1}^{𝑁𝑓} 𝑀𝜇,𝜈(𝑡)) / 𝑁𝑓    (6)

where 𝑓 is the index of a sub-frame within a frame and 𝜇, 𝜈 are pulse position labels. If the first pulse position 𝑖𝑡 is 𝜇 and the second pulse position 𝑖𝑡+5 is 𝜈 in the same track (or vice versa), then 𝑃(𝑖𝑡 = 𝜇, 𝑖𝑡+5 = 𝜈 ‖ 𝑖𝑡 = 𝜈, 𝑖𝑡+5 = 𝜇) = 1, otherwise it is 0.

Figure 4: The flow chart of the proposed SRCNet neural network for FCB steganalysis.
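As a concrete illustration of Equations 1 and 2, the following sketch (our own simplification with a 3-bit reflected Gray code, not the reference implementation of [7]) embeds a 2-bit message into the second pulse of one track and recovers it again. The function names and the use of XOR for the "+4" branch are our assumptions.

```python
# Simplified sketch of Geiser-style FCB embedding (Eqs. 1-2); our own code, not [7]'s.
GRAY = [0, 1, 3, 2, 6, 7, 5, 4]            # 3-bit reflected Gray code
GRAY_INV = {g: i for i, g in enumerate(GRAY)}

def embed_candidates(i_t, m, t):
    """Two admissible second-pulse positions in track t carrying message m (0-3)."""
    g = GRAY[i_t // 5] ^ m                 # gray(position index of first pulse) XOR message
    return (GRAY_INV[g] * 5 + t,           # first admissible position
            GRAY_INV[g ^ 4] * 5 + t)       # second one, differing in the top Gray bit

def extract(i_t, i_t5):
    """Recover the 2-bit message from the two pulse positions (Eq. 2)."""
    return (GRAY[i_t // 5] ^ GRAY[i_t5 // 5]) % 4

t, i_t, m = 2, 17, 3                       # first pulse at position 17 (track 2), message 0b11
for cand in embed_candidates(i_t, m, t):
    assert cand % 5 == t                   # both candidates stay in the same track
    assert extract(i_t, cand) == m         # and both decode to the embedded message
```

The modulo-4 in the extraction discards the top Gray bit, which is why the encoder is free to pick either of the two candidate positions, preserving some of the original search freedom.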


3 PROPOSED METHOD

Existing FCB-based steganography destroys the strong correlation among the pulse positions by introducing information hiding. Consequently, the modified correlation can be regarded as an indicator of steganography and used for steganalysis. In our proposed method, the decoded pulse position matrix serves as the input data of the network. A bidirectional long short-term memory (BLSTM) network is introduced to analyze the sequential patterns of the pulse positions in the temporal dimension. Then, a convolutional neural network is used to capture the most important features in the spatial dimension. The whole flow of the proposed method is illustrated in Figure 4. In the following paragraphs, we introduce the details of each step of the network.

3.1 Pulse Position Matrix (Step I)
First we explain why the pulse positions are chosen as the input data. Figure 5 shows an example of the ten pulses in one sub-frame at AMR-NB 12.2 kb/s; for better understanding, we list each pulse position and its corresponding position index.

Figure 5: An example of ten pulse positions in a sub-frame. 𝑖𝑡 and 𝑖𝑡+5 are the first and the second pulse position in track 𝑡 respectively. For the pulse 𝑖𝑡, its pulse position index is calculated by ⌊𝑖𝑡/5⌋ and its track index by 𝑖𝑡 % 5.

The state-of-the-art steganalysis algorithms [17, 19, 24] share the same principle: extracting correlation features based on paired pulse positions and then feeding the features to SVM classifiers. [17] presented the Markov transition probability and joint/conditional entropy method (MTJCE); [19] is based on the statistical probability of two pulses occupying the same position in the same track (SPP); [24] extended the former methods by considering the probability distributions of pulse pairs, Markov transition probabilities and track-to-track correlation (PMT). Actually, [17, 19] only emphasize the position index of the paired pulses in the same track but neglect the useful information in the original pulse position, which better reflects the FCB correlations. As shown in Figure 9, the FCB also has strong correlations at different distance levels, while the methods [17, 19, 24] only consider FCB statistical features within one frame or between two successive frames.

For an AMR encoded speech signal which contains 𝑇 sub-frames, the pulse position matrix 𝐹 is denoted as

      ⎡ 𝑓0,1 ⋯ 𝑓0,𝑗 ⋯ 𝑓0,𝑇 ⎤
      ⎢  ⋮      ⋮      ⋮  ⎥
𝐹 =   ⎢ 𝑓𝑖,1 ⋯ 𝑓𝑖,𝑗 ⋯ 𝑓𝑖,𝑇 ⎥    (7)
      ⎢  ⋮      ⋮      ⋮  ⎥
      ⎣ 𝑓9,1 ⋯ 𝑓9,𝑗 ⋯ 𝑓9,𝑇 ⎦

where 𝑓𝑖,𝑗 stands for the 𝑖-th pulse position in the 𝑗-th sub-frame.

3.2 Bidirectional LSTM (Step II)
RNNs can model sequence tasks well. In the ACELP model, the adaptive codebook and the fixed codebook play a very important role in modeling the excitation signal. The role of the adaptive filter is to remove long-term correlation from the voice residual signal; after the long-term correlation is removed, the voice residual signal becomes similar to white noise. Therefore, the fixed codebook possesses timing characteristics. The LSTM is a more advanced version of the RNN, proposed to tackle the gradient explosion and gradient vanishing problems of RNNs by introducing a memory cell that is able to preserve state over long periods of time. The LSTM converts the input data into vectors which form a matrix with two dimensions, the time-step dimension and the feature dimension, which is updated in the process of learning the feature representation. In a bidirectional LSTM, the output depends not only on the previous frames in the sequence but also on the upcoming frames: one LSTM goes in the forward direction and another in the backward direction, and the combined output is computed based on the hidden states of both LSTMs.

Table 3: The detection accuracy of each variant network (AFA, 12.2 kb/s mode, 100% embedding rate, 100ms Chinese samples).

ID   Model setting                             Accuracy (%)
a    The proposed network (BLSTM + GAP)        83.21
b    LSTM                                      81.86
c    LSTM without return sequences             81.42
d    LSTM + LSTM                               81.57
e    LSTM + LSTM without return sequences      82.07
f    BLSTM                                     82.21
g    BLSTM without return sequences            82.50
h    BLSTM + LSTM                              82.03
i    BLSTM + LSTM without return sequences     82.38
j    BLSTM + Global Max Pooling (GMP)          81.33

Actually, we designed multiple sets of comparative experiments to determine the LSTM structure. Table 3 records the results of ten different model settings. Specifically, the training set contains 32,000 stego segments and 32,000 cover segments, and the testing set contains 2,000 stego segments and 2,000 cover segments. We ran each test for 30 epochs. The comparison provides some intuitive guidance for the design of the model structure.

The adaptive FCB steganography AFA [21] uses STCs [6] to search globally for the optimal pulse positions, and a BLSTM has the advantage of modeling the sequence in a global scope: for each pulse, the network takes information not only from the past pulses but also from the future pulses. As shown in Table 3, the results closely coincided with our analysis.
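The winning variant in Table 3 (BLSTM with return sequences, followed by global average pooling and a sigmoid classifier) can be sketched in a few lines of Keras, the framework named in Section 4.2. This is our own minimal sketch, not the authors' released code: the layer sizes follow the text (10-dimensional input, 50 hidden units), but the convolutional part of the full SRCNet is omitted here.

```python
# Minimal sketch (ours) of Table 3 variant (a): BLSTM + GAP + sigmoid.
import numpy as np
from tensorflow.keras import layers, models

def build_blstm_gap_head(feature_dim=10, hidden_units=50):
    # Variable-length input: one 10-pulse vector per sub-frame (Eq. 7 matrix).
    inp = layers.Input(shape=(None, feature_dim))
    # return_sequences=True keeps one output per time step for pooling.
    x = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True))(inp)
    # (The full SRCNet adds convolutional layers here; omitted in this sketch.)
    x = layers.GlobalAveragePooling1D()(x)   # length-independent, parameter-free
    out = layers.Dense(1, activation="sigmoid")(x)  # cover vs. stego
    return models.Model(inp, out)

model = build_blstm_gap_head()
model.compile(optimizer="adam", loss="binary_crossentropy")
# The same model accepts 20 sub-frames (100 ms) or any other duration.
print(model.predict(np.zeros((2, 20, 10)), verbose=0).shape)  # (2, 1)
```

Because the pooling collapses the time axis, the same trained weights apply to segments of any duration, which is the point of Step III below.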

3.3 Global Average Pooling (Step III)
The pulse position matrix 𝐹 grows large over time. Increasing the time length slows down both the training and the testing of the model. At the same time, more parameters have to be learned, which raises the possibility of overfitting. Moreover, the precision and applicability of the trained model are limited if the model depends on the length of the input audio. In order to handle this varying size, we use a global average pooling (GAP) [11] layer to replace the last fully connected layer. Figure 6 is an illustration of GAP. GAP is more computationally efficient, capable of taking input of any size, and parameter-free. At the end of the net, global average pooling is performed and then a sigmoid classifier is attached. Figure 7 illustrates the architecture of SRCNet with 100ms speech in a simplified sequence flow diagram.

Figure 6: An illustration of global average pooling (GAP). The input shape is (batch, steps, features), the output shape is (batch, features).

Figure 7: The structure of the SRCNet network. Here, the input audio is 100ms (20 sub-frames), the number of time steps is 20 and the feature dimension is 10 (10 pulses). The number of hidden units in the LSTM is 50, and the batch size is 32.

4 EXPERIMENTS
To evaluate the detection performance of the proposed SRCNet steganalysis method, several experiments are conducted, and the results are compared with state-of-the-art FCB steganalysis methods to verify the validity of the proposed model. This section starts with an introduction of the data-set used in this work, followed by the structure and the training details of the model. Then, we evaluate the importance of four kinds of FCB correlations. Finally, we compare the performance of the proposed model and other state-of-the-art FCB steganalysis models under different conditions, including different embedding rates, different durations and so on.

4.1 Setup
The proposed SRCNet was primarily evaluated and contrasted on the speech data-set published by Lin et al. [13]. This data-set contains 41 hours of Chinese speech and 72 hours of English speech, including different male and female speakers. As explained in Section 1, a steganalysis algorithm should be able to handle short samples; in order to test steganalysis performance on samples of different lengths, the original speech samples have been cut into 0.1s, 0.2s, . . . , 10s segments. Segments of the same length are successive and non-overlapping. All the segments are converted from their original WAV format to PCM format with mono, 8 kHz, 16-bit quantization by FFmpeg¹. These segments are used to generate the cover segment data-set and the stego segment data-set respectively. Three methods are involved in the steganographic experiments: Geiser [7], Miao [16] at the modes of 𝜂 = 1, 2 and 3 respectively, and one adaptive steganography, AFA [21], implemented by the STCs (with constraint height h = 7). In the experiments, we use the RBR (relative embedding rate) to denote the embedding ratio, which represents the ratio of the length of the message m to the length of the cover audio n. RBRs of 20%, 40%, . . . , 100% are used to generate the corresponding stego samples.

¹ FFmpeg is a leading multimedia framework (available at: https://www.ffmpeg.org/)

The classification accuracy 𝑃𝐴 is used to measure the detection performance; it is defined as the proportion of the tested AMR audio correctly classified for each of the categories. The calculation of 𝑃𝐴 is shown in (8):

𝑃𝐴 = (𝑃𝑇𝑃 + 𝑃𝑇𝑁) / 2    (8)

where 𝑃𝑇𝑃 is the probability of the true positives correctly distinguished, that is, the stego samples correctly distinguished, and 𝑃𝑇𝑁 is the probability of the true negatives correctly distinguished, that is, the cover samples correctly distinguished. 𝑃𝐴 shows the accuracy rate of the steganalysis scheme used to detect the tested steganography schemes. Larger

values of 𝑃𝐴 correspond to better steganalysis and thus more effective detectability but lower security.

For each experiment on SRCNet, 32,000 samples were randomly chosen for training and the remaining 8,000 samples were used for testing; we applied these parameters to all later experiments. The cover samples and the stego samples are paired, that is, the ratio of cover to stego in all experiments is half and half. Besides that, each experiment picked up the cover and stego samples according to the required language, embedding rate and sample length.
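On such a balanced cover/stego split, the metric of Equation 8 reduces to averaging the hit rates on the two halves of the test set; a small sketch (ours) of the computation:

```python
# Balanced detection accuracy P_A (Eq. 8) over a paired cover/stego test set; our sketch.
def balanced_accuracy(stego_predictions, cover_predictions):
    """Each list holds 1 for 'classified as stego', 0 for 'classified as cover'."""
    p_tp = sum(stego_predictions) / len(stego_predictions)       # stego samples caught
    p_tn = cover_predictions.count(0) / len(cover_predictions)   # cover samples cleared
    return (p_tp + p_tn) / 2

# 3 of 4 stego and 2 of 4 cover samples correct -> (0.75 + 0.5) / 2
print(balanced_accuracy([1, 1, 1, 0], [0, 0, 1, 1]))  # 0.625
```

Averaging the two rates, rather than pooling all predictions, keeps the metric at 0.5 for a classifier that labels everything as one class, even if the halves were unequal in size.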
In order to compare SRCNet with other methods, we also conducted comparison tasks on two state-of-the-art FCB steganalysis methods: Fast-SPP [19] and MTJCE [17]. We used soft-margin support vector machines (C-SVM) [2] with a Gaussian kernel and default parameters as the classifiers. Not all samples of stego and cover speech can be used directly in the SVM: because the SVM has quadratic time complexity, high-dimensional features would not only cause huge computational costs but also induce overfitting. According to the experimental settings in [19], for each test on Fast-SPP and MTJCE we randomly picked 2,000 samples from the stego data-set and 2,000 samples from the cover data-set. The performance is evaluated using the median value of 𝑃𝐴 over ten random 2000/2000 splits of both the training and testing sets.

All code used to produce the results in this paper, including the data-sets and other relevant code, is available from https://github.com/VOIPsteganalysis.

4.2 Training Details
Specifically, all experiments on SRCNet are based on Keras 2.1.2 + CUDA 8.0.44 + CuDNN 5.1.10 + Python 3.6.3. For both network architectures we used the binary cross-entropy loss as the loss function and employed the Adam optimizer [10] with the default parameters. The training database was shuffled after each epoch. On our data-set, training was run for 30 epochs, the mini-batch size is 32, the dimension of the input data is 10, and the number of hidden units of the LSTM is 50. The experiments in this paper were performed on a GeForce GTX TITAN X. In addition, finer tuning, such as using a different number of hidden units in the LSTM layer or a deeper network, could further improve the performance.

4.3 FCB Correlation Analysis
As analyzed above, current steganalysis methods have limitations: they only consider the FCB statistical features within one frame or between two successive frames. However, speech signals are highly correlated over long time intervals. The current pulse position is not only determined by the previous codeword but also influenced by pulses that appeared long before. In this part, we summarize four correlations of the FCB. Figure 8 illustrates the four kinds of correlations:

[Figure 8: FCB correlations at different sub-frames.]

* Inter-frame level correlation. This correlation depicts the correlation among all pulses within a sub-frame. There are 10 pulses distributed into 5 tracks in a sub-frame at 12.2 kb/s mode. For example, Miao et al. [17] employed transition probabilities of 2 pulses in the same track to model the intra-frame based correlation.
* Intra-frame level correlation. The phoneme is the basic unit of human speech; one word is usually composed of several discrete units. AMR-NB operates on short time frames (20 ms), which is comparable to the length of a phoneme in a word. Since successive phonemes in a word are correlated, the successive frames in the coding stream are also correlated. Therefore, we analyse successive frames, that is, inter-frame based correlation.
* Phoneme level correlation. Multiple phonemes exist in a word, and different words have different phoneme transition patterns. Therefore, the current phoneme cannot be fully determined by the previous phoneme; instead, we should take all previously appeared phonemes in the word into consideration. Cross-frame correlation means the correlations between nonadjacent codewords in a word.
* Word level correlation. The goal here is to analyse the correlation that depends on the context. It is obvious that words in a sentence are highly correlated with each other. AMR encoded streams are essentially generated from sentences; therefore, their corresponding FCBs are also correlated. In other words, an FCB from a word is determined not only by other FCBs from the same word but also by FCBs from other words in the whole context.

Therefore, we perform a statistical analysis of the pulse joint occurrence matrix (POM) of paired pulses in different tracks at different distances throughout the whole audio frames, for both cover and stego AMR audio. First we clarify what codeword correlation is. Assuming that the total number of sub-frames is 𝑇, each sub-frame contains a number of pulse positions 𝑁, and the number of tracks for the current coding mode is 𝐿, 𝑖^𝜇_{𝑚,𝑘} is defined as the 𝑚-th (𝑚 ∈ [0, 𝑁−1]) pulse in track 𝜇 (𝜇 ∈ [0, 𝐿−1]) at sub-frame 𝑘 (𝑘 ∈ [0, 𝑇−1]). Therefore, the POM can be formulated as Equation 9:

𝑃𝑂𝑀(𝛼, 𝛽) = 𝑃(𝑖^𝜇_{𝑚,𝑘} = 𝛼, 𝑗^𝜈_{𝑛,𝑙} = 𝛽)    (9)

where 𝛼 and 𝛽 are the admissible positions and satisfy 0 ≤ 𝛼, 𝛽 ≤ 𝑁 − 1.
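Equation 9 can be estimated empirically by counting co-occurrences of pulse-position pairs at a fixed sub-frame offset 𝑘 − 𝑙. A minimal sketch, assuming pulse positions are stored as `pulses[k][track][m]` (the function name and data layout are our own, not the paper's):

```python
def estimate_pom(pulses, mu, nu, m, n, offset, n_positions):
    """Empirical POM(alpha, beta) = P(i^mu_{m,k} = alpha, j^nu_{n,l} = beta)
    with k = l + offset, estimated over all admissible sub-frame pairs.
    pulses[k][track][m] is the m-th pulse position (an int in
    [0, n_positions - 1]) in the given track at sub-frame k."""
    pom = [[0.0] * n_positions for _ in range(n_positions)]
    total = 0
    for l in range(len(pulses) - offset):
        alpha = pulses[l + offset][mu][m]
        beta = pulses[l][nu][n]
        pom[alpha][beta] += 1
        total += 1
    # normalise joint counts to joint probabilities
    return [[count / total for count in row] for row in pom]
```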
In order to explain the importance of these correlations, some experiments were designed. We collected an AMR stream with 300,000 sub-frames for both cover and stego audio at the AMR-NB 12.2 kbit/s mode and evaluated the codeword correlations according to Equation 9. The steganography algorithm is Geiser's method [7] at an embedding bit rate of 100%. The parameters in the 𝑃𝑂𝑀 experiments are set as follows:

* For inter-frame level correlation, we set 𝜇 = 1, 𝜈 = 2, 𝑚 = 1, 𝑛 = 2, 𝑘 = 𝑙. The two pulses are in the same sub-frame: the former is the first non-zero pulse position in track 1, the latter is the second non-zero pulse position in track 2.
* For intra-frame level correlation, we set 𝜇 = 1, 𝜈 = 1, 𝑚 = 1, 𝑛 = 2, 𝑘 = 𝑙 + 4. The two pulses are in successive frames (one frame contains 4 sub-frames): the former is the first non-zero pulse position and the latter is the second non-zero pulse position, both in track 1.
* For phoneme level correlation, we set 𝜇 = 1, 𝜈 = 1, 𝑚 = 1, 𝑛 = 2, 𝑘 = 𝑙 + 8. The fixed distance between the two pulses is 2 frames: the former is the first non-zero pulse position and the latter is the second non-zero pulse position, both in track 1.
* For word level correlation, we set 𝜇 = 1, 𝜈 = 1, 𝑚 = 1, 𝑛 = 2, 𝑘 = 𝑙 + 1600. The fixed distance between the two pulses is 400 frames: the former is the first non-zero pulse position and the latter is the second non-zero pulse position, both in track 1.

Figure 9 shows the analysis of the four kinds of FCB correlations discussed above. The horizontal axis presents the first admissible non-zero pulse position, the vertical axis presents the second admissible non-zero pulse position, and the depth of the block color indicates the value of 𝑃𝑂𝑀(𝛼, 𝛽). As the figure shows, the embedding operation destroys the pulse position statistical distribution in all four FCB correlations. In this example, the successive-frame correlation is the strongest one; intra-frame correlation and cross-frame correlation are comparable with each other; cross-word correlation is the weakest. Moreover, even though the cross-word correlation is the weakest, it can still provide classification clues. SRCNet has the potential to consider all four correlations at the same time, and is therefore more likely to achieve better results.

[Figure 9: Distribution of the two pulse positions at the four correlation levels, estimated from 300,000 sub-frames; panels (a)–(h) show cover and stego audio at the inter-frame, intra-frame, phoneme, and word levels. For the limited space, only two digits after the decimal point are displayed.]
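For reference, the four parameter settings above can be collected into a single configuration table (this representation is ours; `offset` denotes the sub-frame distance 𝑘 − 𝑙, and one AMR-NB frame contains 4 sub-frames):

```python
# (mu, nu): tracks of the two pulses; (m, n): pulse indices within the
# track; offset: sub-frame distance k - l used when pairing pulses.
POM_SETTINGS = {
    "inter_frame": {"mu": 1, "nu": 2, "m": 1, "n": 2, "offset": 0},     # same sub-frame
    "intra_frame": {"mu": 1, "nu": 1, "m": 1, "n": 2, "offset": 4},     # successive frames
    "phoneme":     {"mu": 1, "nu": 1, "m": 1, "n": 2, "offset": 8},     # 2 frames apart
    "word":        {"mu": 1, "nu": 1, "m": 1, "n": 2, "offset": 1600},  # 400 frames apart
}

# The same distances expressed in frames (4 sub-frames per frame):
frame_distance = {name: s["offset"] // 4 for name, s in POM_SETTINGS.items()}
```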
4.4 Experiments Under Different Embedding Rates
Existing handcrafted-feature-based steganalysis at low embedding rates is far from practical application. We tested the detection performance of each steganalysis method at different embedding rates (20% to 100% with a step size of 20%) when the length of the speech samples is 100 ms. Different languages are tested separately. The results are shown in Table 4.

Table 4: Detection results (%) of 100 ms samples under different embedding rates and languages at 12.2 kb/s mode.

Language  Steganalysis Scheme  Steganography Scheme  |  Embedding Rate (RBR): 0.2  0.4  0.6  0.8  1.0
AFA [21] 53.08 58.12 62.44 69.12 74.13
Geiser [7] 80.71 92.45 96.72 98.81 99.77
Fast-SPP [19] Miao (𝜂=1) [16] 62.04 69.66 75.07 79.11 84.61
Miao (𝜂=2) [16] 61.40 69.75 76.28 81.78 85.36
Miao (𝜂=3) [16] 61.34 70.08 76.60 81.71 85.43
AFA 53.11 57.75 63.29 73.37 81.85
Geiser 88.31 98.52 99.89 100.00 100.00
Chinese MTJCE [17] Miao (𝜂=1) 63.02 75.34 83.88 91.07 95.53
Miao (𝜂=2) 60.37 69.20 77.24 82.93 88.47
Miao (𝜂=3) 60.56 70.55 77.28 83.65 88.38
AFA 51.40 56.15 63.85 74.73 83.21
Geiser 99.89 99.99 100.00 100.00 100.00
SRCNet Miao (𝜂=1) 80.48 88.65 93.60 96.49 97.98
Miao (𝜂=2) 68.34 74.71 79.24 83.49 87.78
Miao (𝜂=3) 68.21 75.04 79.33 83.19 86.21
AFA 52.90 56.10 60.48 65.19 71.09
Geiser 77.82 89.82 95.91 98.40 99.62
Fast-SPP Miao (𝜂=1) 60.58 67.44 71.66 76.78 81.92
Miao (𝜂=2) 58.55 65.62 71.77 77.48 81.60
Miao (𝜂=3) 58.78 65.24 72.24 76.71 80.74
AFA 52.74 56.08 60.97 70.05 78.19
Geiser 87.82 97.94 99.84 100.00 100.00
English MTJCE Miao (𝜂=1) 60.95 72.35 81.94 89.24 94.43
Miao (𝜂=2) 58.58 66.40 73.73 80.79 86.00
Miao (𝜂=3) 58.55 66.32 73.66 80.08 85.25
AFA 52.85 56.95 62.58 74.18 82.13
Geiser 99.92 100.00 100.00 100.00 100.00
SRCNet Miao (𝜂=1) 80.80 89.55 93.89 97.06 98.25
Miao (𝜂=2) 68.80 76.08 82.41 85.83 88.29
Miao (𝜂=3) 71.08 75.38 83.16 85.29 89.48

As the table shows, the accuracy of each steganalysis method increases with the embedding rate. The reason is obvious: a higher embedding rate causes more impact on the FCB correlations, so the difference between the cover correlation and the stego correlation is more obvious, which greatly improves the detection efficiency.

For different languages, the results differ slightly. SRCNet obtains better detection for English speech samples than for Chinese speech samples. As the embedding rate increases, the difference between Chinese and English decreases gradually. One possible explanation for these results is the characteristic difference of the speech signal itself between languages. The Chinese language is more complicated than English: English consists of 20 vowels and 28 consonants, while Chinese has 412 kinds of syllables. From the perspective of deep learning, Chinese is more difficult to model than English; therefore, it is more difficult to detect Chinese speech steganography.

We also compare the results with Fast-SPP and MTJCE. For English speech, SRCNet is obviously better than both Fast-SPP and MTJCE. For Chinese speech, SRCNet has better accuracy than MTJCE except for the cases of Miao (𝜂=2) at 1.0 RBR and Miao (𝜂=3) at 0.8 and 1.0 RBR, where SRCNet is close to MTJCE.

These results indicate that SRCNet provides competitive detection on low-embedding-rate samples compared with other state-of-the-art methods. Fast-SPP and MTJCE only analyze the intra-frame and adjacent-frame FCB correlations, so the available information is limited when the sample is short. However, SRCNet has


Table 5: Detection results (%) for samples of different time lengths under different embedding rates for English at 12.2 kb/s mode.

Embedding Rate (RBR)  Steganalysis Scheme  Steganography Scheme  |  Sample Length (ms): 200  400  600  800  1000
AFA [21] 81.13 88.18 93.42 96.15 96.39
Geiser [7] 99.99 99.99 100.00 100.00 100.00
Fast-SPP [19] Miao (𝜂=1) [16] 90.68 96.61 98.65 99.47 99.69
Miao (𝜂=2) [16] 89.35 95.10 98.30 99.09 99.36
Miao (𝜂=3) [16] 89.42 95.23 98.24 99.10 99.26
AFA 89.08 96.04 98.83 99.36 99.66
Geiser 100.00 100.00 100.00 100.00 100.00
100% MTJCE [17] Miao (𝜂=1) 98.93 99.95 99.99 100.00 100.00
Miao (𝜂=2) 93.70 98.38 99.55 99.87 99.91
Miao (𝜂=3) 93.61 98.37 99.52 99.10 99.91
AFA 91.99 97.40 98.96 99.66 99.74
Geiser 100.00 100.00 100.00 100.00 100.00
SRCNet Miao (𝜂=1) 98.23 100.00 100.00 100.00 100.00
Miao (𝜂=2) 94.83 98.23 99.51 99.85 99.93
Miao (𝜂=3) 94.70 98.40 99.55 99.79 99.91
AFA 52.26 53.21 54.16 55.26 54.99
Geiser 74.07 82.63 88.23 92.12 92.50
Fast-SPP Miao (𝜂=1) 59.29 62.97 66.89 70.94 69.86
Miao (𝜂=2) 57.90 61.55 66.72 70.11 69.19
Miao (𝜂=3) 57.63 61.91 66.90 69.68 68.73
AFA 51.74 52.92 54.38 55.55 55.61
Geiser 81.13 89.83 94.26 97.12 97.61
10% MTJCE Miao (𝜂=1) 59.61 65.29 71.75 75.97 76.12
Miao (𝜂=2) 56.46 60.59 66.74 70.13 69.86
Miao (𝜂=3) 56.45 60.92 66.86 69.68 69.48
AFA 51.70 50.56 50.95 50.89 50.69
Geiser 99.94 100.00 100.00 100.00 100.00
SRCNet Miao (𝜂=1) 81.45 89.65 93.50 95.85 96.80
Miao (𝜂=2) 69.65 76.46 78.49 81.95 85.44
Miao (𝜂=3) 69.04 75.16 79.56 81.50 83.39

enough capacity to comprehensively analyze the FCB correlations at different frame levels. Therefore, it can detect short samples better.

We note that the performance of the network may be further optimized via fine-tuning; this network is meant to serve as a generic framework that can also deal with other speech embedding domains, so no optimization tricks are adopted in our proposed network.

4.5 Experiments Under Different Time Lengths
As we discussed above, the training and testing of the model slow down as the time length increases. We address the problem of different time lengths by applying global average pooling (GAP). To examine the sample-length behavior of the proposed model, we test each method with multiple lengths. The results are listed in Table 5.

For all the different sample lengths, SRCNet has the ability to handle the samples, especially the low-embedding-rate ones. For AFA at 0.1 RBR, SRCNet does not have the best accuracy. This result is easy to explain: a low embedding rate provides less useful information about the embedding impact; instead, SRCNet learns features related more to the audio content itself compared with handcrafted-feature-based methods. Again, all these results show that SRCNet can effectively detect samples at different lengths and different embedding rates.

5 CONCLUSIONS
In this paper, we propose an effective end-to-end FCB steganalysis algorithm based on the combination of Recurrent Neural Networks and Convolutional Neural Networks (SRCNet).


The RNN and CNN collaborate with each other and are trained simultaneously. Experimental results demonstrate that SRCNet achieves state-of-the-art performance and has better detection accuracy for short samples at low embedding rates. In addition, our network can be used for detecting the AFA algorithm, an adaptive FCB steganographic method, which is hard to detect with classical handcrafted features. Moreover, global average pooling is used to steganalyze audio of different time lengths.

Our work indicates that the combination of RNN and CNN is a practical method which could inspire other researchers to design better deep neural networks for steganalysis along this orientation. In the future, we will explore the potential of RNN and CNN to further improve the detection of adaptive FCB steganography at low embedding rates.

ACKNOWLEDGMENTS
This work was supported by NSFC under U1636102, U1736214, 61802393 and 61872356, National Key Technology R&D Program under 2016QY15Z2500 and 2016YFB0801003, and Project of Beijing Municipal Science & Technology Commission under Z181100002718001.

The authors thank Yuntao Wang and Weike You for useful suggestions on the paper.

REFERENCES
[1] Bruno Bessette, Redwan Salami, Roch Lefebvre, Milan Jelinek, Jani Rotola-Pukkila, Janne Vainio, Hannu Mikkola, and Kari Jarvinen. 2002. The adaptive multirate wideband speech codec (AMR-WB). IEEE Transactions on Speech and Audio Processing 10, 8 (2002), 620–636.
[2] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
[3] Bolin Chen, Weiqi Luo, and Haodong Li. 2017. Audio steganalysis with convolutional neural network. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 85–90.
[4] Qi Ding and Xijian Ping. 2010. Steganalysis of analysis-by-synthesis compressed speech. In Multimedia Information Networking and Security (MINES), 2010 International Conference on. IEEE, 681–685.
[5] Qi Ding and Xijian Ping. 2010. Steganalysis of compressed speech based on histogram features. In Wireless Communications Networking and Mobile Computing (WiCOM), 2010 6th International Conference on. IEEE, 1–4.
[6] Tomáš Filler, Jan Judas, and Jessica Fridrich. 2011. Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Transactions on Information Forensics and Security 6, 3 (2011), 920–935.
[7] Bernd Geiser and Peter Vary. 2008. High rate data hiding in ACELP speech codecs. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 4005–4008.
[8] Chen Gong, Xiaowei Yi, and Xianfeng Zhao. 2018. Pitch delay based adaptive steganography for AMR speech stream. In International Workshop on Digital Watermarking. Springer, 275–289.
[9] Yongfeng Huang, Chenghao Liu, Shanyu Tang, and Sen Bai. 2012. Steganography integration into a low-bit rate speech codec. IEEE Transactions on Information Forensics and Security 7, 6 (2012), 1865–1875.
[10] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[11] Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).
[12] Rong-San Lin. 2015. An imperceptible information hiding in encoded bits of speech signal. In Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), 2015 International Conference on. IEEE, 37–40.
[13] Zinan Lin, Yongfeng Huang, and Jilong Wang. 2018. RNN-SM: Fast steganalysis of VoIP streams using recurrent neural network. IEEE Transactions on Information Forensics and Security 13, 7 (2018), 1854–1868.
[14] Jin Liu, Ke Zhou, and Hui Tian. 2012. Least-significant-digit steganography in low bitrate speech. In Communications (ICC), 2012 IEEE International Conference on. IEEE, 1133–1137.
[15] Peng Liu, Songbin Li, and Haiqiang Wang. 2017. Steganography integrated into linear predictive coding for low bit-rate speech codec. Multimedia Tools and Applications 76, 2 (2017), 2837–2859.
[16] Haibo Miao, Liusheng Huang, Zhili Chen, Wei Yang, and Ammar Al-Hawbani. 2012. A new scheme for covert communication via 3G encoded speech. Computers & Electrical Engineering 38, 6 (2012), 1490–1501.
[17] Haibo Miao, Liusheng Huang, Yao Shen, Xiaorong Lu, and Zhili Chen. 2013. Steganalysis of compressed speech based on Markov and entropy. In International Workshop on Digital Watermarking. Springer, 63–76.
[18] Akira Nishimura. 2009. Data hiding in pitch delay data of the adaptive multi-rate narrow-band speech codec. In Intelligent Information Hiding and Multimedia Signal Processing, 2009. IIH-MSP'09. Fifth International Conference on. IEEE, 483–486.
[19] Yanzhen Ren, Tingting Cai, Ming Tang, and Lina Wang. 2015. AMR steganalysis based on the probability of same pulse position. IEEE Transactions on Information Forensics and Security 10, 9 (2015), 1801–1811.
[20] Yanzhen Ren, Dengkai Liu, Jing Yang, and Lina Wang. 2018. An AMR adaptive steganographic scheme based on the pitch delay of unvoiced speech. Multimedia Tools and Applications (2018), 1–21.
[21] Yanzhen Ren, Hongxia Wu, and Lina Wang. 2018. An AMR adaptive steganography algorithm based on minimizing distortion. Multimedia Tools and Applications 77, 10 (2018), 12095–12110.
[22] Yanzhen Ren, Jing Yang, Jinwei Wang, and Lina Wang. 2017. AMR steganalysis based on second-order difference of pitch delay. IEEE Transactions on Information Forensics and Security 12, 6 (2017), 1345–1357.
[23] Johan Sjoberg, Magnus Westerlund, Ari Lakaniemi, and Qiaobing Xie. 2002. Real-time transport protocol (RTP) payload format and file storage format for the adaptive multi-rate (AMR) and adaptive multi-rate wideband (AMR-WB) audio codecs. Technical Report.
[24] Hui Tian, Yanpeng Wu, Chin-Chen Chang, Yongfeng Huang, Yonghong Chen, Tian Wang, Yiqiao Cai, and Jin Liu. 2017. Steganalysis of adaptive multi-rate speech using statistical characteristics of pulse pairs. Signal Processing 134 (2017), 9–22.
[25] Yuntao Wang, Kun Yang, Xiaowei Yi, Xianfeng Zhao, and Zhoujun Xu. 2018. CNN-based steganalysis of MP3 steganography in the entropy code domain. In Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security. ACM, 55–65.
[26] Zhijun Wu, Haijuan Cao, and Douzhe Li. 2015. An approach of steganography in G.729 bitstream based on matrix coding and interleaving. Chinese Journal of Electronics 24, 1 (2015), 157–165.
[27] Bo Xiao, Yongfeng Huang, and Shanyu Tang. 2008. An approach to information hiding in low bit-rate speech stream. In Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE, 1–5.
[28] Xiaowei Yi, Kun Yang, Xianfeng Zhao, Yuntao Wang, and Haibo Yu. 2019. AHCM: Adaptive Huffman code mapping for audio steganography based on psychoacoustic model. IEEE Transactions on Information Forensics and Security (2019).
