Xianfeng Zhao*
State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 100093
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China 100093
zhaoxianfeng@iie.ac.cn

Yi Ma
Beijing Information Technology Institute, Beijing, China 100094
mayi 5501@126.com
ABSTRACT
With the rapid development of streaming multimedia, adaptive multi-rate (AMR) audio steganography has emerged recently. However, traditional steganalysis methods face great challenges in detecting short speech samples at low embedding rates. To address this problem, we propose SRCNet, a steganalytic scheme that combines a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN). AMR fixed codebook (FCB) steganography embeds messages by modifying the pulse positions, which destroys the FCB correlation. First, we analyze the FCB correlations at different distances and summarize them into four categories. Furthermore, we use the RNN to extract higher-level contextual representations of FCBs and the CNN to fuse spatial-temporal features for steganalysis. The proposed approach was evaluated on a public dataset. The experimental results validate that the proposed framework greatly outperforms the existing state-of-the-art methods. The correct detection rate of SRCNet is improved by at least 10% when the sample is as short as 100 ms at the 20% embedding rate. In particular, the network achieves significant improvements in detecting STCs-based adaptive AMR steganography.

CCS CONCEPTS
• Computing methodologies → Learning latent representations; • Security and privacy → Authentication.

KEYWORDS
steganalysis, adaptive multi-rate, fixed codebook, pulse position, recurrent neural network, convolutional neural network

ACM Reference Format:
Chen Gong, Xiaowei Yi, Xianfeng Zhao, and Yi Ma. 2019. Recurrent Convolutional Neural Networks for AMR Steganalysis Based on Pulse Position. In ACM Information Hiding and Multimedia Security Workshop (IH&MMSec ’19), July 3–5, 2019, Paris, France. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3335203.3335708

*Corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
IH&MMSec ’19, July 3–5, 2019, Paris, France
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6821-6/19/06. . . $15.00
https://doi.org/10.1145/3335203.3335708

1 INTRODUCTION
Steganography is the technique of hiding secret messages in innocent-looking multimedia, such as images, audio and video. Steganalysis, as the counter-technology of steganography, aims to expose the presence of secret messages hidden in the multimedia.
Recently, audio steganographic algorithms [8, 20, 21, 28] have been proposed in various audio domains. Specifically, low-bit-rate compressed speech steganography can be classified into two categories: compression-independent methods and compression-dependent methods. The former directly modify the compressed speech stream after speech encoding [12, 14, 26]. The latter [7, 9, 16, 18, 20, 21, 27] integrate information hiding into the process of speech encoding, and attract more interest because of their high quality and strong security. Among the many compressed speech codecs, AMR has become a novel carrier for steganography owing to its widespread use in mobile communication. There are three feasible embedding domains in the AMR codec, including linear predictive coding (LPC) [15, 27], fixed codebook
Session: Video & Audio Steganography IH&MMSec ’19, July 3–5, 2019, Paris, France
(FCB) [7, 16, 21] and pitch delay (PD) [9, 18, 20]. Moreover, FCB attracts more attention because of its large embedding capacity and good imperceptibility. In the 12.2 kb/s mode, the AMR-NB encoder processes speech frames of 20 ms and produces a 244-bit code stream per frame. The FCB takes up 140 bits, which is a significant proportion (140/244 = 57.38%) of the total frame bits.
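As a quick consistency check on these figures, the bit budget can be worked through in a few lines of Python. This is only an illustrative sketch: the constant and function names are ours, and the 35-bit FCB allocation per sub-frame is the value stated in Section 2.2 of this paper.

```python
# Illustrative bit-budget check for AMR-NB at the 12.2 kb/s mode.
# All names are hypothetical; the values are taken from the paper's text.
FRAME_MS = 20                # duration of one speech frame in milliseconds
FRAME_BITS = 244             # bits produced per frame
SUBFRAMES_PER_FRAME = 4      # each frame is split into four 5 ms sub-frames
FCB_BITS_PER_SUBFRAME = 35   # 30 position bits + 5 sign bits (Section 2.2)


def bit_rate_kbps(frame_bits: int, frame_ms: int) -> float:
    """Bits per millisecond is numerically equal to kilobits per second."""
    return frame_bits / frame_ms


def fcb_bits_per_frame() -> int:
    """FCB bits in one frame: sub-frames per frame times bits per sub-frame."""
    return SUBFRAMES_PER_FRAME * FCB_BITS_PER_SUBFRAME


def fcb_share_percent() -> float:
    """Share of the frame occupied by FCB bits, in percent (two decimals)."""
    return round(100.0 * fcb_bits_per_frame() / FRAME_BITS, 2)


if __name__ == "__main__":
    print(bit_rate_kbps(FRAME_BITS, FRAME_MS))  # 12.2
    print(fcb_bits_per_frame())                 # 140
    print(fcb_share_percent())                  # 57.38
```

The three printed values reproduce the mode name (12.2 kb/s), the 140 FCB bits and the 57.38% share quoted above.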
FCB is implemented by interleaved single-pulse permutation (ISPP) and adopts a non-exhaustive depth-first tree search, so there is still enough redundancy in the FCB. The FCB is therefore well suited to information hiding: the current FCB can be changed into an alternative FCB without degrading the synthesized speech quality. The existing FCB-based steganography [7, 16, 21] builds on this idea. Geiser et al. [7] proposed a method for hiding digital data at a comparatively high rate in the AMR-NB speech codec; the hidden bits can be directly extracted from the bitstream. Miao et al. [16] further introduced an AMR-WB steganography that searches for a suboptimal code-vector whose pulse combination meets the adaptive suboptimal-pulse-combination constraint (ASOPCC). Ren et al. [21] proposed an AMR FCB adaptive steganographic algorithm (AFA) based on minimizing embedding distortion, in which a content-aware distortion function is designed to resist detection by existing steganalysis.
To detect FCB-based embedding schemes, several steganalysis studies that combine well-designed handcrafted features with a Support Vector Machine (SVM) have been proposed in recent years. Existing FCB steganography [7, 16, 21] destroys the distribution statistics of pulse positions. Ding et al. [4, 5] first presented histogram-based steganalysis features, including histogram flatness, the center of mass of the histogram characteristic function, and histogram variance. Miao et al. [17] introduced Markov transition probabilities and entropy to evaluate the interrelationships between adjacent pulses. Specifically, the entropy-based features consist of the joint entropy and the conditional entropy. Ren et al. [19] designed a set of features based on the probability of the pulse positions being the same in the same track. Tian et al. [24] extended previous research by employing the probability distributions of paired pulses, the Markov transition probabilities of paired pulses, and the joint probability matrices of pulse pairs; the feature dimension is reduced with AdaBoost prior to training.
However, existing steganalysis methods are limited in real scenarios. Challenges remain in detecting live streaming audio and highly secure steganography, especially for short audio at low embedding rates. The high user concurrency and fast throughput of streaming media require that streaming media steganalysis respond instantly and with high accuracy. Moreover, steganographers remain inactive for a long time and embed messages within a short time. A typical scenario of AMR stream steganalysis and steganography is shown in Figure 1. To address the limitations of current FCB steganalysis, we propose a data-driven recurrent convolutional steganalysis neural network (SRCNet).

Figure 1: A typical scenario of AMR stream steganalysis and steganography.

Our goal is to design a data-driven method that makes steganalysis more generally applicable, independently of the specific embedding domain. One such domain is pitch delay (PD) [8, 20, 22], since PD-based steganography and FCB-based steganography share the same idea of changing speech coding parameters. Therefore, pre-processing and prior feature selection are not adopted in our proposed method. For example, the effectiveness of high-pass filters has already been analyzed and proven in [3, 25]. However, from the perspective of deep learning, all parameters of the network should be learned automatically rather than constructed by hand. For this reason, the original decoded pulse positions are adopted as the input data of the network.
Additionally, a detailed analysis of FCB correlation at various frame levels was conducted by summarizing the correlations into four categories. The experimental results provide guidance for the design of the network. We evaluated the different correlations with a co-occurrence matrix, which can measure FCB correlation at different distances. Existing FCB steganalysis only considers the distributions of pulse pairs within a sub-frame or between sub-frames [4, 5, 17, 19]. Current methods are therefore not complete enough to characterize the FCB.
We are the first to propose a unified RNN-CNN framework for FCB steganalysis, which enables the model to automatically capture features not only in the time dimension but also
in the space dimension, inspired by previous research achievements [3, 13, 25]. The RNN and the CNN play the roles of temporal analyst and feature extractor, respectively. The RNN is exploited to model the sequential inputs of pulse positions. The CNN is used to construct the representation of the FCB, while also accelerating convergence and decreasing the danger of overfitting. Specifically, the decoded pulse positions are fed into an RNN, which extracts temporal context features of the FCB. The outputs of the RNN are used as the input data of a CNN. The goal of the CNN is to extract the key FCB features from low-level features to the greatest extent possible and to complete the speech classification.
The rest of the paper is organized as follows. In Section 2, related background is briefly introduced. Section 3 introduces the proposed end-to-end approach to FCB steganalysis. Several experimental evaluations and analyses are discussed in Section 4. Finally, Section 5 concludes with a summary and directions for future work.

2 BACKGROUND

2.1 AMR codec
AMR-WB and AMR-NB [1] work on the same principle of Algebraic Code Excited Linear Prediction (ACELP) [23], which aims to model the production of the human voice. The encoding diagram is shown in Figure 2. The encoder processes speech frames of 20 ms as the basic unit. Each frame is segmented into four sub-frames of 5 ms each. In each sub-frame, a linear prediction synthesis filter synthesizes the output signal by filtering the sum of the appropriate adaptive and fixed codebook vectors. The optimal excitation signal is determined by a so-called analysis-by-synthesis approach, which minimizes the weighted error between the synthesized speech and the original speech. Finally, the encoded parameters are obtained and transmitted through a public channel.

Figure 2: AMR optimal codebook search diagram.

FCB search is the key part of AMR encoding: only a small subset of the optimal non-zero pulse positions is chosen and encoded. The FCB structure is based on the interleaved single-pulse permutation (ISPP) design. Different coding modes have different codebook distributions. With an 8 kHz sampling rate, the 40 positions of each 5 ms sub-frame are divided into five tracks of interleaved positions, with 8 positions in each track.
The FCB is constructed by placing 2 non-zero-amplitude pulses in each track. Each pulse can be placed on eight different positions, so a total of 10 non-zero pulse positions are encoded per sub-frame. Generally speaking, the FCB search finds the optimal positions of 10 pulses among 40 candidate positions. The available search positions in each track at the AMR-NB 12.2 kbit/s mode are given in Table 1.

Table 1: Searching positions in each track at mode 12.2 kb/s. $i_t$ and $i_{t+5}$ represent the two pulses in the same track $t$ ($0 \le t \le 4$), respectively.

Track   Pulse        Position
0       $i_0, i_5$   0, 5, 10, 15, 20, 25, 30, 35
1       $i_1, i_6$   1, 6, 11, 16, 21, 26, 31, 36
2       $i_2, i_7$   2, 7, 12, 17, 22, 27, 32, 37
3       $i_3, i_8$   3, 8, 13, 18, 23, 28, 33, 38
4       $i_4, i_9$   4, 9, 14, 19, 24, 29, 34, 39

The FCB search determines the optimal combination of the 10 pulses by the depth-first tree search method. Each pulse has 8 potential positions. If every position of all pulses were searched, $(8 \times 8)^4 = 16777216$ evaluations would be required to find the optimal combination. In practice, sub-optimal algorithms are often adopted in consideration of the real-time requirement of voice communications. The FCB search algorithm of AMR adopts the depth-first tree search instead of the nested-loop focused search, which reduces the number of evaluations to $4 \times (4 \times 8 \times 8) = 1024$. Different modes differ in the number of non-zero pulses and the corresponding search levels. We take the 12.2 kbit/s mode as an example to illustrate the search process, together with Figure 3. Briefly, there are two nested loops. The outer loop iterates pulse $i_1$ over the local maxima of the tracks, and the inner loop handles the 4 groups of paired pulses. The body of the outer loop for a fixed $i_1$ is shown in Figure 3. There are six levels in the search tree. Pulses $i_0$ and $i_1$ do not require a search. Specifically, at level 1, the position of pulse $i_0$ is determined by the global maximum of the reference signals in all tracks. Then the four outermost iterations are carried out to determine the position of pulse $i_1$ at level 2: $i_1$ is tentatively assigned to the local maximum within each of the other four tracks, excluding the track occupied by the first pulse. Next, the paired pulses are searched in the last four levels: $(i_2, i_3)$, $(i_4, i_5)$, $(i_6, i_7)$, $(i_8, i_9)$. At level 3, pulses $i_2$ and $i_3$ are searched in their respective tracks. During the $8 \times 8$ nested-loop search, all 8 admissible positions of pulse $i_2$ are tested together with all 8 admissible positions of pulse $i_3$. The positions of $i_2$ and $i_3$ are determined once the current target signal reaches its maximum value. The number of tested combinations is $8 \times 8 = 64$. In the subsequent three levels, the search is the same as at level 3.
Although the depth-first tree search reduces the computational complexity, the resulting fixed codebook vector is only locally optimal because the search range is restricted. However, this is advantageous
for steganography. There are still redundancies in the fixed codebook search space; that is, minor modifications of the FCB cause little loss of speech quality.

Figure 3: Overview of the AMR FCB search process for a fixed pulse $i_1$ at the 12.2 kbit/s mode, assuming that $i_1, i_2, i_3, i_4, i_5, i_6, i_7, i_8, i_9$ are in track 1, track 2, track 3, track 4, track 0, track 1, track 2, track 3, track 4, respectively. Paired pulses $i_t, i_{t+1}$ are searched together in an $8 \times 8$ loop. The color of each circle corresponds to a searchable position of one pulse.

2.2 FCB Steganography
This section first reviews the FCB structure in order to explain how FCB-based steganography is designed. The AMR 12.2 kb/s mode is taken as an example.
The detailed allocation of the FCB bits is shown in Table 2. In each sub-frame, 40 potential positions are divided into 5 tracks, and each pulse has 8 candidate positions. Once the pulse positions are determined, they are encoded with Gray code to enhance robustness. The two pulse positions in one track are encoded with 6 bits (3 bits for the position of each pulse, 30 bits in total). For the two pulses in the same track, one additional sign bit is needed. This sign bit indicates the sign of the first pulse; the sign of the second pulse depends on the first pulse. If the position of the second pulse is smaller than that of the first pulse, it has the opposite sign; otherwise it has the same sign. The FCB therefore takes up 35 bits. Typically, FCB-based steganography takes the two pulses in the same track as the basic processing unit.
The FCBs are searched by the depth-first tree; that is, only a small subset of the available positions is examined, so the final FCBs may not be optimal. Based on the redundancies of the FCB, it is possible to embed messages by modifying pulse positions without noticeably degrading speech quality.

Table 2: FCB bits allocation in the first sub-frame

Bits allocation   Content description
S52               sign information for $i_0$ and $i_5$ pulses
S53 - S55         position of $i_0$ pulse
S56               sign information for $i_1$ and $i_6$ pulses
S57 - S59         position of $i_1$ pulse
S60               sign information for $i_2$ and $i_7$ pulses
S61 - S63         position of $i_2$ pulse
S64               sign information for $i_3$ and $i_8$ pulses
S65 - S67         position of $i_3$ pulse
S68               sign information for $i_4$ and $i_9$ pulses
S69 - S71         position of $i_4$ pulse
S72 - S74         position of $i_5$ pulse
S75 - S77         position of $i_6$ pulse
S78 - S80         position of $i_7$ pulse
S81 - S83         position of $i_8$ pulse
S84 - S86         position of $i_9$ pulse
S87 - S91         fixed codebook gain

The existing methods for AMR audio steganography [7, 16, 21] share the same idea: they embed secret messages in the last pulse position of each track by restricting the search space for that pulse position.
Geiser [7] first proposed a steganographic FCB strategy for the AMR-NB 12.2 kbit/s mode. For track $t$ ($0 \le t \le 4$),
assume $i_t$ is the first pulse position and $i_{t+5}$ is the second pulse position in this track. $(m)_{2k,2k+1}$ denotes the binary 2-bit message to be embedded in track $t$ (with $k = t$). The two possible positions for $i_{t+5}$ depend on $i_t$ and the message $(m)_{2k,2k+1}$, and are given by Equation 1:

$$
i_{t+5} =
\begin{cases}
\mathrm{gray}^{-1}\big(\mathrm{gray}(\lfloor i_t / 5 \rfloor) \oplus (m)_{2k,2k+1}\big) \cdot 5 + t \\
\mathrm{gray}^{-1}\big(\mathrm{gray}(\lfloor i_t / 5 \rfloor) \oplus (m)_{2k,2k+1} + 4\big) \cdot 5 + t
\end{cases}
\tag{1}
$$

where $\mathrm{gray}$ and $\mathrm{gray}^{-1}$ are respectively the Gray encoding and the Gray decoding by table lookups, $\oplus$ is the bitwise exclusive-or of two binary strings, and $\lfloor x \rfloor = \max\{n \in \mathbb{Z} \mid n \le x\}$ rounds $x$ down. Additionally, the 3-bit pulse position index in track $t$ is obtained by $\lfloor i_t / 5 \rfloor$.
After obtaining the pulse positions per sub-frame, the extraction of the 2-bit hidden message $(m)_{2k,2k+1}$ ($k \in [0, 4]$) [7] is performed by Equation 2:

$$
(m)_{2k,2k+1} = \big(\mathrm{gray}(\lfloor i_t / 5 \rfloor) \oplus \mathrm{gray}(\lfloor i_{t+5} / 5 \rfloor)\big) \,\%\, 4
\tag{2}
$$

Miao et al. [16] further introduced an adaptive suboptimal pulse combination constrained (ASOPCC) method in the AMR-WB speech codec. Similar in principle to [7], it incorporates message embedding into the FCB search by controlling the pulse positions in the same track. Different from [7], Miao introduces an embedding factor $\eta$ to achieve a better trade-off between speech quality and embedding capacity. For the second pulse position $i_{t+5}$ in track $t$, the search space should satisfy Equation 3:

$$
m_t = \left( \sum_{i=0}^{P_t} \mathrm{gray}\!\left(\left\lfloor \frac{P_t^i}{N} \right\rfloor\right) \right) \oplus \eta
\tag{3}
$$

where $\mathrm{gray}$ is the encoding operation, $N$ is the total number of tracks, $\eta$ is the dynamic factor that controls the embedded bits, $t$ is the track index, $P_t^i$ is the $i$-th pulse position in track $t$, $P_t$ is the number of pulses in track $t$, and $m_t$ denotes the secret data to be embedded. The embedding principle is to control the second pulse position in track $t$. From Equation 3, $\lfloor \log_2 \eta \rfloor + 1$ bits of data can be embedded per track. According to [16], $\eta$ is usually set to 1, 2, 3 or 4 at the AMR 12.2 kb/s mode. The message can be extracted by Equation 3 as well.
Ren et al. [21] first proposed an adaptive steganography scheme (AFA), implemented with the syndrome-trellis codes of [6]. The key contribution of the scheme is the design of the cost function and the additive distortion function in the FCB embedding domain. The optimal probability of the pulse and the pulse correlation in the same track are incorporated into the cost function to resist detection by existing steganalysis. The optimal probability of the pulse is calculated as Equation 4:

$$
P_{f,k} = \frac{\sum_{f=1}^{N_f} p_{f,k}}{N_f}
\tag{4}
$$

where $N_f$ is the total number of sub-frames, $f$ is the index of a sub-frame within a frame, and $k$ is the index of the pulse label in a sub-frame, $0 \le k \le 9$. If the current pulse $i_k$ is optimal with the global maximum of the reference signals in all tracks, $p_{f,k} = 1$; otherwise $p_{f,k} = 0$. The pulse correlation in the same track of each sub-frame is calculated as Equations 5 and 6:

$$
M^{\mu,\nu}(t) = P(i_t = \mu, i_{t+5} = \nu \,\|\, i_t = \nu, i_{t+5} = \mu)
\tag{5}
$$

$$
M_t(\mu, \nu) = \frac{\sum_{f=1}^{N_f} M^{\mu,\nu}(t)}{N_f}
\tag{6}
$$

where $f$ is the index of a sub-frame within a frame, and $\mu, \nu$ are the indices of the pulse labels. If the first pulse position $i_t$ is $\mu$ and the second pulse position $i_{t+5}$ is $\nu$ in the same track, $P(i_t = \mu, i_{t+5} = \nu \,\|\, i_t = \nu, i_{t+5} = \mu) = 1$; otherwise $P(i_t = \mu, i_{t+5} = \nu \,\|\, i_t = \nu, i_{t+5} = \mu) = 0$.

Figure 4: The flow chart of the proposed SRCNet neural network for FCB steganalysis.
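Returning to the embedding rule of Equation 1 and the extraction rule of Equation 2, the round trip can be illustrated with a minimal Python sketch. Several points are assumptions rather than the codec's actual implementation: the function names are ours, the binary-reflected Gray code stands in for the codec's lookup tables, and the term "$\oplus\, (m) + 4$" is read as an exclusive-or with $m + 4$, which sets the third bit and leaves the low two bits recoverable by the modulo-4 step.

```python
# Sketch of the embedding rule (Equation 1) and the extraction rule
# (Equation 2). Function names are hypothetical. The codec performs Gray
# coding by table lookup; the binary-reflected Gray code used here is a
# stand-in assumption (any 3-bit bijection round-trips the same way).


def gray(n: int) -> int:
    """Binary-reflected Gray encoding of a non-negative integer."""
    return n ^ (n >> 1)


def gray_inv(g: int) -> int:
    """Inverse of gray(): XOR together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def track_positions(t: int) -> list:
    """The 8 admissible positions of track t (Table 1): t, t+5, ..., t+35."""
    return [t + 5 * k for k in range(8)]


def candidate_positions(i_t: int, m: int, t: int) -> list:
    """Two admissible positions of the second pulse for a 2-bit message m."""
    base = gray(i_t // 5)
    # Assumed reading of Equation 1: XOR with m, or with m + 4.
    return [gray_inv(base ^ m) * 5 + t, gray_inv(base ^ (m + 4)) * 5 + t]


def extract(i_t: int, i_t5: int) -> int:
    """Recover the 2-bit message from the two pulse positions of a track."""
    return (gray(i_t // 5) ^ gray(i_t5 // 5)) % 4


if __name__ == "__main__":
    t, i_t, m = 2, 17, 0b10
    for pos in candidate_positions(i_t, m, t):
        assert pos in track_positions(t)
        assert extract(i_t, pos) == m
```

Under these assumptions, both candidates stay inside the track of the first pulse, and either one yields the same 2-bit message on extraction, which is what allows the encoder to choose between two positions.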
$$
POM(\alpha, \beta) = P(i^{\mu}_{m,k} = \alpha, \; j^{\nu}_{n,l} = \beta)
\tag{9}
$$

where $\alpha, \beta$ are the admissible positions and satisfy $0 \le \alpha, \beta \le N - 1$.
To illustrate the importance of these correlations, several experiments were designed. We collected an AMR stream with 300,000 sub-frames for both cover and stego audio at the AMR-NB 12.2 kbit/s mode and evaluated the codeword correlations according to Equation 9. The steganography algorithm is Geiser's method [7] at an embedding bit rate of 100%. The parameters in the $POM$ experiments are set as follows:
* For inter-frame level correlation, we set $\mu = 1, \nu = 2, m = 1, n = 2, k = l$. Both pulses are in the same sub-frame. The former is the first non-zero pulse position in track 1, and the latter is the second non-zero pulse position in track 2.
* For intra-frame level correlation, we set $\mu = 1, \nu = 1, m = 1, n = 2, k = l + 4$. The two pulses are in successive frames (one frame contains 4 sub-frames). The former is the first non-zero pulse position, and the latter is the second non-zero pulse position. Both are in track 1.
* For phoneme level correlation, we set $\mu = 1, \nu = 1, m = 1, n = 2, k = l + 8$. The fixed distance between the two pulses is 2 frames. The former is the first non-zero pulse position, and the latter is the second non-zero pulse position. Both are in track 1.
* For word level correlation, we set $\mu = 1, \nu = 1, m = 1, n = 2, k = l + 1600$. The fixed distance between the two pulses is 400 frames. The former is the first non-zero pulse position, and the latter is the second non-zero pulse position. Both are in track 1.
Figure 9 shows the analysis of the four kinds of FCB correlations discussed above. The horizontal axis represents the first admissible non-zero pulse position, the vertical axis represents the second admissible non-zero pulse position, and the depth of the block color indicates the value of $POM(\alpha, \beta)$. As the figure shows, embedding destroys the statistical distribution of pulse positions in all four FCB correlations. In this example, successive frame correlation is the strongest. Intra-frame correlation and cross frame correlation are about equally strong. Cross word correlation is the weakest, yet it can still provide classification clues. SRCNet has the potential to consider all four correlations at the same time, so it is more likely to achieve better results.

Figure 9: Distribution of two pulse positions at the four correlation levels, estimated from 300,000 sub-frames. Panels: (a) cover at inter level, (b) stego at inter level, (c) cover at intra level, (d) stego at intra level, (e) cover at phoneme level, (f) stego at phoneme level, (g) cover at word level, (h) stego at word level. Owing to limited space, only two digits after the decimal point are displayed.
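The co-occurrence statistic of Equation 9 can be estimated from decoded pulse positions with a short Python sketch. The function name, the argument layout (one position index per sub-frame for each of the two pulses) and the default of 8 admissible positions are assumptions made for illustration; the `lag` argument plays the role of the offset between $k$ and $l$ in the parameter settings above (0, 4, 8 or 1600).

```python
# Sketch of estimating POM(alpha, beta) from two sequences of pulse-position
# indices, one value per sub-frame. Names and defaults are hypothetical.


def pom(first, second, lag=0, n_positions=8):
    """Relative frequency of (first[l + lag] == alpha, second[l] == beta)."""
    counts = [[0] * n_positions for _ in range(n_positions)]
    n_pairs = min(len(first) - lag, len(second))
    if n_pairs <= 0:
        raise ValueError("sequences too short for the requested lag")
    for l in range(n_pairs):
        counts[first[l + lag]][second[l]] += 1
    # Normalize the counts so that the matrix sums to 1.
    return [[c / n_pairs for c in row] for row in counts]


if __name__ == "__main__":
    first = [0, 1, 2, 3]
    second = [3, 2, 1, 0]
    same_subframe = pom(first, second, lag=0, n_positions=4)
    shifted = pom(first, second, lag=1, n_positions=4)
    print(same_subframe[0][3])  # 0.25
    print(shifted[2][2])
```

Comparing such matrices for cover and stego streams at each lag is one way to reproduce the kind of contrast that Figure 9 visualizes.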
4.4 Experiments Under Different Embedding Rates
Existing steganalysis based on handcrafted features is far from practical application at low embedding rates. We tested the detection performance of each steganalysis method at different embedding rates (20% to 100% with a step size of 20%) when the length of the speech samples is 100 ms. Different languages are tested separately. The results are shown in Table 4.
Table 4: Detection results (%) of 100 ms samples under different embedding rates and different languages at the 12.2 kb/s mode.

As the table shows, the accuracy of each steganalysis method increases with the embedding rate. The reason is straightforward: a higher embedding rate has more impact on the FCB correlations, so the difference between the cover correlation and the stego correlation becomes more obvious, which greatly improves detection.
The results differ only slightly between languages. SRCNet obtains better detection for English speech samples than for Chinese speech samples. As the embedding rate increases, the difference between Chinese and English gradually decreases. One possible explanation for these results is the difference in the characteristics of the speech signal itself between languages. The Chinese language is more complicated than the English language: English consists of 20 vowels and 28 consonants, whereas Chinese has 412 kinds of syllables. From the perspective of deep learning, Chinese is more difficult to model than English. Therefore, it is more difficult to detect Chinese speech steganography.
We also compare the results with Fast-SPP and MTJCE. For English speech, SRCNet has clearly better accuracy than both MTJCE and Fast-SPP. For Chinese speech, SRCNet has better accuracy than MTJCE, except in the cases of Miao ($\eta = 2$) at 1.0 RBR and Miao ($\eta = 3$) at 0.8 and 1.0 RBR, where SRCNet is close to MTJCE.
These results indicate that SRCNet provides competitive detection for low embedding rate samples compared with other state-of-the-art methods. Fast-SPP and MTJCE analyze only the intra-frame and adjacent-frame FCB correlations, so the available information is limited when the sample is short. However, SRCNet has
enough capacity to comprehensively analyze FCB correlations at different frame levels. Therefore, it can detect short samples better.
We note that the performance of the network may be further optimized via fine-tuning. The network is meant to serve as a generic framework that can also handle other speech embedding domains, so no optimization tricks are adopted in our proposed network.

4.5 Experiments Under Different Time Length
As discussed above, the training and testing of the model slow down as the time length increases. We address the problem of different time lengths by applying global average pooling (GAP). To examine the effect of sample length on the proposed model, we test each method with multiple lengths. The results are listed in Table 5.

Table 5: Detection results (%) of different time length samples under different embedding rates for the English language at the 12.2 kb/s mode.

SRCNet is able to handle samples of all tested lengths, especially samples at low embedding rates. For AFA at 0.1 RBR, SRCNet does not have the best accuracy. This result is easy to explain: a low embedding rate provides less useful information about the embedding impact. Instead, SRCNet learns features related more to the audio content itself than the handcrafted-feature-based methods do. Again, all these results show that SRCNet can effectively detect samples at different lengths and different embedding rates.

5 CONCLUSIONS
In this paper, we propose an effective end-to-end FCB steganalysis algorithm based on the combination of Recurrent Neural Networks and Convolutional Neural Networks (SRCNet).
The RNN and the CNN collaborate with each other and are trained simultaneously. Experimental results demonstrate that SRCNet achieves state-of-the-art performance and has better detection accuracy for short samples at low embedding rates. In addition, our network can be used for detecting the AFA algorithm, an adaptive FCB steganographic method that is hard to detect with classical handcrafted features. Moreover, global average pooling is used to steganalyze audio of different time lengths.
Our work indicates that the combination of RNN and CNN is a practical method, which could inspire other researchers to design better deep neural networks for steganalysis in this direction. In the future, we will explore the potential of RNN and CNN to further improve the detection of adaptive FCB steganography at low embedding rates.

ACKNOWLEDGMENTS
This work was supported by NSFC under U1636102, U1736214, 61802393 and 61872356, National Key Technology R&D Program under 2016QY15Z2500 and 2016YFB0801003, and Project of Beijing Municipal Science & Technology Commission under Z181100002718001.
The authors thank Yuntao Wang and Weike You for useful suggestions on the paper.

REFERENCES
[1] Bruno Bessette, Redwan Salami, Roch Lefebvre, Milan Jelinek, Jani Rotola-Pukkila, Janne Vainio, Hannu Mikkola, and Kari Jarvinen. 2002. The adaptive multirate wideband speech codec (AMR-WB). IEEE Transactions on Speech and Audio Processing 10, 8 (2002), 620–636.
[2] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
[3] Bolin Chen, Weiqi Luo, and Haodong Li. 2017. Audio steganalysis with convolutional neural network. In Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 85–90.
[4] Qi Ding and Xijian Ping. 2010. Steganalysis of analysis-by-synthesis compressed speech. In Multimedia Information Networking and Security (MINES), 2010 International Conference on. IEEE, 681–685.
[5] Qi Ding and Xijian Ping. 2010. Steganalysis of compressed speech based on histogram features. In Wireless Communications Networking and Mobile Computing (WiCOM), 2010 6th International Conference on. IEEE, 1–4.
[6] Tomáš Filler, Jan Judas, and Jessica Fridrich. 2011. Minimizing additive distortion in steganography using syndrome-trellis codes. IEEE Transactions on Information Forensics and Security 6, 3 (2011), 920–935.
[7] Bernd Geiser and Peter Vary. 2008. High rate data hiding in ACELP speech codecs. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 4005–4008.
[8] Chen Gong, Xiaowei Yi, and Xianfeng Zhao. 2018. Pitch Delay Based Adaptive Steganography for AMR Speech Stream. In International Workshop on Digital Watermarking. Springer, 275–289.
[9] Yongfeng Huang, Chenghao Liu, Shanyu Tang, and Sen Bai. 2012. Steganography integration into a low-bit rate speech codec. IEEE Transactions on Information Forensics and Security 7, 6 (2012), 1865–1875.
[10] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[11] Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).
[12] Rong-San Lin. 2015. An Imperceptible Information Hiding in Encoded Bits of Speech Signal. In Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), 2015 International Conference on. IEEE, 37–40.
[13] Zinan Lin, Yongfeng Huang, and Jilong Wang. 2018. RNN-SM: Fast steganalysis of VoIP streams using recurrent neural network. IEEE Transactions on Information Forensics and Security 13, 7 (2018), 1854–1868.
[14] Jin Liu, Ke Zhou, and Hui Tian. 2012. Least-significant-digit steganography in low bitrate speech. In Communications (ICC), 2012 IEEE International Conference on. IEEE, 1133–1137.
[15] Peng Liu, Songbin Li, and Haiqiang Wang. 2017. Steganography integrated into linear predictive coding for low bit-rate speech codec. Multimedia Tools and Applications 76, 2 (2017), 2837–2859.
[16] Haibo Miao, Liusheng Huang, Zhili Chen, Wei Yang, and Ammar Al-Hawbani. 2012. A new scheme for covert communication via 3G encoded speech. Computers & Electrical Engineering 38, 6 (2012), 1490–1501.
[17] Haibo Miao, Liusheng Huang, Yao Shen, Xiaorong Lu, and Zhili Chen. 2013. Steganalysis of compressed speech based on Markov and entropy. In International Workshop on Digital Watermarking. Springer, 63–76.
[18] Akira Nishimura. 2009. Data hiding in pitch delay data of the adaptive multi-rate narrow-band speech codec. In Intelligent Information Hiding and Multimedia Signal Processing, 2009. IIH-MSP'09. Fifth International Conference on. IEEE, 483–486.
[19] Yanzhen Ren, Tingting Cai, Ming Tang, and Lina Wang. 2015. AMR steganalysis based on the probability of same pulse position. IEEE Transactions on Information Forensics and Security 10, 9 (2015), 1801–1811.
[20] Yanzhen Ren, Dengkai Liu, Jing Yang, and Lina Wang. 2018. An AMR adaptive steganographic scheme based on the pitch delay of unvoiced speech. Multimedia Tools and Applications (2018), 1–21.
[21] Yanzhen Ren, Hongxia Wu, and Lina Wang. 2018. An AMR adaptive steganography algorithm based on minimizing distortion. Multimedia Tools and Applications 77, 10 (2018), 12095–12110.
[22] Yanzhen Ren, Jing Yang, Jinwei Wang, and Lina Wang. 2017. AMR steganalysis based on second-order difference of pitch delay. IEEE Transactions on Information Forensics and Security 12, 6 (2017), 1345–1357.
[23] Johan Sjoberg, Magnus Westerlund, Ari Lakaniemi, and Qiaobing Xie. 2002. Real-time transport protocol (RTP) payload format and file storage format for the adaptive multi-rate (AMR) and adaptive multi-rate wideband (AMR-WB) audio codecs. Technical Report.
[24] Hui Tian, Yanpeng Wu, Chin-Chen Chang, Yongfeng Huang, Yonghong Chen, Tian Wang, Yiqiao Cai, and Jin Liu. 2017. Steganalysis of adaptive multi-rate speech using statistical characteristics of pulse pairs. Signal Processing 134 (2017), 9–22.
[25] Yuntao Wang, Kun Yang, Xiaowei Yi, Xianfeng Zhao, and Zhoujun Xu. 2018. CNN-based Steganalysis of MP3 Steganography in the Entropy Code Domain. In Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security. ACM, 55–65.
[26] Zhijun Wu, Haijuan Cao, and Douzhe Li. 2015. An approach of steganography in G.729 bitstream based on matrix coding and interleaving. Chinese Journal of Electronics 24, 1 (2015), 157–165.
[27] Bo Xiao, Yongfeng Huang, and Shanyu Tang. 2008. An approach to information hiding in low bit-rate speech stream. In Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE, 1–5.
[28] Xiaowei Yi, Kun Yang, Xianfeng Zhao, Yuntao Wang, and Haibo Yu. 2019. AHCM: Adaptive Huffman Code Mapping for Audio Steganography Based on Psychoacoustic Model. IEEE Transactions on Information Forensics and Security (2019).