
Robust CAPTCHAs towards Malicious OCR


Jiaming Zhang, Jitao Sang, Kaiyuan Xu, Shangxi Wu, Xian Zhao, Yanfeng Sun, Yongli Hu, Jian Yu

J. Zhang is with the Beijing University of Technology, Beijing, China (e-mail: zhangjm@emails.bjut.edu.cn).
J. Sang is with the School of Computer and Information Technology and the Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China (e-mail: jtsang@bjtu.edu.cn).
K. Xu is with the Beijing Jiaotong University, Beijing, China (e-mail: 15281106@bjtu.edu.cn).
S. Wu is with the Beijing Jiaotong University, Beijing, China (e-mail: kirinng@gmail.com).
X. Zhao is with the Beijing Jiaotong University, Beijing, China (e-mail: lavender.zxshane@gmail.com).
Y. Sun is with the Beijing University of Technology, Beijing, China (e-mail: yfsun@bjut.edu.cn).
Y. Hu is with the Beijing University of Technology, Beijing, China (e-mail: huyongli@bjut.edu.cn).
J. Yu is with the School of Computer and Information Technology and the Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China (e-mail: jianyu@bjtu.edu.cn).

Abstract—The Turing test was originally proposed to examine whether a machine's behavior is indistinguishable from a human's. The most popular and practical Turing test is CAPTCHA, which discriminates algorithms from humans by offering recognition-style questions. The recent development of deep learning has significantly advanced the capability of algorithms in solving CAPTCHA questions, forcing CAPTCHA designers to increase question complexity. Instead of designing questions difficult for both algorithm and human, this study attempts to employ the limitations of algorithms to design robust CAPTCHA questions that remain easily solvable by humans. Specifically, our data analysis observes that human and algorithm demonstrate different vulnerabilities to visual distortions: adversarial perturbation is significantly annoying to the algorithm yet friendly to the human. We are motivated to employ adversarially perturbed images for robust CAPTCHA design in the context of character-based questions. Four modules of multi-target attack, ensemble adversarial training, image preprocessing differentiable approximation, and expectation over stochastic transformations are proposed to address the characteristics of character-based CAPTCHA cracking. Qualitative and quantitative experimental results demonstrate the effectiveness of the proposed solution. We hope this study can lead to discussions around adversarial attack/defense in CAPTCHA design and also inspire future attempts at employing algorithm limitations for practical usage.

Index Terms—adversarial example, CAPTCHA, OCR

Fig. 1. Increasing content complexity of CAPTCHAs.

I. INTRODUCTION

Alan Turing first proposed the Turing Test question "Can machines think like human?" [1] The Turing test was initially designed to examine whether a machine's exhibited intelligent behavior is indistinguishable from that of a human, and it later developed into a form of reverse Turing test with the more practical goal of distinguishing between computer and human. Among reverse Turing tests, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) turns out to be the most well-known one, used in anti-spam systems to prevent abuse by automated programs [2].

Most early CAPTCHAs, like the reCAPTCHA [3] which assists in the digitization of Google books, belong to the traditional character-based scheme involving only numbers and English characters. With the fast progress of machine learning, especially deep learning algorithms, simple character-based CAPTCHAs fail to distinguish between algorithm and human [4]–[6]. CAPTCHA designers were therefore forced to increase the complexity of the content to be recognized. As shown in Fig. 1, while extremely complex CAPTCHAs reduce the risk of being cracked by algorithms, they also heavily increase the burden of human recognition. It is noteworthy that the effectiveness of simply increasing content complexity is based on the assumption that humans have consistently superior recognition capability to algorithms. The last few years have witnessed human-level AI in tasks like image recognition [7], speech processing [8] and even reading comprehension [9]. It is easy to imagine that, with the further development of algorithms, continuously increasing content complexity will reach a critical point where the algorithm can recognize the content yet the human cannot.

Let us review the initial goal of CAPTCHA: to discriminate human from algorithm by designing tasks unsolvable by algorithms. Therefore, the straightforward solution is to employ the limitations of algorithms to facilitate CAPTCHA design. While algorithms have advanced their performance in many perspectives including visual/vocal recognition accuracy, they retain some notorious limitations compared with humans. Researchers and practitioners have already employed such limitations to design new forms of CAPTCHAs, e.g., developing cognitive [10] and sequentially related [11] questions to challenge algorithms' lack of commonsense knowledge and poor contextual reasoning ability.

Following this spirit, we are interested in exploring the possibility of improving the robustness of CAPTCHAs towards algorithm cracking without changing the traditional character-based scheme. In other words, is it possible to design character CAPTCHAs only friendly to human instead


of simply increasing content complexity? The key lies in finding an algorithm limitation compatible with the scheme of character images. One candidate is the vulnerability to visual distortions. We have conducted data analysis and observed that human and algorithm exhibit different vulnerabilities to visual distortions (the observations are detailed in Section III). This inspires us to exploit distortions that are friendly to humans but obstructive to algorithms to pollute the original character CAPTCHA. Specifically, adversarial perturbation [12] exactly meets this requirement: adversarial attack¹ and CAPTCHA share the common intention that the same distortion is imperceptible to the human but significantly affects the algorithm. The notorious characteristic of adversarial perturbation for visual understanding turns out to be the desired one for CAPTCHA design.

¹Adversarial attack refers to the process of adding small but specially crafted perturbations to generate adversarial examples misleading the algorithm. To avoid confusion with the process of attacking CAPTCHA, in this study we use "adversarial attack" to indicate the generation of adversarially distorted CAPTCHAs and "CAPTCHA crack" to indicate the attempt of passing CAPTCHA with algorithms.

Inspired by this, we employ adversarial perturbation to design robust character-based CAPTCHAs in this study. The current state-of-the-art cracking solution views CAPTCHA OCR (Optical Character Recognition) as a sequential recognition problem [13]–[17]. To remove the potential distortions, image preprocessing operations are typically added before OCR. Correspondingly, in this study we propose to simultaneously attack multiple targets to address the sequential recognition issue (Section IV-A), and to handle image preprocessing operations via differentiable approximation (Section IV-C) and stochastic image transformations via expectation (Section IV-D) in the adversarial example generation process, so as to cancel out their potential influence. Moreover, since we have no knowledge about the detailed algorithm the cracking solution uses (e.g., the neural network structure), the generated adversarial examples are expected to be resistant to unknown OCR algorithms in black-box cracking. This study resorts to ensemble adversarial training for this issue, generating adversarial examples effective towards multiple algorithms (Section IV-B). In summary, the contributions of this study are two-fold:
• We have discovered the different vulnerabilities between human and algorithm to visual distortions. Based on the observations, adversarial perturbation is employed to improve the robustness of character-based CAPTCHAs.
• Corresponding to the characteristics of typical OCR cracking solutions, we propose a novel methodology addressing issues including sequential recognition, non-differentiable image preprocessing, stochastic image transformation and black-box cracking.

II. RELATED WORK

A. Character-based CAPTCHAs

In online services, character-based CAPTCHAs are the most popular protection to deter character recognition programs. Since the initial goal of CAPTCHA, friendly design and security of CAPTCHAs have been studied. A fundamental requirement is that CAPTCHAs be designed to be easy for humans but difficult for computers. In traditional CAPTCHA design, the trade-off between usability and security is difficult to balance. Three traditional designs are most common: background confusion, using lines, and collapsing [18]. There are also studies that use automatic generation methods to synthesize CAPTCHA images, e.g., using GANs, instead of manual design [19]. These automatic methods, which are applied to both character-based and image-based CAPTCHAs, are novel approaches for generating CAPTCHAs, but they still attempt to increase the content complexity of CAPTCHAs.

To overcome the limitations of traditional character-based CAPTCHAs, other designs have been proposed, e.g., 3D-based CAPTCHAs and animated CAPTCHAs [18]. 3D approaches to CAPTCHA design involve the rendering of 3D models to an image [20], [21]. However, it has been demonstrated that this approach is easy to attack [22], [23]. Animated CAPTCHAs attempt to incorporate a time dimension into the design. The addition of a time dimension is assumed to increase the security of the resulting CAPTCHA. Nevertheless, techniques that can successfully attack this CAPTCHA design have been developed [24].

The last few years have witnessed deep learning playing an important role in the field of artificial intelligence. The recognition rate of character-based CAPTCHAs increases year by year. George et al. proposed a hierarchical model called the Recursive Cortical Network (RCN) that incorporates neuroscience insights in a structured probabilistic generative model framework, which significantly improved the recognition rate [25]. To remove the interference in the background, Ye et al. proposed a GAN-based approach for automatically transforming training data and constructing solvers for character-based CAPTCHAs [26]. The convolutional neural network shows powerful performance in the recognition of various characters, including Chinese characters [27], but it has low recognition accuracy on confusion classes. To solve this problem, Chen et al. proposed a novel method of selectively learning confusion classes for character-based CAPTCHA recognition [28]. As the complexity of character-based CAPTCHAs increases, methods based on combining a convolutional neural network and a recurrent neural network achieve state-of-the-art performance [13]–[17]. In this paper, we employ an architecture consisting of convolutional neural network (CNN) layers and long short-term memory (LSTM) as the default OCR algorithm. We also test our CAPTCHAs on the latest method [17] in Section V-D, which is an attention-based model that also consists of CNN layers and LSTM.

B. Adversarial Examples

While deep learning has achieved great performance, it also has some security problems. Recent work has discovered that existing machine learning models, not just deep neural networks, are vulnerable to adversarial examples [12]. Given a trained classifier F with model parameters W, a


valid input x has a corresponding ground truth prediction y, i.e., y = F(x) with model parameters W. It is often possible to find a similar input x′ that is close to x according to some distance metric d(x, x′) and yet causes y ≠ F(x′) with model parameters W. An example x′ with this property is known as an untargeted adversarial example. A more powerful but more difficult variant, the targeted adversarial example, requires more than a misclassification: given a target label t with t ≠ y, it satisfies t = F(x′) with model parameters W.

Prior work that considers adversarial examples can be generally classified into two categories: white-box attack and black-box attack. A white-box attack has full knowledge of the trained classifier F, including the model architecture and model parameters W. A black-box attack has no or limited knowledge of the trained classifier F. The black-box setting is apparently harder for attackers than the white-box setting because no gradient information is leaked. It may seem that black-box attacks are impossible, but adversarial examples that affect one model can often affect another model, a property called transferability [29]. In this paper, we rely on transferability and deploy ensemble-based approaches to generate adversarial CAPTCHAs.

Szegedy et al. [12] first pointed out adversarial examples and proposed a box-constrained L-BFGS method to find them. To decrease the expensive computation, Goodfellow et al. [30] proposed the fast gradient sign method (FGSM) to generate adversarial examples by performing a single gradient step. Kurakin et al. [31] extended this method to an iterative version and found that adversarial examples can remain effective in the physical world. Dong et al. [32] further extended the fast gradient sign method family by proposing momentum-based iterative algorithms. In addition, there are some more powerful methods called optimization-based attack methods. DeepFool [33] is an attack technique optimized for the L2 distance metric. This method is based on the assumption that the decision boundary is partly linear, so that the distance and direction from data points to the decision boundary can be calculated approximately. C&W [34] is another targeted optimization-based method. It achieves its goal by increasing the probability of the target label.

To defend against adversarial examples, several adversarial defensive methods have been proposed, and this has been an active field of AI research. Referring to [35], we generally divide adversarial defensive methods into two categories. Athalye et al. [35] identify gradient masking, also called obfuscated gradients, which leads to a false sense of security in defenses against adversarial examples. The authors argue that the reason many adversarial defenses appear to defend against adversarial examples is that the fast and optimization-based methods cannot succeed without useful gradient information. The most common gradient masking methods include input transformation and stochastic gradients. Input transformation techniques, e.g., image cropping and image binarization, cause the gradients to be non-existent or incorrect. In this paper, image binarization makes the cracking pipeline non-differentiable if gradient masking is not overcome. Some adversarial defense methods randomize the network itself or randomly transform the input. These methods based on stochastic gradients lead adversarial attack methods that use a single sample of the randomness to incorrectly estimate the true gradient. Goodfellow et al. [30] first proposed the adversarial training method, in which adversarial examples are regarded as training samples and the model is fitted until these samples are classified correctly. The idea is effective and general for all types of adversarial attacks. It makes the network more robust against adversarial examples, but it costs expensive computation, especially at a large scale, e.g., the ImageNet [36] scale. In general, the existing defensive methods cannot completely eliminate adversarial attacks.

Many researchers have found that adversarial examples can be applied in other tasks, such as semantic segmentation [37], face detection [38], and even speech recognition [39] and translation [40]. The majority of the published papers have focused on how to eliminate the impact of adversarial examples in applications. Li et al. [41] evaluated adversarial examples among different detection services, such as violence, politician, and pornography detection. Ling et al. [42] proposed a uniform platform for comprehensive evaluation of adversarial attacks and defenses in applications, which can benefit future adversarial example research. In contrast, studies on employing adversarial examples against the malicious algorithm are relatively limited. Osadchy et al. [43] employed adversarial examples to design CAPTCHAs and analyzed the security and usability of the CAPTCHAs, but they only considered data types like MNIST and ImageNet instead of CAPTCHA data. Zhang et al. [44] studied the robustness of adversarial examples on different types of CAPTCHA and gave suggestions on how to improve the security of CAPTCHAs using adversarial examples. Shi et al. [45] improved the effectiveness of the adversarial examples by using the Fourier transform to generate CAPTCHA images in the frequency domain. However, they only considered generating adversarial examples on CNN systems, which is essentially an adversarial attack algorithm based on the classification task. In contrast, the current state-of-the-art CAPTCHA cracking system consists of a feature extraction module and a sequential recognition module (CNN + LSTM). Shi et al. [46] deployed character-based adversarial CAPTCHAs on a large-scale online platform and tested the proposed CAPTCHAs on convolutional recurrent neural networks [47]. However, they ignored experiments and discussions on adversarial defense technologies, such as image binarization and adversarial training. In Section V-D, we compare our method with ACs [45] to prove that considering sequential recognition is essential. In Section V-B, we show the necessity of considering image preprocessing.

III. DATA ANALYSIS

To justify the feasibility of employing algorithm limitations for CAPTCHA design and motivate our detailed solution, this section conducts data analysis to answer two questions: (1) Do human and algorithm have different vulnerabilities to visual distortion? (2) What characteristics should be considered when employing distortions to design robust CAPTCHAs?

Text-based CAPTCHA is the most widely deployed scheme, requiring subjects to recognize characters from 0-9 and A-Z. Due to its simplicity, character-based CAPTCHA is very


effective to examine the robustness towards cracking algorithms as well as the friendliness to humans. Therefore, this study employs character-based CAPTCHA as the example scheme to conduct data analysis, develop the solution and implement experiments. Specifically, during data analysis, we assume that each CAPTCHA question is constituted by a single character in an RGB image with a fixed resolution of 48 × 64 px. The character font is fixed as DroidSansMono. The remainder of the section reports the observations regarding human and algorithm character recognition performance in different scenarios.

Fig. 2. Human vs. algorithm vulnerability analysis results on Gaussian and adversarial distortions: (a) character distorted by Gaussian white noise; (b) character distorted by adversarial perturbation; (c) recognition accuracy on Gaussian distorted characters; (d) recognition accuracy on adversarially distorted characters.

Fig. 3. The effect of image preprocessing: (a) distorted characters before (top row) and after (bottom row) image binarization; (b) recognition accuracy on adversarially distorted characters before and after binarization.

A. Vulnerability Analysis to Visual Distortion

This subsection designs a character recognition competition between human and algorithm to analyze their vulnerability to visual distortions. We employed two types of visual distortions: (1) Gaussian white noise is one usual distortion used to generate CAPTCHAs. In this study, the added one-time Gaussian white noise follows a normal distribution with mean µ̃ = 0, variance σ̃ = 0.01 and constant power spectral density. (2) Adversarial perturbation has been recognized as imperceptible to humans but significantly confusing to algorithms. We employ the widely used FGSM [30] to add adversarial perturbation, where a one-time perturbation is constituted with a step size of 0.02. To examine the change of recognition performance with increasing distortion difficulty, we added 8 levels of distortions onto the original character images accumulatively: each level corresponds to 5 one-time Gaussian white noises and adversarial perturbations respectively. Examples of the derived distorted CAPTCHA images at different levels are illustrated in Fig. 2(a) and (b).

Regarding the human side, we recruited 77 master workers from Amazon Mechanical Turk (MTurk). Each subject was asked to recognize 450 character CAPTCHAs with Gaussian and adversarial distortions at different levels respectively. Regarding the algorithm side, we employed the state-of-the-art OCR (Optical Character Recognition) algorithm, a segmentation-based approach that works by segmenting a text line image into individual character images and then recognizing the characters [48]. The resultant average recognition accuracies for Gaussian and adversarially distorted CAPTCHAs are shown in Fig. 2(c) and (d). We can see that, for Gaussian distorted CAPTCHAs, human recognition accuracy consistently declines as the distortion level increases, indicating that Gaussian white noise tends to undermine human vision. On the contrary, the examined OCR algorithm demonstrates good immunity to Gaussian white noise, possibly due to the noise removal effect of multiple convolutional layers [49]. It is easy to imagine that


if we design CAPTCHAs by adding Gaussian white noise, then, as the noise level increases, the resultant CAPTCHAs will critically confuse humans instead of obstructing the cracking OCR algorithms.

For adversarially distorted CAPTCHAs, we observed quite the opposite recognition results. Fig. 2(d) shows that humans are more robust to the adversarial perturbations, while the OCR algorithm is highly vulnerable as the adversarial distortion increases. This is not surprising, since adversarial perturbation is specially crafted to change the algorithm decision under the condition of not confusing humans. This characteristic of adversarial perturbation demonstrates one important limitation of the algorithm with regard to human ability, which perfectly satisfies the requirement of robust CAPTCHA: the algorithm tends to fail, while the human remains successful. Therefore, we are motivated to employ adversarial examples to design robust CAPTCHAs that distinguish between algorithm and human.

B. Characteristics Affecting Robust CAPTCHA Design

The previous subsection observes that adversarial perturbation is effective in misleading the state-of-the-art OCR algorithm, which shows its potential for robust CAPTCHA design. However, a typical CAPTCHA cracking solution involves more than OCR; e.g., image preprocessing operations like binarization and Gaussian filtering will be applied to remove distortions before issuing the image to the OCR module. Fig. 3(a) illustrates adversarially distorted CAPTCHA images before and after binarization preprocessing. It is easy to conceive that the effectiveness of adversarial perturbation will be critically affected by image preprocessing operations.

We further quantified this effect by analyzing the OCR performance on the same adversarially distorted CAPTCHA images from the previous subsection. The recognition accuracies on the CAPTCHAs before and after binarization preprocessing are plotted and compared in Fig. 3(b). It is shown that, after removing most distortions via image binarization, the OCR algorithm demonstrates basically stable performance in recognizing CAPTCHAs with different levels of adversarial perturbation. This tells us that standard adversarial perturbation is insufficient to obstruct the cracking method. It is necessary to design the robust CAPTCHA solution considering the characteristics (like preprocessing operations) of the CAPTCHA cracking method.

IV. METHODOLOGY

As shown on the left of Fig. 4, typical cracking of character-based CAPTCHAs consists of two stages: image preprocessing and OCR. The above data analysis has demonstrated that image preprocessing has the effect of distortion removal, making it impossible to straightforwardly employ adversarial perturbation for robust CAPTCHA design. In addition to the image preprocessing stage, the OCR stage also possesses characteristics obstructing CAPTCHA design: (1) sequential recognition, disabling the traditional single character-oriented adversarial perturbation; and (2) black-box cracking, making it ineffective to attack one specific OCR model. To address the above characteristics of CAPTCHA cracking, our proposed CAPTCHA generation framework consists of three modules: multi-target attack, ensemble adversarial training, and image preprocessing differentiable approximation. The proposed framework and its relation to CAPTCHA cracking are illustrated on the right of Fig. 4.

A. Multi-target Attack towards Sequential Recognition

Typical CAPTCHAs usually contain more than one character for recognition; e.g., the example CAPTCHAs contain 4 characters. Therefore, state-of-the-art CAPTCHA cracking solutions are forced to address a sequential character recognition problem at the OCR stage [48]. Specifically, the OCR stage consists of three sub-modules: feature extraction, sequential recognition, and output decoding. Feature extraction is basically realized by a convolutional neural network to encode the input image as a neural feature. Sequential recognition is typically realized by a recurrent neural network to process the issued image neural feature and output multiple tokens including characters (0-9, A-Z) and the blank token ∅². Output decoding serves to transform the sequential tokens into the final character recognition result, by merging sequentially duplicated tokens and removing blank ∅ tokens. For example, the original token sequence "aa∅b∅∅ccc∅dd" will be transformed to "abcd".

While CAPTCHA cracking views OCR as a sequential recognition problem, standard adversarial perturbation is designed to attack a single target. In this study, we propose to attack multiple targets corresponding to the multiple tokens derived from OCR sequential recognition. The generated adversarial CAPTCHA image is expected to simultaneously misclassify all the character tokens. For a specific token sequence t, all the characters appearing in t constitute the original set Θ, while the remaining characters from (0-9, A-Z) constitute the adversary set Θ̄. Denoting the raw image as x and the corresponding adversary image as x′, the multi-target attack is formulated as the following optimization problem:

\min_{x'} \; d(x, x') + \lambda \cdot \sum_{\theta_i \in \Theta} \Big[ \max_{j \neq \bar{\theta}_i} F(x')_{\theta_i}^{j} - F(x')_{\theta_i}^{\bar{\theta}_i} \Big]_{+}    (1)

where d(·,·) is a distance function minimizing the modification from x to x′³, and λ is the weight parameter balancing between the image modification and the misclassification confidence in the second term. Within the second term, θi is a character appearing in the original set Θ, θ̄i is its one-to-one mapped character in the adversary set Θ̄, F(x′)_{θi} denotes the output of the second-to-last layer (the logits) corresponding to token θi after sequential recognition, F(x′)^j_{θi} denotes its j-th dimension, and [f]_+ is the positive part function max(f, 0). Note that the one-to-one mapping from θi to θ̄i can be either random or fixed. Random one-to-one mapping leads to targeted adversarial attack, and fixed mapping leads to non-targeted adversarial attack⁴ (a short code sketch of this loss and of output decoding follows the footnotes below).

²For typical 4 character-based CAPTCHAs, the recurrent neural network usually outputs a 12-token sequence to improve tolerance for segmentation and alignment [48].
³Alternative choices for the distance function are allowed. In our experiment, we use the L2 distance.
⁴The reported experimental results in Section V are based on random one-to-one mapping.
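To make the above concrete, the following PyTorch-style sketch illustrates the output decoding rule and the multi-target loss of Eqn. (1). It is a minimal illustration under stated assumptions (a recognizer returning per-position logits of shape [T, C], a hypothetical blank-token index, the L2 distance of footnote 3, and λ = 20 as used later in Section V-C); it is not the authors' released implementation.

```python
import torch

BLANK = 36  # hypothetical index of the blank token "∅"; 0-9 and A-Z occupy indices 0..35

def decode(token_ids, blank=BLANK):
    """Output decoding: merge sequentially duplicated tokens, then drop blanks,
    e.g. the index sequence for "aa∅b∅∅ccc∅dd" decodes to that of "abcd"."""
    out, prev = [], None
    for t in token_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

def multi_target_loss(logits, positions, adv_chars, x, x_adv, lam=20.0):
    """Eqn. (1): L2 modification distance plus a positive-part hinge per attacked
    token position, pushing each position theta_i towards its adversary character.

    logits    -- per-position logits F(x') of shape [T, C] from the sequential recognizer
    positions -- token positions corresponding to the attacked characters (the set Θ)
    adv_chars -- one-to-one mapped adversary character indices (the set Θ̄)
    """
    dist = torch.norm((x_adv - x).reshape(-1), p=2)            # d(x, x'), L2 as in footnote 3
    hinge = x_adv.new_zeros(())
    for pos, adv in zip(positions, adv_chars):
        row = logits[pos]                                      # F(x')_{θi}
        others = torch.cat([row[:adv], row[adv + 1:]])         # all dimensions j ≠ θ̄i
        hinge = hinge + torch.clamp(others.max() - row[adv], min=0.0)  # [·]_+
    return dist + lam * hinge

# usage sketch: x_adv = x.clone().requires_grad_(True)
#   loss = multi_target_loss(recognizer_logits(x_adv), positions, adv_chars, x, x_adv)
#   loss.backward(); x_adv is then updated by gradient descent and clamped to [0, 1].
```

Attacking fewer positions simply corresponds to passing a shorter `positions` list, which matches the flexible-|Θ| discussion in the next subsection.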


Fig. 4. The proposed robust CAPTCHA designing framework. The left represents the process of CAPTCHA cracking, including sequential recognition, feature extraction, image binarization (Gaussian filtering) and stochastic transformation. The right represents our solution of CAPTCHA generation, including the corresponding multi-target attack, ensemble adversarial training, differentiable approximation and expectation, respectively.

When the original set Θ contains only one character, the multi-target attack reduces to a single-target attack, i.e., the standard adversarial perturbation. In fact, according to the mechanism of output decoding in CAPTCHA cracking, we only need to misclassify any one of the character tokens to invalidate the final recognition result. Eqn. (1) provides a general case of attacking a flexible number of character tokens. In practice, the number of attacked characters is one important parameter to control the model performance. Attacking more characters guarantees a higher success rate in resisting cracking, yet leads to more derived distortion and a heavier human recognition burden. The quantitative influence of the attacked character number on the image distortion level and algorithm recognition rate is discussed in Section V-C.

B. Ensemble Adversarial Training towards Black-box Crack

As mentioned in Section I, CAPTCHA cracking may employ multiple OCR algorithms for character recognition. At the stage of designing CAPTCHAs, it is impractical to target one specific OCR algorithm, which requires us to design adversarial CAPTCHA images that are effective against as many OCR algorithms as possible. Fortunately, it is recognized that adversarial perturbation is transferable between models: if an adversarial image remains effective for multiple models, it is more likely to transfer to other models as well [29]. Inspired by this, in order to improve the resistance to unknown cracking models, we propose to generate adversarial images simultaneously misleading multiple models.

Specifically, given K white-box OCR models whose second-to-last layer outputs are J1, ..., JK, we re-formulate the objective function in Eqn. (1) by replacing F(x′) with F̃(x′) defined as follows:

\tilde{F}(x') = \sum_{k=1}^{K} \alpha_k J_k(x')    (2)

where αk is the ensemble weight with \sum_{k=1}^{K} \alpha_k = 1. In most cases αk = 1/K, unless one model is more important than the others. Among the three sub-modules of the OCR stage, feature extraction has the most model choices (e.g., various CNN structures such as GoogLeNet [50] and ResNet [51]), which can be easily implemented into different CAPTCHA cracking solutions. Therefore, this study addresses the black-box cracking issue by attacking multiple feature extraction models. Specifically, the training data and basic structure of Jk(x′) and F(x′) are identical except for the different CNN structures in the feature extraction sub-module. Regarding the number of CNN structures, the larger the value of K, the stronger the generalization capability of the derived adversarial CAPTCHA images. However, an excessive K value leads to high computational complexity and trivially small weights αk that underemphasize each single model. Referring to previous studies on ensemble adversarial attack [52], 3 ∼ 5 models achieve a good balance between transferability and practicality. In this study, we select K = 3 and evenly set αk = 1/3. The experimental results in [52] show that, under the same training set, adversarial examples can achieve stronger transferability when the network structures are more similar, and it is reasonable to choose models with large structural differences to employ in ensemble adversarial training. The performance of employing ensemble adversarial training to resist different OCRs is reported in Section V-D.

C. Differentiable Approximation towards Image Preprocessing

The data observations in Section III-B demonstrate the distortion removal consequences of the binarization operation, requiring us to consider the effect of image preprocessing in adversarial image generation. To address this, we regard the image preprocessing operation as part of the entire end-to-end solution, so that we can generate adversarial images that effectively mislead the whole cracking solution.

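Both the ensemble of Eqn. (2) and this end-to-end treatment of preprocessing can be sketched compactly. The following PyTorch-style fragment is an illustrative sketch under stated assumptions: member OCR models returning per-position logits, the 3×3 Gaussian kernel with σ = 0.8 and the binarization constants τ = 0.8, ω = 0.05 reported in Section V, and a standard straight-through trick for the hard/soft binarization split formalized in the remainder of this subsection. It is not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_blur(x, sigma=0.8, ksize=3):
    """Differentiable Gaussian filtering g(x'), applied channel-wise with a fixed kernel."""
    half = ksize // 2
    coords = torch.arange(ksize, dtype=x.dtype, device=x.device) - half
    g1d = torch.exp(-coords ** 2 / (2.0 * sigma ** 2))
    kernel = torch.outer(g1d, g1d)
    kernel = (kernel / kernel.sum()).view(1, 1, ksize, ksize)
    c = x.shape[1]
    return F.conv2d(x, kernel.repeat(c, 1, 1, 1), padding=half, groups=c)

def binarize_ste(x, tau=0.8, omega=0.05):
    """Binarization with a straight-through surrogate: the forward value is the actual
    hard threshold, while gradients flow through the sigmoid relaxation s(x') of Eqn. (3)."""
    soft = torch.sigmoid((x - tau) / omega)   # s(x'), used only for the backward pass
    hard = (x > tau).float()                  # actual binarization, used in the forward pass
    return soft + (hard - soft).detach()

def phi(x):
    """phi(x') = s(g(x')): the approximated preprocessing placed in front of the OCR."""
    return binarize_ste(gaussian_blur(x))

class EnsembleOCR(nn.Module):
    """Eqn. (2): weighted average of per-position logits from K white-box OCR models that
    share the recognition pipeline but differ in their CNN feature extractors."""
    def __init__(self, models, weights=None):
        super().__init__()
        self.models = nn.ModuleList(models)                           # J_1 ... J_K
        self.weights = weights or [1.0 / len(models)] * len(models)   # α_k, e.g. 1/3 for K = 3

    def forward(self, x):
        return sum(a * m(x) for a, m in zip(self.weights, self.models))

# the objective of Eqn. (4) then evaluates the multi-target loss on EnsembleOCR(models)(phi(x_adv)).
```

With this composition, both the forward and backward passes of the attack see the same preprocessing the cracker would apply, which is exactly the end-to-end treatment motivated above.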
According to the usability of being incorporated into the end-to-end solution, image preprocessing operations can be roughly divided into two categories: differentiable and non-differentiable. For each category, we select one representative operation to address in this study, i.e., Gaussian filtering and image binarization. Regarding the differentiable Gaussian filtering operation, g(x') = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{x'^2}{2\sigma^2}}, we can readily incorporate it into the OCR model (Eqn. (1), Eqn. (2)) by replacing the input image x′ with the preprocessed image g(x′). Both forward and backward propagation are conducted on the replaced function F(g(x′)), leading to generated adversarial images that are expected to eliminate the effect of Gaussian filtering.

Regarding the non-differentiable image binarization, we cannot straightforwardly incorporate it into the objective function. Instead, we find a differentiable approximation s(x′) to image binarization and incorporate the approximated function into the end-to-end solution. In this study, s(x′) is defined as follows:

s(x') = \frac{1}{1 + e^{-\frac{x' - \tau}{\omega}}}    (3)

where τ denotes the threshold of image binarization and ω denotes the degree of lateral expansion of the curve. Note that, to guarantee that the generated adversarial images are resistant to image binarization, we only employ the approximated s(x′) at the backward propagation stage to update the generated image, while the forward propagation still uses the actual x′ to calculate ∇_x F(x).

To simultaneously resist the effects of Gaussian filtering and image binarization, we concatenate s(·) and g(·) in the final objective function. Therefore, the overall optimization problem incorporating the three proposed modules is as follows:

\min_{x'} \; d(x, x') + \lambda \cdot \sum_{\theta_i \in \Theta} \Big[ \max_{j \neq \bar{\theta}_i} \tilde{F}(\phi(x'))_{\theta_i}^{j} - \tilde{F}(\phi(x'))_{\theta_i}^{\bar{\theta}_i} \Big]_{+}    (4)

where F̃(·) denotes the ensemble of multiple OCR models defined in Eqn. (2), and φ(x′) = s(g(x′)) denotes the approximated image preprocessing operations defined in Eqn. (3).

D. Expectation towards Stochastic Image Transformation

The above three subsections are sufficient for general CAPTCHA generation against OCR cracking solutions. However, a potential cracker could use a number of transformations to make the adversarial perturbations meaningless; e.g., the cracker could slightly rotate the image, and doing so entirely bypasses general adversarial examples. Prior work has shown that such adversarial examples fail to remain effective under image transformations, and that optimizing the gradient of the expected value can keep examples adversarial even under various image transformations [53]. To integrate the image preprocessing issues with potential transformations, we compute the expectation over stochastic image transformations, including rotation at different angles. The expectation allows us to construct adversarial examples that remain adversarial over a chosen transformation distribution T. More concretely, we use a chosen distribution T of transformation functions t taking an input x controlled by the adversary to the "true" input t(x) perceived by the OCR, rather than optimizing the objective function of a single example. We then re-formulate the second term in Eqn. (4) by replacing x, x′ with t(x), t(x′), as follows:

\sum_{\theta_i \in \Theta} \mathbb{E}_{t \sim T} \Big[ \max_{j \neq \bar{\theta}_i} \tilde{F}(\phi(t(x')))_{\theta_i}^{j} - \tilde{F}(\phi(t(x')))_{\theta_i}^{\bar{\theta}_i} \Big]_{+}    (5)

Furthermore, rather than simply taking d(·,·) to constrain the solution space, we instead aim to constrain the expected effective distance between the adversarial and original inputs. The first term in Eqn. (4) is replaced by the following:

\mathbb{E}_{t \sim T} \big[ d(t(x), t(x')) \big]    (6)

In practice, the distribution T can model perceptual transformations such as color change, image translation, random rotation, or addition of noise. These transformations amount to a set of random linear combinations, which are more thoroughly described in Section V-E. We can then approximate the gradient of the expected value by sampling transformations independently at each gradient descent step while optimizing the objective function and differentiating through the transformation. Given its ability to generate robust adversarial CAPTCHA images, we use the gradient of the expected value to directly eliminate the effect of stochastic transformation for differentiable image transformations. Regarding non-differentiable image transformations, we cannot straightforwardly differentiate through the transformation. Instead, we can use the same strategy as in Section IV-C: find a differentiable approximation and incorporate the approximated function into the end-to-end solution.

V. EXPERIMENTS

We examined CAPTCHA images with 4 characters for the experiments. The CAPTCHAs are RGB images with a resolution of 192 × 64 px. Regarding the cracking method, we considered image binarization and Gaussian filtering (kernel size: 3 × 3, σ = 0.8) at the image preprocessing stage. The OCR stage is instantiated with CNN structures for feature extraction and LSTM+softmax for sequential recognition. Regarding our proposed CAPTCHA generation method, image binarization is approximated with τ = 0.8, ω = 0.05, and 4 CNN structures are employed for ensemble adversarial training. All experiments are conducted on an Nvidia GTX 1080Ti GPU with 11 GB memory.

A. Qualitative Attention Analysis

Visual attention has been widely used to explain which region of an image contributes most to the model decision [54]. In this study, we extracted the attention map using Grad-CAM [55] to understand the change of recognition performance under different visual distortions.

The first and second columns of Fig. 5 visualize the attention maps of the raw image and the image with Gaussian white noise. It can be found that Gaussian white noise brings


trivial attention change from the original image. Both attention maps keep to the region where characters exist. This well explains the data observation in Section III-A that the algorithm is generally robust to Gaussian white noise. We also visualize the attention maps of the CAPTCHA images generated by our proposed method in the third column of Fig. 5. It is shown that the attention maps deviate much from those of the original image and focus on unrelated regions where no characters exist. This justifies our motivation to employ adversarial perturbation to mislead the algorithm prediction result, and demonstrates the effectiveness of our proposed CAPTCHA design method.

To further validate the necessity of considering image preprocessing in robust CAPTCHA design, the attention maps for the images generated by our method but without considering image preprocessing are shown in the fourth column of Fig. 5 for comparison. It is easy to conceive that, without considering image preprocessing, the generated images fail to deviate the attention from the character regions. This is consistent with the fact that image preprocessing has the effect of weakening or eliminating adversarial perturbation.

Fig. 5. Example images (top row) and their attention maps (bottom row). From left to right, we show the original image, the image with Gaussian white noise, the adversarial image generated by our method, and the adversarial image generated by our method but without considering image preprocessing.

TABLE I
The recognition accuracy at different complexity levels of CAPTCHAs in the different settings. The results of the algorithm are obtained after Gaussian filtering and image binarization.

Level    Subject      Raw      rCAPTCHA parallel   rCAPTCHA w/o preprocessing   rCAPTCHA
Easy     algorithm    100.0%   95.6%               68.4%                        0.0%
         human        99.0%    94.0%               94.0%                        94.0%
Medium   algorithm    91.0%    88.0%               58.0%                        0.0%
         human        73.0%    51.0%               67.0%                        65.0%
Hard     algorithm    81.0%    83.0%               45.0%                        4.0%
         human        56.0%    36.0%               51.0%                        49.0%

B. Quantitative Performance Comparison

To compare the performance of the proposed robust CAPTCHA (rCAPTCHA) designing method, we report the recognition accuracies of the state-of-the-art cracking solution under the following settings:
• Raw: the original CAPTCHA images without adding adversarial perturbations;
• rCAPTCHA parallel: the proposed solution to generate adversarial images, except that the sequential recognition sub-module of OCR is replaced by 4 parallel recognition networks (each realized by one fully-connected layer) to address one character's recognition;
• rCAPTCHA w/o preprocessing: the proposed solution to generate adversarial images, but without considering the image preprocessing stages;
• rCAPTCHA: the proposed solution to generate adversarial images, considering both sequential recognition and image preprocessing operations.

Fig. 6. Example CAPTCHAs with different complexity levels (from top to bottom: easy, medium, hard). Each row from left to right shows the different settings of Raw, rCAPTCHA parallel, rCAPTCHA w/o preprocessing and rCAPTCHA.

The state-of-the-art cracking solution is trained over 20,000 CAPTCHA images with batch size 128. To examine the application scope of the proposed CAPTCHA generation methods, we conducted experiments on CAPTCHAs with three levels of complexity: easy, medium, hard. Fig. 6 shows examples of different complexity levels of CAPTCHAs in the above four settings. For each of the settings, we selected/generated 500 CAPTCHA images for testing, and we summarize the derived average recognition accuracy in Table I. Experimental observations include: (1) By adding adversarial perturbations, the right 3 columns consistently obtain lower accuracies than the first column, showing the usability of employing adversarial perturbations in resisting CAPTCHA cracking. (2) Without considering the sequential recognition or image preprocessing characteristics, the resisting effect of rCAPTCHA parallel and rCAPTCHA w/o preprocessing is not as obvious as that of rCAPTCHA. This validates the necessity of the multi-target attack and differentiable approximation modules. (3) Regarding CAPTCHAs with different complexities, we observed consistent phenomena among the four settings, demonstrating the wide application scope of the proposed CAPTCHA generation method.

The notable decrease in algorithm recognition accuracies shows the effectiveness of employing adversarial perturbation to mislead the cracking solution. To facilitate the understanding of the correlation between misleading the cracking solution and friendliness


to human recognition, we also provide the human recognition accuracy for each experimental setting in Table I. Similar to the data analysis, we recruited 164 workers from MTurk to recognize the 4-character CAPTCHA images. The reported accuracies are averaged over 1,200 CAPTCHAs. By comparing different rows, it is shown that the increasing content complexity brings a slight decrease of algorithm recognition accuracy but causes huge trouble to human recognition. Among the different setting columns, while the algorithm recognition accuracy fluctuates a lot, the human recognition performance basically remains stable, validating the different distortion vulnerabilities between human and algorithm. In summary, regarding CAPTCHA images with different complexities, the proposed CAPTCHA generation method succeeds in invalidating the cracking algorithm without increasing the human recognition burden.

Fig. 7. The influence of λ on derived image distortion and cracking recognition accuracy: (a) image distortion; (b) algorithm recognition accuracy.

Fig. 8. The influence of |Θ| on derived image distortion and cracking recognition accuracy: (a) image distortion; (b) algorithm recognition accuracy.

C. Parameter Influence Analysis

The proposed robust CAPTCHA generation method mainly involves two parameters: the weight parameter λ in Eqn. (1) and the number of attacked characters |Θ|. As introduced in the methodology, the weight parameter λ controls the relative importance between the visual distortion and the misclassification confidence. We adjusted λ within the range [10, 30] with a step of 1 and examined its influence on the derived image distortion and cracking recognition accuracy. The image distortion is measured as the sum of squared pixel-wise differences between the original and adversarial images.

Fig. 9. Example CAPTCHAs with different image distortions: from left to right, images with distortions of 100, 200, 300 and 400.

The averaged distortion and recognition accuracy with the change of λ are drawn in Fig. 7. It is shown that, as λ increases, more image distortion is observed in the derived CAPTCHA images and the cracking method tends to fail in recognizing the generated CAPTCHAs. This is consistent with the definition of λ in Eqn. (1). In practical applications, to avoid annoying human subjects in recognizing the generated CAPTCHAs, an appropriate λ is selected with moderate image distortion and guaranteed cracking-resistant performance. Our experimental results reported in Section V-B are based on λ = 20.

Regarding the number of attacked characters, we set |Θ| to {1, 2, 3, 4} respectively and examined the corresponding averaged image distortion and algorithm recognition accuracy in Fig. 8. As shown in Fig. 8(a), as |Θ| increases, more image distortion is needed to misclassify the characters. In Fig. 9 we show example CAPTCHAs generated by rCAPTCHA with different levels of image distortion. Combining this with Fig. 8(a), it is demonstrated that using the proposed rCAPTCHA method


TABLE II
T RANSFERABILITY OF ADVERSARIAL IMAGES GENERATED BETWEEN PAIRS OF MODELS . T HE ELEMENT (i, j) REPRESENTS THE ACCURACY OF THE
ADVERSARIAL IMAGES GENERATED FOR MODEL i ( ROW ) TESTED OVER MODEL j ( COLUMN ).

Testing Model
Training Model
4ConvNet ResNet DenseNet GoogLeNet GoogLeNet w/ Average
Method No Attack 93% 84% 95% 97% 94% 93%
4ConvNet 54% 68% 82% 71% 70% 69%
ACs ResNet 47% 63% 73% 53% 64% 60%
DenseNet 53% 70% 60% 71% 71% 65%
4ConvNet 1% 3% 13% 7% 16% 8%
ResNet 8% 0% 3% 12% 11% 7%
rCAPTCHA DenseNet 16% 2% 3% 23% 27% 14%
Ensemble 0% 2% 1% 3% 7% 3%
GoogLeNet w/ 12% 11% 18% 3% 1% 9%

to attack even all 4 characters, the derived CAPTCHAs are summarizes the black-box cracking recognition accuracy under
generally friendly to human and not bringing extra recognition different training-testing pairs. For example, the value of
burden. As shown in Fig. 8(b), the increase of |Θ| enhances 91% at element (1, 1) represents the recognition accuracy of
the confidence to mislead the cracking algorithm and obtains original CAPTCHA images on 4ConvNet. The value of 0%
consistently lower recognition accuracy. With the introduction at element (8, 1) indicates the recognition accuracy trained
of multi-attack towards sequential recognition, the proposed with ensembled 3 white-box models and tested on 4ConvNet.
rCAPTCHA method possess the flexibility to attack arbitrary Lower accuracy value means superior resistant performance to
number of characters. In our experiments, to guarantee the cracking solutions and better transferability of the method.
resistance capability, we fixed |Θ| = 4.
D. Robustness towards Different OCRs

To compare the generalization and transferability of our proposed rCAPTCHA method and ACs [45], we implemented different cracking methods and examined their recognition accuracy on the generated CAPTCHAs. For generating CAPTCHAs with our method, we respectively trained 3 OCR models with different CNN structures, denoted as 4ConvNet, mini-ResNet and mini-DenseNet. For generating CAPTCHAs of [45], we trained the same OCRs as above, except that the sequential recognition sub-module (LSTM) of the OCR is replaced by 4 parallel recognition networks (fully-connected layers). For testing the CAPTCHAs, we trained 2 further OCR models: one is the same OCR with the CNN structure of mini-GoogLeNet, and the other is mini-GoogLeNet w/ attention [17], which also uses mini-GoogLeNet but adopts the attention mechanism in the LSTM. 4ConvNet uses four convolutional layers for feature extraction. The LSTM input requires a fixed-size feature vector, so we modified the native networks. The mini-XNets are employed due to their quicker convergence and the low resolution of CAPTCHA images: mini-ResNet consists of five ResBlocks and two convolutional layers, mini-DenseNet consists of four DenseBlocks with four convolutional layers, and mini-GoogLeNet consists of two Inception modules with six convolutional layers.
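As a reference for this setup, the sketch below shows one way such a mini CNN+LSTM OCR can be wired together in PyTorch. The layer widths, the alphabet size NUM_CLASSES, the number of characters NUM_CHARS, and the way per-character logits are read out are illustrative assumptions, not the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 36   # assumed alphabet: 10 digits + 26 letters
NUM_CHARS = 4      # characters per CAPTCHA

class MiniOCR(nn.Module):
    """Illustrative CNN (feature extraction) + LSTM (sequential recognition) OCR."""
    def __init__(self):
        super().__init__()
        # A 4ConvNet-style backbone: four conv blocks with assumed channel widths.
        chans = [3, 32, 64, 128, 256]
        self.backbone = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.ReLU(), nn.MaxPool2d(2))
            for i in range(4)
        ])
        self.rnn = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, NUM_CLASSES)

    def forward(self, x):                     # x: [B, 3, H, W]
        feat = self.backbone(x)               # [B, 256, H', W']
        feat = feat.mean(dim=2)               # collapse height -> [B, 256, W']
        feat = feat.permute(0, 2, 1)          # width-wise feature sequence
        out, _ = self.rnn(feat)               # [B, W', 128]
        # Read out one prediction per character position (assumes W' >= NUM_CHARS).
        steps = torch.linspace(0, out.size(1) - 1, NUM_CHARS, device=out.device).long()
        return self.head(out[:, steps, :])    # [B, NUM_CHARS, NUM_CLASSES]
```

The per-position logits produced here are also the interface assumed by the attack sketches in this section.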
The three models 4ConvNet, mini-ResNet and mini-DenseNet are selected as white-box models, with the remaining mini-GoogLeNet model and mini-GoogLeNet w/ attention as the black-box models. The black-box models are regarded as the potential OCR crackers, simulating the alternative cracking choices in real-world applications. Averaged over 100 tested CAPTCHAs, Table II summarizes the black-box cracking recognition accuracy under different training-testing pairs. For example, the value of 91% at element (1, 1) represents the recognition accuracy of original CAPTCHA images on 4ConvNet. The value of 0% at element (8, 1) indicates the recognition accuracy of CAPTCHAs generated with the ensemble of 3 white-box models and tested on 4ConvNet. A lower accuracy value means superior resistance to cracking solutions and better transferability of the method.

In the top half of Table II (ACs), the adversarial CAPTCHAs generated without considering the sequential recognition obtain higher average accuracies than rCAPTCHA. The accuracies of ACs are no lower than 60% when image preprocessing and image transformation are not involved. It is expected that employing image preprocessing and image transformation can further increase the accuracy, and 60% already means that these adversarial CAPTCHAs are almost fully recognized by the OCR. This demonstrates that, with other variables controlled, considering the sequential recognition problem is more important than adopting the Fourier transform during the generation of adversarial CAPTCHAs. In the bottom half of Table II (rCAPTCHA), we can observe that the adversarial images generated with one model perform well on their own models but generally perform poorly on other models. However, if we generate the CAPTCHA images with ensemble training of 3 models, the testing recognition accuracies for all 5 models are no higher than 7%. This demonstrates the transferability of the proposed rCAPTCHA method in employing ensemble training towards black-box cracking. Specifically, the value of 7% at element (8, 4) demonstrates that our method performs well on GoogLeNet w/ attention (GoogLeNet w/). The reason is that the network structures of current state-of-the-art OCR models are similar (CNN + LSTM). We also generate the adversarial images on GoogLeNet w/ attention (the last row), which validates the generalization of our proposed mechanism. It is expected that with more models implemented in ensemble training, the resistance towards arbitrary black-box cracking methods will be further guaranteed. In practical applications, we can carefully select white-box models with substantially different structures to improve the generalization and transferability to specific models.
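As a rough illustration of the ensemble training used against black-box cracking, the sketch below (an assumed PGD-style loop; the step count, step size, and budget are placeholders) averages the multi-target losses of the white-box OCR models before each gradient step, so the perturbation is not over-fitted to any single model.

```python
import torch
import torch.nn.functional as F

def ensemble_attack(models, image, labels, theta, steps=40, eps=8/255, alpha=2/255):
    """Generate an adversarial CAPTCHA against an ensemble of white-box OCR models.

    models : list of OCR networks, each returning [B, num_chars, num_classes] logits
    """
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Average the multi-target losses over the white-box ensemble so the
        # perturbation transfers better to unseen (black-box) OCR models.
        loss = 0.0
        for m in models:
            logits = m(adv)
            loss = loss + sum(F.cross_entropy(logits[:, t, :], labels[:, t]) for t in theta)
        loss = loss / len(models)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            adv = image + (adv - image).clamp(-eps, eps)
            adv = adv.clamp(0, 1).detach()
    return adv
```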
TABLE III
DISTRIBUTION OF TRANSFORMATIONS

Transformation        Minimum       Maximum
Rotation              −15°          15°
X-axis Translation    −1/4 width    1/4 width
Y-axis Translation    −1/4 height   1/4 height
Color Change (RGB)    −50           50
Rescale               0.15          1.65

Fig. 10. Example CAPTCHAs with different complexity levels (from top to bottom: easy, medium, hard). Each row from left to right shows the different settings of image transformation.

TABLE IV
THE RECOGNITION ACCURACY OF RAW IMAGES AND ADVERSARIAL IMAGES. THE RESULTS ARE OBTAINED AFTER STOCHASTIC TRANSFORMATION.

Images        Easy     Medium   Hard
Raw           95.0%    91.6%    73.1%
Adversarial   1.3%     5.4%     9.0%

Fig. 11. The recognition accuracy of relaxed adversarial training.

E. Robustness towards Stochastic Image Transformation

To justify the performance of our method under stochastic image transformation, we implemented a set of transformations on CAPTCHAs of different complexity levels and examined the recognition accuracies on the generated CAPTCHAs. Specifically, we considered a distribution of transformations that includes rotation, color change, rescale, and translation of the image. We selected/generated 1,000 images for each of the settings, randomly chose an image transformation for each image, and examined whether our method is robust over the chosen distribution. For each adversarial example, we evaluated over 100 stochastic transformations sampled from the distribution at evaluation time. The specific parameters of the stochastic transformations are given in Table III, where each parameter is sampled uniformly at random from the specified range. These stochastic transformations include color change, image translation, rotation and rescale.

Table IV summarizes the results. Experimental observations include: (1) Under stochastic transformations, the second row consistently obtains lower accuracies than the first row, showing the usability of employing expectation towards stochastic image transformation. (2) After stochastic transformations, the accuracies of raw images become lower than before the transformations. The decrease in algorithm recognition accuracy shows that stochastic transformation does hurt the performance on normal images, which is caused by the loss of image information and the shortcomings of neural networks, e.g., the lack of rotation-invariance. The increase from 0.0%, 0.0%, 4.0% to 1.3%, 5.4%, 9.0% means that stochastic transformation partially plays a role in eliminating the effectiveness of adversarial examples. Because stochastic transformation is a kind of adversarial defense technology, more specific discussions about adversarial defense are reported in Section V-F. The reason that the algorithm recognition accuracy increases along with the complexity level of the CAPTCHAs is that the noise associated with “hard” CAPTCHA images mixes with the adversarial perturbation, and thus the adversarial perturbation is weakened as the image gets more complex. By comparing the results in Table I, we find that the larger the “approximation” component is, the more clearly we can detect this trend.

Fig. 10 shows example CAPTCHAs of different complexity levels after stochastic transformations, combined with the corresponding non-transformed images, showing that these images are still adversarial after different image transformations. This demonstrates that our approach is effective in generating robust adversarial examples.

Nevertheless, when the number of transformations the cracker chooses is too large, expectation will occasionally fail to generate an effective adversarial example. In this study, the number we chose is 4 and the scope of the transformations is also reasonable. Another case occurs when we replace non-differentiable image preprocessing with its differentiable approximation. Because the essence of expectation is also an “approximation”, when this “double approximation” occurs, the effectiveness of the adversarial example will be weakened.
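The evaluation-time sampling described above can be implemented directly from the ranges in Table III; the helper below is a PIL-based sketch in which the function name and the padding behaviour of the rescale branch are assumptions, since these implementation details are not spelled out here.

```python
import random
import numpy as np
from PIL import Image

def sample_transform(img: Image.Image) -> Image.Image:
    """Apply one transformation drawn from the distribution in Table III."""
    w, h = img.size
    choice = random.choice(["rotate", "translate", "color", "rescale"])
    if choice == "rotate":
        return img.rotate(random.uniform(-15, 15), resample=Image.BILINEAR)
    if choice == "translate":
        tx = random.uniform(-w / 4, w / 4)
        ty = random.uniform(-h / 4, h / 4)
        return img.transform(img.size, Image.AFFINE, (1, 0, tx, 0, 1, ty))
    if choice == "color":
        shift = random.randint(-50, 50)
        arr = np.clip(np.asarray(img).astype(int) + shift, 0, 255).astype(np.uint8)
        return Image.fromarray(arr)
    # rescale, then paste back onto a canvas of the original size
    s = random.uniform(0.15, 1.65)
    scaled = img.resize((max(1, int(w * s)), max(1, int(h * s))))
    canvas = Image.new(img.mode, (w, h))
    canvas.paste(scaled, (0, 0))
    return canvas
```

Evaluating each adversarial example over, e.g., 100 such samples approximates the expectation reported in Table IV; folding the same distribution into the attack additionally requires a differentiable counterpart of these transformations.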
F. Discussion on Adversarial Defense

The non-differentiable image preprocessing and the stochastic image transformation are both aimed at weakening the effectiveness of adversarial examples, and are called adversarial defenses. However, the existence of robust adversarial CAPTCHAs implies that defenses based on stochastically transforming the input are not secure: adversarial examples generated
by using differentiable approximation and expectation can circumvent these defenses. Prior work has shown that most adversarial defenses are based on obfuscated gradients [35]; in this study: (1) non-differentiable image preprocessing is a kind of shattered gradients, i.e., nonexistent or incorrect gradients caused either intentionally through non-differentiable operations or unintentionally through numerical instability; (2) stochastic image transformation is a kind of stochastic gradients, which depend on test-time randomness [35]. But if the obfuscated gradient information can be approximated, such defenses can only provide a false sense of security.
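One common way to approximate gradients through a shattered, non-differentiable preprocessing step, in the spirit of [35], is the straight-through (BPDA) trick sketched below; this is one illustrative instantiation of differentiable approximation rather than the exact module used in rCAPTCHA.

```python
import torch

class PreprocessBPDA(torch.autograd.Function):
    """Backward-pass differentiable approximation (BPDA) of a
    non-differentiable preprocessing step: apply the true preprocessing
    in the forward pass, but treat it as the identity when back-propagating."""

    @staticmethod
    def forward(ctx, x, preprocess):
        # `preprocess` is the non-differentiable operation, e.g. binarization or JPEG.
        return preprocess(x.detach())

    @staticmethod
    def backward(ctx, grad_output):
        # Identity approximation: pass the gradient straight through to x.
        return grad_output, None
```

An adversarial example generator would then back-propagate through PreprocessBPDA.apply(images, preprocess_fn) as if the preprocessing were differentiable.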
There is another kind of adversarial defense technology called adversarial training, which does not depend on obfuscated gradients. Adversarial training solves a min-max game through a conceptually simple process: train on adversarial examples until the model learns to classify them correctly [30]. To further validate this adversarial defense, we study the adversarial training approach of [56] in this subsection. For this scenario, we generated/selected 500 adversarial CAPTCHA images for testing. We then started to fine-tune the model for 50,000 steps. However, due to the complexity of the OCR model compared with a common CNN classifier, standard adversarial training does not show its usual effectiveness: after training for 50,000 steps, the accuracy of the OCR is still 0%. We therefore relax the constraints of standard adversarial training to examine whether the idea of adversarial training can work at all, and fine-tune the model directly on the same 500 adversarial images that serve as the test set. The results are shown in Fig. 11, from which we make the following observations. As the training data increases, adversarial training significantly improves the accuracy of the OCR model, but its shortcoming is also obvious: the time cost gradually increases. Moreover, if a cracker wants to use adversarial training, he is supposed to have access to the training dataset we use and the parameters of the algorithm we choose, e.g., the distance function used to minimize the modification.
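For completeness, adversarial training in the sense of [56] amounts to the loop sketched below, in which adversarial examples are regenerated on the fly and used as training data; the attack routine, optimizer, and step counts are illustrative assumptions, and the relaxed variant discussed above simply replaces the freshly generated examples with the fixed set of 500 pre-generated images.

```python
import torch
import torch.nn.functional as F

def adversarial_finetune(ocr_model, loader, attack_fn, steps=50_000, lr=1e-4):
    """Min-max adversarial training: craft adversarial examples, then fit them.

    attack_fn(model, images, labels) should return perturbed images,
    e.g. a PGD attack against the per-character losses.
    """
    opt = torch.optim.Adam(ocr_model.parameters(), lr=lr)
    it = iter(loader)
    for step in range(steps):
        try:
            images, labels = next(it)
        except StopIteration:
            it = iter(loader)
            images, labels = next(it)
        # Inner maximization: generate adversarial examples for the current model.
        adv = attack_fn(ocr_model, images, labels)
        # Outer minimization: update the model to classify them correctly.
        logits = ocr_model(adv)                    # [B, num_chars, num_classes]
        loss = sum(F.cross_entropy(logits[:, t, :], labels[:, t])
                   for t in range(logits.size(1)))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ocr_model
```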
We have discussed adversarial defense technologies against the adversarial CAPTCHAs we propose. The defense technologies based on obfuscated gradients cannot hinder this type of CAPTCHAs. The adversarial training based on non-obfuscated gradients is still effective, but of limited practicality.

VI. CONCLUSION

This study designs robust character-based CAPTCHAs to resist cracking algorithms by employing their unrobustness to adversarial perturbation. We have conducted data analysis and observed human's and algorithm's different vulnerabilities to visual distortions. Based on this observation, the robust CAPTCHA (rCAPTCHA) generation framework is introduced with four modules of multi-target attack, ensemble adversarial training, differentiable approximation to image preprocessing, and expectation to stochastic image transformation. Qualitative and quantitative experimental results demonstrate the effectiveness of the generated CAPTCHAs in resisting cracking algorithms. We ascribe the main contribution not to proposing a specific CAPTCHA system, but to introducing the idea of exploiting algorithm unrobustness to increase the robustness of an automated system towards cracking. Character-based CAPTCHA, which is most friendly and effective to human, is used to validate this idea with the simplest scheme. This idea is expected to be easily adopted to generate robust image-based and other CAPTCHAs.

It is noted that, similar to the game competition between adversarial attack and defense, with more CAPTCHA designers employing adversarial attack to resist cracking, future cracking solutions are expected to employ adversarial defense techniques for self-enhancement. We hope this study could draw the attention of future CAPTCHA designing to the competition between adversarial attack and defense. Moreover, with the development of deep learning and other AI algorithms, we are confronted with critical security-related problems when algorithms are maliciously utilized towards human. In this case, it is necessary to be aware of the limitations of current algorithms and appropriately employ them to resist the abuse of algorithms.

REFERENCES

[1] A. M. Turing, "Computing machinery and intelligence," Mind, vol. 59, no. 236, p. 433, 1950.
[2] M. Naor, "Verification of a human in the loop or identification via the turing test," 1996.
[3] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "recaptcha: Human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
[4] A. A. Chandavale, A. M. Sapkal, and R. M. Jalnekar, "Algorithm to break visual captcha," in 2009 International Conference on Emerging Trends in Engineering & Technology, 2009, pp. 258–262.
[5] G. Mori and J. Malik, "Recognizing objects in adversarial clutter: Breaking a visual captcha," in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2003, pp. I–I.
[6] S. Sivakorn, I. Polakis, and A. D. Keromytis, "I am robot: (deep) learning to break semantic image captchas," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), 2016, pp. 388–403.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1026–1034.
[8] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio set: An ontology and human-labeled dataset for audio events," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780.
[9] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "Squad: 100,000+ questions for machine comprehension of text," 2016.
[10] M. R. Ogiela, N. Krzyworzeka, and L. Ogiela, "Application of knowledge-based cognitive captcha in cloud of things security," Concurrency and Computation: Practice and Experience, vol. 30, no. 21, p. e4769, 2018.
[11] D. Geman, S. Geman, N. Hallonquist, and L. Younes, "Visual turing test for computer vision systems," Proceedings of the National Academy of Sciences, vol. 112, no. 12, pp. 3618–3623, 2015.
[12] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in 2014 International Conference on Learning Representations (ICLR), 2014.
[13] Q. Liu, L. Wang, and Q. Huo, "A study on effects of implicit and explicit language model information for dblstm-ctc based handwriting recognition," in 2015 International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 461–465.
[14] T. M. Breuel, "High performance text recognition using a hybrid convolutional-lstm implementation," in 2017 IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, 2017, pp. 11–16.
[15] M. Jenckel, S. S. Bukhari, and A. Dengel, "Transcription free lstm ocr model evaluation," in 2018 International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018, pp. 122–126.
[16] H.-R. Shin, J.-S. Park, and J.-K. Song, "Ocr for drawing images using bidirectional lstm with ctc," in 2019 IEEE Student Conference on Electric Machines and Systems (SCEMS 2019), 2019, pp. 1–4.
[17] Y. Zi, H. Gao, Z. Cheng, and Y. Liu, "An end-to-end attack on text captchas," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 753–766, 2019.
[18] Y.-W. Chow, W. Susilo, and P. Thorncharoensri, "Captcha design and security issues," in Advances in Cyber Security: Principles, Techniques, and Applications, 2019, pp. 69–92.
[19] H. Kwon, Y. Kim, H. Yoon, and D. Choi, "Captcha image generation systems using generative adversarial networks," IEICE Transactions on Information and Systems, vol. 101, no. 2, pp. 543–546, 2018.
[20] M. E. Hoque, D. J. Russomanno, and M. Yeasin, "2d captchas from 3d models," in Proceedings of the IEEE SoutheastCon 2006, 2006, pp. 165–170.
[21] C. R. Macias and E. Izquierdo, "Visual word-based captcha using 3d characters," 2009.
[22] V. D. Nguyen, Y.-W. Chow, and W. Susilo, "On the security of text-based 3d captchas," Computers & Security, vol. 45, pp. 84–99, 2014.
[23] Q. Ye, Y. Chen, and B. Zhu, "The robustness of a new 3d captcha," in 2014 IAPR International Workshop on Document Analysis Systems, 2014, pp. 319–323.
[24] V. D. Nguyen, Y.-W. Chow, and W. Susilo, "Breaking an animated captcha scheme," in International Conference on Applied Cryptography and Network Security, 2012, pp. 12–29.
[25] D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang et al., "A generative vision model that trains with high data efficiency and breaks text-based captchas," Science, vol. 358, no. 6368, p. eaag2612, 2017.
[26] G. Ye, Z. Tang, D. Fang, Z. Zhu, Y. Feng, P. Xu, X. Chen, and Z. Wang, "Yet another text captcha solver: A generative adversarial network based approach," in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 332–348.
[27] Y. Lv, F. Cai, D. Lin, and D. Cao, "Chinese character captcha recognition based on convolution neural network," in 2016 IEEE Congress on Evolutionary Computation (CEC), 2016, pp. 4854–4859.
[28] J. Chen, X. Luo, Y. Liu, J. Wang, and Y. Ma, "Selective learning confusion class for text-based captcha recognition," IEEE Access, vol. 7, pp. 22246–22259, 2019.
[29] N. Papernot, P. McDaniel, and I. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," arXiv preprint arXiv:1605.07277, 2016.
[30] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in 2015 International Conference on Learning Representations (ICLR), 2015.
[31] A. Kurakin, I. J. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," in 2017 International Conference on Learning Representation Workshop, 2017.
[32] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, "Boosting adversarial attacks with momentum," in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9185–9193.
[33] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "Deepfool: a simple and accurate method to fool deep neural networks," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2574–2582.
[34] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 39–57.
[35] A. Athalye, N. Carlini, and D. A. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," in 2018 International Conference on Machine Learning (ICML), 2018, pp. 274–283.
[36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[37] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, "Adversarial examples for semantic segmentation and object detection," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1369–1378.
[38] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 1528–1540.
[39] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 1–7.
[40] J. Li, S. Ji, T. Du, B. Li, and T. Wang, "Textbugger: Generating adversarial text against real-world applications," in 2019 Annual Network and Distributed System Security Symposium, 2019.
[41] X. Li, S. Ji, M. Han, J. Ji, Z. Ren, Y. Liu, and C. Wu, "Adversarial examples versus cloud-based detectors: A black-box empirical study," IEEE Transactions on Dependable and Secure Computing, 2019.
[42] X. Ling, S. Ji, J. Zou, J. Wang, C. Wu, B. Li, and T. Wang, "Deepsec: A uniform platform for security analysis of deep learning model," in 2019 IEEE Symposium on Security and Privacy (SP), 2019, pp. 673–690.
[43] M. Osadchy, J. Hernandez-Castro, S. Gibson, O. Dunkelman, and D. Pérez-Cabo, "No bot expects the deepcaptcha! introducing immutable adversarial examples, with applications to captcha generation," IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2640–2653, 2017.
[44] Y. Zhang, H. Gao, G. Pei, S. Kang, and X. Zhou, "Effect of adversarial examples on the robustness of captcha," in 2018 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2018, pp. 1–109.
[45] C. Shi, X. Xu, S. Ji, K. Bu, J. Chen, R. Beyah, and T. Wang, "Adversarial captchas," arXiv preprint arXiv:1901.01107, 2019.
[46] C. Shi, S. Ji, Q. Liu, C. Liu, Y. Chen, Y. He, Z. Liu, R. Beyah, and T. Wang, "Text captcha is dead? a large scale deployment and empirical study," in Proceedings of the 2020 ACM Conference on Computer and Communications Security, 2020.
[47] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2015.
[48] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, "High-performance ocr for printed english and fraktur using lstm networks," in 2013 12th International Conference on Document Analysis and Recognition, 2013, pp. 683–687.
[49] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu, "Defense against adversarial attacks using high-level representation guided denoiser," in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1778–1787.
[50] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[52] Y. Liu, X. Chen, C. Liu, and D. Song, "Delving into transferable adversarial examples and black-box attacks," in 2017 International Conference on Learning Representations (ICLR), 2017.
[53] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, "Synthesizing robust adversarial examples," in 2018 International Conference on Machine Learning (ICML), 2018, pp. 284–293.
[54] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in 2014 European Conference on Computer Vision (ECCV), 2014, pp. 818–833.
[55] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-cam: Visual explanations from deep networks via gradient-based localization," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
[56] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in 2018 International Conference on Learning Representations (ICLR), 2018.