You are on page 1of 13

IEEE TRANSACTIONS ON MULTIMEDIA, VOL.

23, 2021 2575

Robust CAPTCHAs Towards Malicious OCR


Jiaming Zhang , Jitao Sang , Kaiyuan Xu , Shangxi Wu , Xian Zhao , Yanfeng Sun , Yongli Hu ,
and Jian Yu

Abstract—Turing test was originally proposed to examine


whether machine’s behavior is indistinguishable from a human.
The most popular and practical Turing test is CAPTCHA,
which is to discriminate algorithm from human by offering
recognition-alike questions. The recent development of deep
learning has significantly advanced the capability of algorithm
in solving CAPTCHA questions, forcing CAPTCHA designers
to increase question complexity. Instead of designing questions
difficult for both algorithm and human, this study attempts to
employ the limitations of algorithm to design robust CAPTCHA
questions easily solvable to human. Specifically, our data analysis
observes that human and algorithm demonstrates different
vulnerability to visual distortions: adversarial perturbation is
significantly annoying to algorithm yet friendly to human. We are Fig. 1. Increasing content complexity of CAPTCHAs.
motivated to employ adversarially perturbed images for robust
CAPTCHA design in the context of character-based questions.
Four modules of multi-target attack, ensemble adversarial training,
image preprocessing differentiable approximation, and expectation that is indistinguishable from a human, and later developed into
are proposed to address the characteristics of character-based a form of reverse Turing test with more practical goal of distin-
CAPTCHA cracking. Qualitative and quantitative experimental guishing between computer and human. Among reverse Tur-
results demonstrate the effectiveness of the proposed solution. We ing tests, CAPTCHA (Completely Automated Public Turing
hope this study can lead to the discussions around adversarial test to tell Computers and Humans Apart) turns out the most
attack/defense in CAPTCHA design and also inspire the future
attempts in employing algorithm limitation for practical usage. well-known one used in anti-spam systems to prevent abuse use
of automated programs [2].
Index Terms—Adversarial example, CAPTCHA, OCR.
Most early CAPTCHAs, like the reCAPTCHA [3] which
I. INTRODUCTION assists in the digitization of Google books, belong to the tra-
ditional character-based scheme involving with only numbers
LAN Turing first proposed the Turing Test question “Can
A machines think like human?” [1] Turing test was initially
designed to examine machine’s exhibited intelligent behavior
and English characters. With the fast progress of machine learn-
ing especially deep learning algorithms, simple character-based
CAPTCHAs fail to distinguish between algorithm and hu-
man [4]–[6]. CAPTCHA designers were therefore forced to in-
Manuscript received December 20, 2019; revised May 5, 2020 and June 15, crease the complexity of content to be recognized. As shown
2020; accepted July 16, 2020. Date of publication August 4, 2020; date of current
version August 24, 2021. This work was supported in part by the National Key R, in Fig. 1, while the extremely complex CAPTCHAs reduce the
and D Program of China under Grant 2018AAA0100604, and in part by the Na- risks to be cracked by algorithms, it also heavily increases the
tional Natural Science Foundation of China under Grants 61632004, 61832002, burden of human recognition. It is noteworthy that the effective-
61672518, U19B2039, 61632006, 61772048, 61672071, and U1811463, and
in part by the Beijing Talents Project (2017A24), and in part by the Beijing ness of simply increasing content complexity is based on the
Outstanding Young Scientists Projects (BJJWZYJH01201910005018). The as- assumption that human has consistently superior recognition
sociate editor coordinating the review of this manuscript and approving it for capability than algorithm. The last few years have witnessed
publication was Dr. Vasileios Mezaris. (Corresponding author: Yongli Hu.)
Jiaming Zhang and Jitao Sang are with the School of Computer and In- human-level AI in tasks like image recognition [7], speech pro-
formation Technology & Beijing Key Laboratory of Traffic Data Analy- cessing [8] and even reading comprehension [9]. It is easy to
sis and Mining, Beijing Jiaotong University, Beijing 100044, China, and
also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail:
imagine that with the further development of algorithms, con-
lanzhang1107@gmail.com; jtsang@bjtu.edu.cn). tinuously increasing content complexity will reach such a critical
Kaiyuan Xu, Shangxi Wu, Xian Zhao, and Jian Yu are with the point that algorithm can recognize yet human cannot recognize.
School of Computer and Information Technology & Beijing Key Lab
of Traffic Data Analysis and Mining, Beijing Jiaotong University, Bei-
Let’s review the initial goal of CAPTCHA: to discriminate
jing 100044, China (e-mail: 15281106@bjtu.edu.cn; kirinng0709@gmail.com; human from algorithm by designing tasks unsolvable to al-
lavender.zxshane@gmail.com; jianyu@bjtu.edu.cn). gorithms. Therefore, the straightforward solution is to employ
Yanfeng Sun and Yongli Hu are with the Beijing Key Laboratory of Mul-
timedia and Intelligent Software Technology & Beijing Artificial Intelligence
the limitations of algorithms to facilitate CAPTCHA design.
Institute, Faculty of Information Technology, Beijing University of Technology, While algorithms have advanced their performance in many
Beijing 100124, China (e-mail: yfsun@bjut.edu.cn; huyongli@bjut.edu.cn). perspectives including visual/vocal recognition accuracy, they
Color versions of one or more of the figures in this article are available online
at https://ieeexplore.ieee.org.
remain some notorious limitations with regards to human. Re-
Digital Object Identifier 10.1109/TMM.2020.3013376 searchers and practitioners already employed such limitations to
1520-9210 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
2576 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

design new form of CAPTCHAs, e.g., developing cognitive [10] II. RELATED WORK
and sequentially related [11] questions to challenge algorithm’s A. Character-Based CAPTCHAs
lack of commonsense knowledge and poor contextual reasoning
ability. In online services, character-based CAPTCHAs are the most
Following this spirit, we are interested to explore the possi- popular protection to deter character recognition programs.
Since the initial goal of CAPTCHA, friendly design and security
bility of improving the robustness of CAPTCHA towards al-
gorithm cracking without changing the traditional character- of CAPTCHAs have been studied. A fundamental requirement
based scheme. In other words, is it possible to design character of CAPTCHAs necessitates that be designed to be easy for hu-
mans but difficult for computers. In traditional CAPTCHA de-
CAPTCHAs only friendly to human instead of simply increas-
ing content complexity? The key lies in finding the algorithm sign, the trade-off between usability and security is difficult to
balance. The three traditional designs are most common: back-
limitation compatible to the scheme of character image. One
ground confusion, using lines and collapsing [18]. But there
candidate is the vulnerability to visual distortions. We have con-
ducted data analysis and observed that human and algorithm are some studies that use auto-generated methods to synthe-
sis CAPTCHA images, e.g., using GANs, instead of manual
exhibit different vulnerability to visual distortions (the obser-
design [19]. These auto methods, which are applied to both
vations are detailed in Section III). This inspires us to exploit
those distortions friendly to human but obstructing algorithm to character-based CAPTCHA and image-based CAPTCHA, are
novel approaches for generating CAPTCHAs, but they still at-
pollute the original character CAPTCHA. Specifically, adversar-
tempt to increase content complexity of CAPTCHAs.
ial perturbation [12] exactly meets this requirement: adversarial
attack1 and CAPTCHA share the common intention that human To overcome the limitations of traditional character-based
CAPTCHAs, other designs have been proposed, e.g., 3D-based
is imperceptible to but algorithm is significantly affected by the
same distortion. The notorious characteristic of adversarial per- CAPTCHAs, Animated CAPTCHAs [18]. 3D approaches to
turbation for visual understanding turns out to be the desired one CAPTCHA design involve the rendering of 3D models to an
image [20], [21]. However, it has been demonstrated that this
for CAPTCHA design.
Inspired by this, we employ adversarial perturbation to de- approach is easy to attacks [22], [23]. Animated CAPTCHAs
sign robust character-based CAPTCHA in this study. Current attempt to incorporate a time dimension into the design. The
addition of a time dimension is assumed to increase the secu-
state-of-the-art cracking solution views CAPTCHA OCR (Op-
tical Character Recognition) as a sequential recognition prob- rity of the resulting CAPTCHA. Nevertheless, techniques that
lem [13]–[17]. To remove the potential distortions, further im- can successfully attack the CAPTCHAs design have been de-
veloped [24].
age preprocessing operations are typically added before OCR.
Correspondingly in this study, we propose to simultaneously at- The last few years have witnessed deep learning plays an
important role in the field of artificial intelligence. The recogni-
tack multiple targets to address the sequential recognition issue
tion rate of character-based CAPTCHAs increases year by year.
(Section IV-A), differentiably approximate image preprocessing
operations (Section IV-C) and stochastic image transformation George et al. proposed a hierarchical model called the Recursive
Cortical Network (RCN) that incorporates neuroscience insights
(Section IV-D) in the adversarial example generation process to
in a structured probabilistic generative model framework, which
cancel out their potential influence. Moreover, since we have no
knowledge about the detailed algorithm the cracking solution significantly improved the recognition rate [25]. To remove the
interference in the background, Ye et al. proposed the GAN-
used (e.g., neural network structure), the generated adversarial
examples are expected to be resistant to unknown OCR algo- based approach for automatically transforming training data
rithms in the black-box cracking. This study resorts this issue and constructing solvers for character-based CAPTCHAs [26].
The convolutional neural network shows a powerful perfor-
to ensemble adversarial training by generating adversarial ex-
amples effective towards multiple algorithms (Section IV-B). In mance in the recognition of various characters, including Chi-
summary, the contributions of this study are two-fold: nese characters [27]. But the convolutional neural network has
r We have discovered the different vulnerability between hu- low recognition accuracy in confusion class. To solve this prob-
man and algorithm on visual distortions. Based on the ob- lem, Chen et al. proposed a novel method of selective learn-
servations, adversarial perturbation is employed to improve ing confusion class for character-based CAPTCHAs recogni-
tion [28]. As the complexity of character-based CAPTCHAs in-
the robustness of character-based CAPTCHA.
r Corresponding to the characteristics of typical OCR crack- creases, the methods based on combining convolutional neural
network and recurrent neural network achieve state-of-the-art
ing solutions, we proposed a novel methodology address-
performance [13]–[17]. In this paper, we employ the archi-
ing issues including sequential recognition, indifferen-
tiable image preprocessing, stochastic image transforma- tecture consists of convolution neural network (CNN) layers
and long short-term memory (LSTM) as the default OCR algo-
tion and black-box cracking.
rithm. We also test our CAPTCHAs on the latest method [17] in
Section V-D, which is an attention-based model that also con-
sists of CNN layers and LSTM.
1 Adversarial attack refers to the process of adding small but specially crafted
perturbation to generate adversarial examples misleading algorithm. To avoid
confusion with the process of attacking CAPTCHA, in this study, we use “adver- B. Adversarial Example
sarial attack” to indicate the generation of adversarially distorted CAPTCHAs
and use “CAPTCHA crack” to indicate the attempt of passing CAPTCHA with While deep learning has achieved great performance, it also
algorithms. has some security problems. Recent work has discovered that the

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: ROBUST CAPTCHAS TOWARDS MALICIOUS OCR 2577

existing machine learning models, not just deep neural networks, These methods based on stochastic gradients lead adversarial
are vulnerable to adversarial example [12]. Given a trained clas- attack methods using a single sample of the randomness to in-
sifier F with model parameters W , a valid input x and with cor- correctly estimate the true gradient. Goodfellow et al. [30] first
responding ground truth prediction y, i.e., y = F (x) with model proposed adversarial training method, adversarial examples are
parameters W . It is often to get a similar input x is close to x ac- regarded as training samples to fit the model until these samples
cording to some distance metric d(x, x ), and cause y = F (x ) are classified correctly. The idea is effective and general for all
with model parameters W . An example x with this property types of adversarial attacks. This makes the network more robust
is known as a untargeted adversarial example. A more power- against the adversarial examples, but cost expensive computa-
ful but difficult example called targeted adversarial example is tion, especially at a large scale, e.g., the ImageNet [36] scale.
more than a misclassification example, i.e., t is a target label and In general, the existing defensive methods cannot completely
t = y, t = F (x ) with model parameters W . eliminate adversarial attacks.
Prior work that considers adversarial examples can be gener- Many researchers have found that adversarial example can be
ally classified into two categories: white-box attack and black- applied in other tasks, such as semantic segmentation [37], face
box attack. White-box attack has full knowledge of the trained detection [38], and even speech recognition [39] and transla-
classifier F including the model architecture and model param- tion [40]. The majority of the published papers have focused on
eters W . Black-box attack has no or limited knowledge of the how to eliminate the impact of adversarial examples in applica-
trained classifier F . Black-box setting is apparently harder than tion. Li et al. [41] evaluated adversarial examples among differ-
white-box setting for attackers because of the leaked gradient ent detection services, such as violence, politician, and pornog-
information. It seems that black-box attack is impossible, but raphy detection. Ling et al. [42] proposed a uniform platform for
adversarial examples that affect one model can often affect an- comprehensive evaluation on adversarial attacks and defenses in
other model, which is called transferability [29]. In the paper, application, which can benefit future adversarial examples re-
we rely on the transferability and deploy ensemble-based ap- search. In contrast, studies on employing adversarial examples
proaches to generate adversarial CAPTCHAs. against the malicious algorithm are relatively limited. Osadchy
Szegedy et al. [12] first pointed out adversarial examples and et al. [43] employed adversarial examples to design CAPTCHAs
proposed a box-constrained LBFGS method to find adversar- and analyzed security and good usability of CAPTCHAs. But
ial examples. To decrease expensive computation, Goodfellow they only considered these data types like MNIST and Ima-
et al. [30] proposed the fast gradient sign method (FGSM) to gen- geNet instead of CAPTCHA data types. Zhang et al. [44] stud-
erate adversarial examples by performing a single gradient step. ied the robustness of adversarial examples on different types of
Kurakin et al. [31] extended this method to an iterative version, CAPTCHA and gave the suggestions that how to improve the se-
and find out that adversarial examples can influence physical curity of CAPTCHA using adversarial examples. Shi et al. [45]
world. Dong et al. [32] further extended the fast gradient sign improved the effectiveness of the adversarial example by us-
method family by proposing momentum-based iterative algo- ing the Fourier transform to generate CAPTCHA images in
rithms. In addition, there are some more powerful methods called the frequency domain. However, they only considered gener-
optimization-based attack methods. Deepfool [33] is an attack ating adversarial examples on CNN systems, which is essen-
technique optimized for the L2 distance metric. This method tially the adversarial attack algorithm based on the classification
is based on the assumption that the decision boundary is partly task. In contrast, the current state-of-the-art CAPTCHA crack-
linear, then the distance and direction of the data points to the ing system consists of feature extraction module and sequential
decision boundary can be calculated approximately. C&W [34] recognition module (CNN + LSTM). Shi et al. [45] deployed
is another targeted optimization-based method. It achieves its character-based adversarial CAPTCHAs on a large-scale online
goal by increasing the probability of target label. platform and tested the proposed CAPTCHAs on convolutional
To defend against adversarial examples, several adversarial recurrent neural networks [46]. However, they ignored experi-
defensive methods have been proposed, which has been an ac- ments and discussions on adversarial defense technologies, such
tive field of AI research. Referring to [35], we generally di- as image binarization and adversarial training. In Section V-D,
vide adversarial defensive methods into two categories. Atha- we compare our method with ACs [45] to prove that considering
lye et al. [35] identify gradient masking, or called obfuscated the sequential recognition is essential. In Section V-B, we show
gradients, which leads to a false sense of security in defenses the necessity of considering image preprocessing.
against adversarial examples. The authors addressed that the rea-
son why many adversarial defenses can defend against adver-
sarial examples is that the fast and optimization-based methods III. DATA ANALYSIS
cannot succeed without useful gradient information. The most To justify the feasibility of employing algorithm limitations
common gradient masking methods include input transforma- for CAPTCHA design and motivate our detailed solution, this
tion and stochastic gradients. Input transformation techniques, section conducts data analysis to answer two questions: (1)
e.g., image cropping and image binarization, cause the gradients Whether human and algorithm have different vulnerability to
to be non-existent or incorrect. In this paper, image binarization visual distortion? (2) What characteristics to consider when em-
will definitely result in non-differentiable if gradient masking is ploying distortions to design robust CAPTCHA?
not overcome. Some adversarial defense methods cause the net- Text-based CAPTCHA is the most widely deployed scheme
work itself is randomized or the input is randomly transformed. requiring subjects to recognize characters from 0-9 and A-Z.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
2578 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

Fig. 2. Human v.s. algorithm vulnerability analysis results on Gaussian and adversarial distortions.

Due to its simplicity, character-based CAPTCHA is very ef- segmentation-based approach for OCR works by segmenting a
fective to examine the robustness towards cracking algorithm text line image into individual character images and then rec-
as well as friendliness to human. Therefore, this study em- ognizing the characters [47]. The resultant average recognition
ploys character-based CAPTCHA as the example scheme to accuracies for Gaussian and adversarially distorted CAPTCHAs
conduct data analysis, develop solution and implement experi- are shown in Fig. 2(c) and (d). We can see that, for Gaussian dis-
ments. Specifically, during data analysis, we assume that each torted CAPTCHAs, human’s recognition accuracy consistently
CAPTCHA question is constituted by single character in an declines as the distortion level increases, indicating that Gaus-
RGB image with unique resolution of 48 × 64 px. The char- sian white noise tends to undermine human’s vision. On the con-
acter font is fixed as DroidSansMono. The remainder of the sec- trary, the examined OCR algorithm demonstrates good immu-
tion will report the observations regarding human and character nity to Gaussian white noise, possibly due to the noise removal
recognition performance in different scenarios. effect by multiple convolutional layers [48]. It is easy to imagine
that if we design CAPTCHA by adding Gaussian white noise,
as the noise level increases, the resultant CAPTCHAs will crit-
A. Vulnerability Analysis to Visual Distortion ically confuse humans instead of obstructing the cracking OCR
This subsection designs character recognition competition be- algorithms.
tween human and algorithm to analyze their vulnerability to vi- For adversarially distorted CAPTCHAs, we observed quite
sual distortions. We employed two types of visual distortions: opposite recognition results. Fig. 2(d) shows that humans are
(1) Gaussian white noise is one usual distortion to generate more robust to the adversarial perturbations, while OCR algo-
CAPTCHAs. In this study, the added one-time Gaussian white rithm is highly vulnerable as the adversarial distortion increases.
noise follows normal distribution with mean μ̃ = 0, variance This is not surprising since adversarial perturbation is specially
σ̃ = 0.01 and constant power spectral density. (2) Adversarial crafted to change the algorithm decision under the condition of
perturbation has been recognized as imperceptible to human but not confusing human. This characteristic of adversarial perturba-
significantly confusing algorithm. We employ the widely used tion demonstrates one important limitation of algorithm regards
F GSM [30] to add adversarial perturbation, where one-time to human ability, which perfectly satisfies the requirement of ro-
perturbation is constituted with step size of 0.02. To examine bust CAPTCHA: algorithm tends to fail, while human remains
the change of recognition performance with increasing distor- successful. Therefore, we are motivated to employ adversarial
tion difficulty, we added 8 levels of distortions onto the origi- examples to design robust CAPTCHA to distinguish between
nal character images accumulatively: each level corresponds to algorithm and human.
5 one-time Gaussian white noises and adversarial perturbations
respectively. Examples for derived distorted CAPTCHA images
in different levels are illustrated in Fig. 2(a) and (b).
Regarding the human side, we recruited 77 master work- B. Characteristics Affecting Robust CAPTCHA Design
ers from Amazon Mechanical Turk (MTurk). Each subject was The previous subsection observes that adversarial pertur-
asked to recognize 450 character CAPTCHAs with Gaussian bation is effective to mislead state-of-the-art OCR algorithm,
and adversarial distortions in different levels respectively. Re- which shows its potential to be employed to design robust
garding the algorithm side, we employed the state-of-the-art CAPTCHA. However, typical CAPTCHA cracking solution in-
OCR (Optical Character Recognition) algorithm, which is the volves beyond OCR, e.g., image preprocessing operations like

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: ROBUST CAPTCHAS TOWARDS MALICIOUS OCR 2579

one specific OCR model. To address the above characteristics


of CAPTCHA cracking, our proposed CAPTCHA generation
framework consists of three modules: multi-target attack, en-
semble adversarial training, and image preprocessing differen-
tiable approximation. The proposed framework and its relation
to CAPTCHA cracking are illustrated on the right of Fig. 4.

A. Multi-Target Attack Towards Sequential Recognition


Typical CAPTCHAs usually contain more than one character
for recognition, e.g. the example CAPTCHAs contain 4 charac-
ters. Therefore, state-of-the-art CAPTCHA cracking solutions
are forced to address a sequential character recognition prob-
lem at the OCR stage [47]. Specifically, OCR stage consists of
three sub-modules as feature extraction, sequential recognition,
and output decoding. Feature extraction is basically realized by
a convolutional neural network to encoding the input image as
neural feature. Sequential recognition is typically realized by a
recurrent neural network to process the issued image neural fea-
ture and output multiple tokens including characters (0-9, A-Z)
and blank token ∅.2 Output decoding serves to transform the se-
quential tokens into final character recognition results, by merg-
Fig. 3. The affection of image preprocessing: (a) distorted characters before
(top row) and after (bottom row) image binarization and (b) recognition accuracy ing sequentially duplicated tokens and removing blank ∅ tokens.
on adversarially distorted characters. For example, the original token sequence “aa∅b∅∅ccc∅dd” will
be transformed to “abcd”.
While CAPTCHA cracking views OCR as a sequential recog-
nition problem, standard adversarial perturbation is designed to
binarization and Gaussian filtering will be applied to remove dis-
attack single target. In this study, we propose to attack multiple
tortions before issuing to the OCR module. Fig. 3(a) illustrates
targets corresponding to the multiple tokens derived from OCR
the adversarially distorted CAPTCHA images before and after
sequential recognition. The generated adversarial CAPTCHA
binarization preprocessing. It is easy to conceive that the effec-
image is expected to simultaneously misclassify all the char-
tiveness of adversarial perturbation will be critically affected by
acter tokens. For specific token sequence t, all the characters
image preprocessing operations.
appearing in t constitute the original set Θ, while the remain-
We further quantified this affection by analyzing the OCR
ing characters from (0-9, A-Z) constitute the adversary set Θ̄.
performance on the same adversarially distorted CAPTCHA
Denoting the raw image as x and the corresponding adversary
images from previous subsection. The recognition accuracies
image as x , the multi-target attack is formulated as the following
on the CAPTCHAs before and after binarization preprocessing
optimization problem:
are plotted and compared in Fig. 3(b). It is shown that after
removing most distortions via image binarization, OCR algo- 
min d(x, x 
) + λ · [max F (x )θj i − F (x )θθ̄i ]+ (1)
rithm demonstrates basically stable performance in recognizing x
j=θ̄i i
θi ∈Θ
CAPTCHAs with different levels of adversarial perturbation.
This tells us that standard adversarial perturbation is insufficient where d(·, ·) is distance function to minimize the modification
to obstruct the cracking method. It is necessary to design the from x to x3 , λ is the weight parameter balancing between
robust CAPTCHA solution considering the characteristics (like the image modification and the misclassification confidence in
preprocessing operations) of CAPTCHA cracking method. the second term. Within the second term, θi is the character
appearing in the original set Θ, θ̄i is its one-to-one mapping
IV. METHODOLOGY character in the adversary set Θ̄, F (x )θi denotes the output of
the second-to-last layer, the logits, corresponding to token θi af-
As shown on the left of Fig. 4, typical cracking of character-
ter sequential recognition, F (x )θj i denotes its j dimension, and
based CAPTCHA consists of two stages as image preprocessing
[f ]+ is the positive part function denoting max(f, 0). Note that
and OCR. The above data analysis has demonstrated that im-
the one-to-one mapping from θi to θ̄i can be either random or
age preprocessing has the effect of distortion removal, making
it not possible to straightforwardly employ adversarial pertur-
bation for robust CAPTCHA design. In addition to the image
2 For typical 4 character-based CAPTCHAs, recurrent neural network usu-
preprocessing stage, the OCR stage also possesses character-
istics obstructing CAPTCHA: (1) sequential recognition, dis- ally outputs 12-token sequence to improve tolerance for segmentation and align-
ment [47].
abling the traditional single character-oriented adversarial per- 3 Alternative choices for the distance function are allowed. In our experiment,
turbation; and (2) black-box crack, making it ineffective to attack we use L2 distance.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
2580 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

Fig. 4. The proposed robust CAPTCHA designing framework. The left represents the process of CAPTCHA cracking, including sequential recognition, fea-
ture extraction, image binarization (Gaussian filtering) and stochastic transformation. The right represents our solution of CAPTCHA generation, including the
corresponding multi-target attack, ensemble adversarial training, differentiable approximation and expectation, respectively.

fixed. Random one-to-one mapping leads to targeted adversar- Specifically, given K white-box OCR models with their cor-
ial attack, and fixed mapping leads to non-targeted adversarial responding the output of the second-to-last layer as J1 , . . ., JK ,
attack.4 we re-formulate the objective function in Eq. (1) by replacing
When the original set Θ contains only one character, the F (x ) with F̃ (x ) defined as follows:
multi-target attack reduces to single-target attack as the stan-
dard adversarial perturbation. In fact, according to the mecha- K
nism of output decoding in CAPTCHA cracking, we only need 
F̃ (x ) = αk Jk (x ) (2)
to misclassify any one of the character tokens to invalid the fi-
k=1
nal recognition result. The above equation in Eqn. (1) provides a
general case of attacking flexible numbers of character tokens. In 
practice, the number of attacked characters is one important pa- where αk is the ensemble weight with K k=1 αk = 1. In most
rameter to control the model performance. More attacked char- cases, αk = 1/K except that one model is more important than
acters guarantee higher success rate to resist crack, yet leading others. Among the three sub-modules of OCR stage, feature
to more derived distortions and human recognition burden. The extraction has the most model choices (e.g. various CNN struc-
quantitative influence of attacked character number on the im- tures as GoogLeNet [49], ResNet [50]) which can be easily im-
age distortion level and algorithm recognition rate is discussed plemented into different CAPTCHA cracking solutions. There-
in Section V-C. fore, this study addresses the black-box cracking issue by attack-
ing multiple feature extraction models. Specifically, the training
B. Ensemble Adversarial Training Towards Black-Box Crack data and basic structure of Ji (x ) and F (x ) are identical ex-
cept for the different CNN structures in the feature extraction
As mentioned in Section I, CAPTCHA cracking may em-
sub-module. On the number of CNN structures, the larger the
ploy multiple OCR algorithms for character recognition. At the
value of K, the stronger the generalization capability of the
stage of designing CAPTCHA, it is impractical to target one
derived adversarial CAPTCHA images. However, an excessive
specific OCR algorithm, which requires to design adversarial
K value will lead to high computational complexity and trivial
CAPTCHA images that are effective to as many OCR algo-
weight αk to underemphasize single model. Referring to previ-
rithms as possible. Fortunately, it is recognized that adversarial
ous studies on ensemble adversarial attack [51], 3 ∼ 5 models
perturbation is transferable between models: if an adversarial
achieve a good balance between transferability and practicality.
image remains effective for multiple models, it is more likely to
In this study, we select K = 3 and evenly set αk = 1/3. The
transfer to other models as well [29]. Inspired by this, in order
experimental results in [51] show that under the same training
to improve the resistance to unknown cracking models, we pro-
set, the adversarial examples can achieve stronger transferability
pose to generate adversarial images simultaneously misleading
when the network structure is more similar, and it is reasonable
multiple models.
to choose the model with large structure difference to employ
ensemble adversarial training. The performance of employing
4 The reported experimental results in Section V are based on random one- ensemble adversarial training to resist different OCRs is reported
to-one mapping. in Section V-D.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: ROBUST CAPTCHAS TOWARDS MALICIOUS OCR 2581

C. Differentiable Approximation Towards that these adversarial examples fail to remain effective under
Image Preprocessing image transformations, and the gradient of the expected value
The data observations in Section III-B demonstrate the dis- can increase adversarial even under various image transforma-
tions [52]. To integrate image preprocessing issues with potential
tortion removal consequences from binarization operation, re-
transformations, we compute expectation of stochastic image
quiring us to consider the affection of image preprocessing in
adversarial image generation. To address this, we regard image transformation including different angles rotation. The expecta-
tion allows us for the construction of adversarial examples that
preprocessing operation as part of the entire end-to-end solu-
remain adversarial over a chosen transformation distribution T .
tion so that we can generate corresponding adversarial images
effectively to mislead the whole cracking solution. More concretely, we use a chosen distribution T of transforma-
tion functions t taking an input x controlled by the adversary
According to the usability to be incorporated into the end-to-
end solution, image preprocessing operations can be roughly to the true input t(x) perceived by the OCR rather than op-
divided into two categories as either differentiable or non- timizing the objective function of a single example. We then
re-formulate the second term in Eq. (4) by replacing x, x with
differentiable. For each category, we select one representative
operation to address in this study, i.e., Gaussian filtering and t(x), t(x ) defined as follows:
image binarization. Regarding the differentiable Gaussian fil- 
 1 − x
2
Et∼T [max F̃ (φ(t(x )))θj i − F̃ (φ(t(x ))))θθ̄i ]+ (5)
tering operation, g(x )= √2πσ , we can readily incorporate it
e 2σ2
θi ∈Θ
j=θ̄i i

into the OCR model (Eq. (1), Eq. (2)) by replacing the input
image x with the preprocessing image g(x ). Both forward and Furthermore, rather than simply taking the d(·, ·) to constrain
backward propagation are conducted on the replaced function the solution space, we instead aim to constrain the expected
F (g(x )), leading to the generated adversarial images expected effective distance between the adversarial and original inputs.
to eliminate the affection from Gaussian filtering. The first term in Eq. (4) is replaced by d(t(x), t(x )) defined as
Regarding the non-differentiable image binarization, we can- follows:
not straightforwardly incorporate it into the objective function.
Instead, we find a differentiable approximation s(x ) to image Et∼T [d(t(x), t(x ))] (6)
binarization and incorporate the approximated function into the
end-to-end solution. In this study, s(x ) is defined as follows: In practice, the distribution T can model perceptual transforma-
1 tion such as color change, image translation, random rotation,
s(x ) = x −τ
(3) or addition of noise. These transformations amount to a set of
1 + e− ω
the random linear combinations, which are more thoroughly de-
where τ denotes the threshold of image binarization, ω denotes scribed in Section V-E. Then we can approximate the gradient of
the degree of lateral expansion of the curve. Note that to guaran- the expected value through sampling transformations indepen-
tee that the generated adversarial images are resistant to im- dently at each gradient descent step in optimizing the objective
age binarization, we only employ the approximated s(x ) at function and differentiating through the transformation. Given
the backward propagation stage to update the generated image, its ability to generate robust adversarial CAPTCHA images, we
while the forward propagation still use the actual x to calculate use the gradient of the expected value to directly eliminate the
∇x F (x). affection from stochastic transformation for differentiable im-
To simultaneously resist to the affections from Gaussian fil- age transformation. But regarding some non-differentiable im-
tering and image binarization, we concatenate s(·) and g(·) in age transformation, we cannot straightforwardly differentiate
the final objective function. Therefore, the overall optimization through the transformation. Instead, we can use the same cate-
problem incorporating the three proposed modules is as follows: gory as mentioned in Section IV-C that finding a differentiable
 + approximation and incorporate the approximated function into
 the end-to-end solution.
  θi  θi
min
d(x, x ) + λ · max F̃ (φ(x ))j − F̃ (φ(x )))θ̄
x j=θ̄i i
θi ∈Θ
(4) V. EXPERIMENTS
where F̃ (·) denotes the ensemble of multiple OCR models de-
fined in Eq. (2), and φ(x ) = s(g(x )) denotes the approximated We examined CAPTCHA images with 4 characters for ex-
image preprocessing operations defined in Eq. (3). periments. The CAPTCHAs are RGB images with resolution
of 192 × 64 px. Regarding the cracking method, we consid-
ered image binarization and Gaussian filtering (kernel size:
D. Expectation Towards Stochastic Image Transformation 3 × 3, σ = 0.8) at the image preprocessing stage. The OCR
The above three subsections are more than enough for gen- stage is instantiated with CNN structures for feature extraction
eral CAPTCHAs generation towards OCR cracking solution. and LSTM+softmax for sequential recognition. Regarding our
However, a potential cracker could use a number of transforma- proposed CAPTCHA generation method, image binarization is
tions to make the adversarial perturbations meaningless, e.g., approximated with τ = 0.8, ω = 0.05, and 4 CNN structures are
the cracker could slightly rotate the image, doing so entirely employed for ensemble adversarial training. All experiments are
bypasses general adversarial examples. Prior work has shown conducted on Nvidia GTX 1080Ti GPU with 11 G memory.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
2582 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

Fig. 5. Example images (top row) and theri attention maps (bottom row). From
left to right, we show the original image, the image with Gaussian white noise,
the adversarial generated by our method and the adversarial image generated by Fig. 6. Example CAPTCHAs with different complexity levels (from top to
our method but without considering image preprocessing. bottom: easy, medium, hard). Each row from left to right shows the differ-
ent settings of Raw, rCAPTCHA_parallel, rCAPTCHA_w/o preprocessing and
rCAPTCHA.

A. Qualitative Attention Analysis


Visual attention has been widely used to explain which region r rCAPTCHA: the proposed solution to generated adversar-
of image contributes much to the model decision [53]. In this
ial images, considering both sequential recognition and im-
study, we extracted the attention map using Grad-CAM [54] to
age preprocessing operations.
understand the change of recognition performance under differ-
The state-of-the-art cracking solution is trained over 20,000
ent visual distortions.
CAPTCHA images with batch size 128. To examine the appli-
The first and second columns of Fig. 5 visualize the atten-
cation scope of the proposed CAPTCHA generation methods,
tion map of the raw image and the image with Gaussian white
we conducted experiments on the CAPTCHAs with three lev-
noise. It can be found that Gaussian white noise brings trivial
els of complexities: easy, medium, hard. Fig. 6 shows exam-
attention change from the original image. Both attention maps
ples of different complexity levels of CAPTCHAs in the above
keep to the region where characters exist. This well explains the
four settings. For each of the settings, we selected/generated
data observations in Section III-A that algorithm is generally
500 CAPTCHA images for testing, and summarize the de-
robust to Gaussian white noise. We also visualized the attention
rived average recognition accuracy in Table I. Experimental
map of the CAPTCHA images generated from our proposed
observations include: (1) By adding adversarial perturbations,
method on the third column of Fig. 5. It is shown that the atten-
the right 3 columns consistently obtain lower accuracies than
tion maps deviate much from the original image and focus on
the first column, showing the usability of employing adversar-
unrelated regions where there exist no characters. This justifies
ial perturbations in resisting CAPTCHA cracking. (2) With-
our motivation to employ adversarial perturbation to mislead the
out considering the sequential recognition or image preprocess-
algorithm prediction result, and demonstrates the effectiveness
ing characteristics, the resisting effect of rCAPTCHA_parallel
of our proposed CAPTCHA design method.
and rCAPTCHA_w/o preprocessing is not as obvious as that
To further validate the necessity of considering image pre-
of rCAPCHA. This validates the necessity of multi-target at-
processing in robust CAPTCHA design, the attention maps for
tack and differentiable approximation modules. (3) Regarding
the images generated from our method but without considering
CAPTCHAs with different complexities, we observed consis-
image preprocessing are shown in the fourth column of Fig. 5
tent phenomenon among the four settings, demonstrating the
for comparison. It is easy to conceive that without considering
wide application scope of the proposed CAPTCHA generation
image preprocessing, the generated images fail to deviate the
method.
attention from the character regions. This is consistent with the
The notable decrease in algorithm recognition accuracies
fact that image preprocessing has the effect of weakening or
shows the effectiveness of employing adversarial perturbation
eliminating adversarial perturbation.
to mislead cracking solution. To facilitate the correlation under-
standing between misleading cracking solution and friendly to
B. Quantitative Performance Comparison human recognition, we also provide the human recognition ac-
To compare the performance of the proposed robust curacy for each experimental setting in Table I. Similar to the
CAPTCHA (rCAPTCHA) designing method, we report the data analysis, we have recruited 164 workers from MTurk to
recognition accuracies of state-of-the-art cracking solution un- recognize 4 character-based CAPTCHA images. The reported
der the following settings: accuracies are averaged over 1,200 CAPTCHAs. By comparing
r Raw: the original CAPTCHA images without adding ad- different rows, it is shown that the increasing content complex-
versarial perturbations; ity brings slight decrease of algorithm recognition accuracy but
r rCAPTCHA_parallel: the proposed solution to generated causes huge trouble to human recognition. Among different set-
adversarial images, expect that the sub-module of OCR is ting columns, while the algorithm recognition accuracy fluctu-
replaced by 4 parallel recognition networks (each realized ates a lot, the human recognition performance basically remains
by one fully-connected layer) to address one character’s stable, validating the different distortion vulnerability between
recognition; human and algorithm. In summary, regarding CAPTCHA im-
r rCAPTCHA_w/o preprocessing: the proposed solution to ages with different complexities, the proposed CAPTCHA gen-
generated adversarial images, but without considering the eration method succeeds to invalid the cracking algorithm with-
image preprocessing stages; out increasing human recognition burden.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: ROBUST CAPTCHAS TOWARDS MALICIOUS OCR 2583

TABLE I
THE RECOGNITION OF DIFFERENT COMPLEXITY LEVELS OF CAPTCHAS IN THE DIFFERENT SETTINGS. THE RESULTS OF ALGORITHMS ARE
OBTAINED AFTER GAUSSIAN FILTERING AND IMAGE BINARIZATION

Fig. 7. The influence of λ on derived image distortion and cracking recognition Fig. 8. The influence of |Θ| on derived image distortion and cracking recog-
accuracy. nition accuracy.

C. Parameter Influence Analysis


prevent annoying human subjects in recognizing the generated
The proposed robust CAPTCHA generation method mainly CAPTCHAs, an appropriate λ is selected with moderate image
involves with two parameters: the weight parameter λ in Eq. (1) distortion and guaranteed cracking resistant performance. Our
and the number of attacked characters |Θ|. As introduced in experimental results reported in Section V-B are based on λ =
the methodology, the weight parameter λ controls the relative 20.
importance between the visual distortion and misclassification Regarding the number of attacked characters, we set |Θ| to
confidence. We adjusted λ within range of [10,30] with step of 1 {1, 2, 3, 4} respectively and examined the corresponding aver-
and examined its influence on the derived image distortion and aged image distortion and algorithm recognition accuracy in
cracking recognition accuracy. The image distortion is measured Fig. 8. As shown in Fig. 8(a), with |Θ| increases, more image
as the sum of squared pixel-wise difference between the original distortion is needed to misclassify the characters. In Fig. 9 we
and adversarial images. The averaged distortion and recognition show example CAPTCHAs generated by rCAPTCHA with dif-
accuracy with the change of λ are drawn in Fig. 7. It is shown ferent levels of image distortions. Combing with Fig. 8(a), it is
that as λ increases, more image distortion is observed in the demonstrated that using the proposed rCAPTCHA method to at-
derived CAPTCHA images and cracking method tends to fail tack even all 4 characters, the derived CAPTCHAs are generally
in recognizing the generated CAPTCHAs. This is consistent friendly to human and not bringing extra recognition burden. As
with the definition of λ in Eq. (1). In practical applications, to shown in Fig. 8(b), the increase of |Θ| enhances the confidence

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
2584 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

TABLE II
TRANSFERABILITY OF ADVERSARIAL IMAGES GENERATED BETWEEN PAIRS OF MODELS. THE ELEMENT (i, j) REPRESENTS THE ACCURACY OF
THE ADVERSARIAL IMAGES GENERATED FOR MODEL i (ROW) TESTED OVER MODEL j (COLUMN)

mini-GoogLeNet consists of two Inception modules with six


convolutional layers.
Three models of 4ConvNet, mini-ResNet, mini-DenseNet are
selected as white-boxs, with the remaining mini-GoogLeNet
model and mini-GoogLeNet w/ attention as the black-boxs. The
black-box models are regard as the potential OCR cracking to
simulate the alternative cracking choices in real-world appli-
Fig. 9. Example CAPTCHAs with different image distortions: from left to cations. Averaged over 100 tested CAPTCHAs, Table II sum-
right shows images with distortions of 100, 200, 300, and 400. marizes the black-box cracking recognition accuracy under dif-
ferent training-testing pairs. For example, the value of 91% at
element (1,1) represents the recognition accuracy of original
to mislead the cracking algorithm and obtains consistently lower CAPTCHA images on 4ConvNet. The value of 0% at element
recognition accuracy. With the introduction of multi-attack to- (8,1) indicates the recognition accuracy trained with ensembled
wards sequential recognition, the proposed rCAPTCHA method 3 white-box models and tested on 4ConvNet. Lower accuracy
possess the flexibility to attack arbitrary number of characters. value means superior resistant performance to cracking solu-
In our experiments, to guarantee the resistance capability, we tions and better transferability of the method.
fixed |Θ| = 4. In the top half of Table II (ACs), the adversarial CAPTCHAs
without considering the sequential recognition obtain higher av-
erage accuracies than rCAPTCHA. The accuracies of ACs are
D. Robustness Towards Different OCRs no lower than 60% when image preprocessing and image trans-
To compare the generalization and transferability of our pro- formation are not involved. It is expected that the employing
posed rCAPTCHA method and ACs [45], we implemented dif- image preprocessing and image transformation can increase the
ferent cracking methods and examined their recognition accu- accuracy, and 60% means that these adversarial CAPTCHAs are
racy on the generated CAPTCHAs. For generating CAPTCHAs almost recognized by the OCR. This demonstrates when exclud-
of our method, we respectively trained 3 OCR models with dif- ing controlled variables, considering the sequential recognition
ferent CNN structures, which are denoted as 4ConvNet, mini- problem is more important than adopting Fourier transform dur-
ResNet and mini-DenseNet. For generating CAPTCHAs of [45], ing the generation of adversarial CAPTCHAs. In the bottom
we trained the same OCR as above, except that the sub-module half of Table II (rCAPTCHA), we can observe that the adversar-
(LSTM) of OCR is replaced by 4 parallel recognition networks ial images generated with one model perform well on their own
(fully-connected layer). For testing CAPTCHAs, we trained 2 models but generally perform poorly on other models. However,
OCR models. One is the same OCR with CNN structure of if we generate the CAPTCHA images with ensemble training of
mini-GoogLeNet, the other is mini-GoogLeNet w/ attention [17], 3 models, the testing recognition accuracies for all 5 models
which also uses mini-GoogLeNet but adopts the attention mech- are no higher than 7%. This demonstrates the transferability of
anism in LSTM. 4ConvNet uses four convolutional layers for the proposed rCAPTCHA method in employing ensemble train-
feature extraction. The LSTM input required a fixed-size fea- ing towards black-box cracking. Specially, the value of 7% at
ture vector, so we modified the native network. mini-XNets element (8, 4) demonstrates our method can perform well on
are employed due to the quicker convergence times and low GoogLeNet w/ attention (GoogLeNet w/). The reason is that the
resolution of CAPTCHA images: mini-ResNet consists of five network structure of the current state-of-the-art OCR model is
ResBlocks and two convolutional layers, mini-DenseNet con- similar (CNN + LSTM). We also generate the adversarial images
sists of four DenseBlocks with four convolutional layers, and on GoogLeNet w/ attention (the last row), which validates the

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: ROBUST CAPTCHAS TOWARDS MALICIOUS OCR 2585

TABLE III
DISTRIBUTION OF TRANSFORMATIONS

Fig. 10. Example CAPTCHAs with different complexity levels (from top to
bottom: easy, medium, hard). Each row from left to right shows the different
settings of image transformation.

TABLE IV of rotation-invariance. The increase from 0.0%, 0.0%, 4.0% to


THE RECOGNITION OF RAW IMAGES AND ADVERSARIAL IMAGES. THE RESULT 1.3%, 5.4%, 9.0% means that stochastic transformation partially
ARE OBTAINED AFTER STOCHASTIC TRANSFORMATION plays a role in eliminating effectiveness of adversarial examples.
Because the stochastic transformation is a kind of adversarial
defense technology, more specific discussions about adversarial
defense are reported in Section V-F. The reason that algorithm
recognition accuracy increases along with complexity levels of
CAPTCHAs is the noises associated with hard CAPTCHA im-
ages mix with the adversarial perturbation and thus adversarial
perturbation will be weakened as the image gets more complex.
generalization of our proposed mechanism. It is expected with By comparing the results in Table I, the larger the “approxima-
more models implemented in ensemble training, the resistant tion” component is, the more we can detect this trend.
performance towards arbitrary black-box cracking methods will Fig. 10 shows example CAPTCHAs of different complexity
be guaranteed. In practical applications, we can carefully select levels after stochastic transformations, combing with images of
white-box models with typically different structures to improve non-transformations, which observes that these images are still
the generalization and transferability to specific models. adversarial after different image transformations. This demon-
strates that our approach is effective in generating robust adver-
E. Robustness Towards Stochastic Image Transformation sarial examples.
Nevertheless, when the number of transformations the cracker
To justify the performance of our method for stochastic im-
can choose is too large, expectation will occasionally fail to gen-
age transformation, we implemented a set of transformations on
erate an effective adversarial example. In this study, the number
different complexity levels of CAPTCHAs and examined their
we choose is 4 and the scope of transformations is also reason-
recognition accuracies on the generated CAPTCHAs. Specif-
able. Another case occurs when we replace non-differentiable
ically, we considered the distribution of transformations that
image preprocessing with differentiable approximation. Be-
includes rotation, color change, rescale, and translation of the
cause the essence of expectation is also an “approximation,”
image. We selected/generated the 1,000 images for each of the
when the “double approximation” occurs, the effectiveness of
settings, randomly chose an image transformation for each im-
adversarial example will be weakened.
age, and examined whether our method is robust over the chosen
distribution. For each adversarial example, we evaluated over
F. Discussion on Adversarial Defense
100 stochastic transformations sampled from the distribution at
evaluation time. We give the specific parameters we chose of The image non-differentiable preprocessing and stochastic
stochastic transformations in Table III, where each parameter is image transformation both are aimed at weakening the ef-
sampled uniformly at random from the specified range. These fectiveness of adversarial example, which are called adver-
stochastic transformations include color change, image transla- sarial defenses. However, the existence of robust adversarial
tion, rotation and rescale. CAPTCHAs implies that defenses based on stochastic trans-
Table IV summarizes the results. Experimental observations forming input are not secure: adversarial examples generated by
include: (1) By applying stochastic transformations, the sec- using differentiable approximation and expectation can circum-
ond row consistently obtains lower accuracies than the first vent these defenses. Prior work has shown that most of adver-
row, showing the usability of employing expectation towards sarial defenses are based on obfuscated gradients, in this study:
stochastic image transformation. (2) After stochastic transfor- (1) Image non-differentiable preprocessing is a kind of shattered
mations, the accuracies of raw images become lower than gradients, which are nonexistent or incorrect gradients caused
before transformations. The decrease in algorithm recognition either intentionally through non-differentiable operations or un-
accuracies shows that stochastic transformation does hurt per- intentionally through numerical instability; (2) Stochastic image
formance of normal image, which is caused by the loss of image transformation is a kind of stochastic gradients, which depend
information and the shortcoming of neural network, e.g., lack on test-time randomness [35]. But if the obfuscated gradient

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
2586 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

and quantitative experimental results demonstrate the effective-


ness of generated CAPTCHAs in resisting cracking algorithms.
We ascribe the main contribution not as proposing a specific
CAPTCHA system, but as introducing the idea of exploiting
algorithm unrobustness to increase the robustness of automated
system towards cracking. Character-based CAPTCHA, which is
most friendly and effective to human, to validate this idea with
the simplest scheme. This idea is expected to easily adopt to
generate robust image-based and other CAPTCHAs.
It is noted that similar to the game competition between ad-
versarial attack and defense, with more CAPTCHA designers
employing adversarial attack to resist cracking, future cracking
Fig. 11. The recognition accuracy of relax adversarial training. solutions are expected to employ adversarial defense techniques
for self-enhancement. We hope this study could draw attention
of future CAPTCHA designing on the competition between ad-
information can be approximated, it can only provide a false
versarial attack and defense. Moreover, with the development
sense of security.
of deep learning and other AI algorithms, we are confronted
There is another kind of adversarial defense technology called
with critical security-related problems when algorithms are ma-
adversarial training, which is not dependent on obfuscated gra-
liciously utilized towards human. In this case, it is necessary to
dient. Adversarial training solves a min-max game through a
get aware of the limitations of current algorithms and appropri-
conceptually simple process: train on adversarial examples un-
ately employ them to resist the abuse use of algorithms.
til the model learns to classify them correctly [30]. To further
validate the adversarial defense, we study the adversarial train-
ing approach of [55] in this subsection. For this scenario, we REFERENCES
generated/selected 500 adversarial CAPTCHA images for test- [1] A. M. Turing, “Computing machinery and intelligence-am turing,” Mind,
ing. Then we started to fine-tune the model with the 50,000 vol. 59, no. 236, pp. 433–434, 1950.
steps. However, due to the complexity of OCR model, com- [2] L. Von Ahn, M. Blum and J. Langford, “Telling humans and computer-
sapart automatically,” Commun. ACM, vol. 47, no. 2, pp. 56–60, 2004.
pared with the common CNN model, standard adversarial train- [3] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, “Re-
ing does not show the effectiveness as usual. After training with captcha: Human-based character recognition via web security measures,”
50,000 steps, the accuracy of OCR is still 0%. So we relax the Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
[4] A. A. Chandavale, A. M. Sapkal, and R. M. Jalnekar, “Algorithm to break
constraints of standard adversarial training to examine whether visual captcha,” in Proc. Int. Conf. Emerg. Trends Eng. Technol., 2009, pp.
the idea of adversarial training will work. Then we fine-tune the 258–262.
model on the same 500 adversarial images as testset. The re- [5] G. Mori and J. Malik, “Recognizing objects in adversarial clutter: Breaking
a visual captcha,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
sults are shown in Fig. 11, from combined results we make the Recognit., 2003, vol. 1, pp. 1–8.
following observations. As training data increases, adversarial [6] S. Sivakorn, I. Polakis, and A. D. Keromytis, “I am robot: (Deep) learn-
training significantly improves the accuracy of the OCR model. ing to break semantic image captchas,” in Proc. IEEE Eur. Symp. Secur.
Privacy, 2016, pp. 388–403.
But its shortcomings also obvious that the time cost is gradu- [7] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpass-
ally increasing. Moreover, if a cracker wants to use adversarial ing human-level performance on imagenet classification,” in Proc. IEEE
training, he is supposed to have access to the training dataset Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1026–1034.
[8] J. F. Gemmeke et al., “Audio set: An ontology and human-labeled dataset
we use and the parameter in algorithm we choose, e.g., distance for audio events,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process.,
function to minimize the modification and so on. 2017, pp. 776–780.
We discussed adversarial defense technology on adversarial [9] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ ques-
tions for machine comprehension of text,” 2016, arXiv:1606.05250.
CAPTCHAs we proposed. The defense technologies based on [10] M. R. Ogiela, N. Krzyworzeka, and L. Ogiela, “Application of knowledge-
obfuscated gradients cannot hinder the type of CAPTCHAs. The based cognitive captcha in cloud of things security,” Concurrency Com-
adversarial training based on non-obfuscated gradients is still put.: Pract. Experience, vol. 30, no. 21, pp. 1–11, 2018.
[11] D. Geman, S. Geman, N. Hallonquist, and L. Younes, “Visual turing test
effective but limited to practicality. for computer vision systems,” Proc. Nat. Acad. Sci., vol. 112, no. 12,
pp. 3618–3623, 2015.
[12] C. Szegedy et al., “Intriguing properties of neural networks,” in Proc. Int.
VI. CONCLUSION Conf. Learn. Representations, 2014, pp. 1–10.
[13] Q. Liu, L. Wang, and Q. Huo, “A study on effects of implicit and explicit
This study designs robust character-based CAPTCHAs to language model information for dblstm-ctc based handwriting recogni-
resist cracking algorithms by employing their unrobustness tion,” in Proc. Int. Conf. Document Anal. Recognit., 2015, pp. 461–465.
to adversarial perturbation. We have conducted data analysis [14] T. M. Breuel, “High performance text recognition using a hybrid
convolutional-LSTM implementation,” in Proc. IAPR Int. Conf. Docu-
and observed human and algorithm’s different vulnerabilities ment Anal. Recognit., 2017, vol. 1, pp. 11–16.
to visual distortions. Based on the observation, robust [15] M. Jenckel, S. S. Bukhari, and A. Dengel, “Transcription free LSTM OCR
CAPTCHA (rCAPTCHA) generation framework is introduced model evaluation,” in Proc. Int. Conf. Frontiers Handwriting Recognit.,
2018, pp. 122–126.
with three modules of multi-target attack, ensemble adversarial [16] H.-R. Shin, J.-S. Park, and J.-K. Song, “OCR for drawing images us-
training, differentiable approximation to image preprocessing, ing bidirectional LSTM with CTC,” in Proc. IEEE Student Conf. Electric
and expectation to stochastic image transformation. Qualitative Mach. Syst., 2019, pp. 1–4.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: ROBUST CAPTCHAS TOWARDS MALICIOUS OCR 2587

[17] Y. Zi, H. Gao, Z. Cheng, and Y. Liu, “An end-to-end attack on text [38] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, “Accessorize to
CAPTCHAs,” IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 753–766, a crime: Real and stealthy attacks on state-of-the-art face recognition,” in
2020. Proc. ACM Sigsac Conf. Comput. Commun. Secur., 2016, pp. 1528–1540.
[18] Y.-W. Chow, W. Susilo, and P. Thorncharoensri, “CAPTCHA design and [39] N. Carlini and D. Wagner, “Audio adversarial examples: Targeted attacks
security issues,” in Proc. Adv. Cyber Secur.: Principles, Techn., Appl., on speech-to-text,” in Proc. IEEE Secur. Privacy Workshops, 2018, pp.
2019, pp. 69–92. 1–7.
[19] H. Kwon, Y. Kim, H. Yoon, and D. Choi, “CAPTCHA image generation [40] J. Li, S. Ji, T. Du, B. Li, and T. Wang, “Textbugger: Generating adversarial
systems using generative adversarial networks,” IEICE Trans. Inf. Syst., text against real-world applications,” in Proc. Annu. Netw. Distrib. Syst.
vol. 101, no. 2, pp. 543–546, 2018. Secur. Symp., 2019, pp. 1–15.
[20] M. E. Hoque, D. J. Russomanno, and M. Yeasin, “2D CAPTCHAs from [41] X. Li, K. Yu, S. Ji, Y. Wang, C. Wu, and H. Xue et al., “Fighting against
3D models,” in Proc. IEEE SoutheastCon, 2006, pp. 165–170. deep-fake: Patch&pair convolutional neural networks,” in Proc. Compan-
[21] M. Imsamai and S. Phimoltares, “3d captcha: A next generation of the- ion Web Conf., 2020, pp. 88–89.
captcha,” in Proc. Int. Conf. Inf. Sci. Appl., 2010, pp. 1–8. [42] X. Ling et al., “Deepsec: A uniform platform for security analysis of deep
[22] V. D. Nguyen, Y.-W. Chow, and W. Susilo, “On the security of text-based learning model,” in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 673–690.
3D captchas,” Comput. Secur., vol. 45, pp. 84–99, 2014. [43] M. Osadchy, J. Hernandez-Castro, S. Gibson, O. Dunkelman, and D.
[23] Q. Ye, Y. Chen, and B. Zhu, “The robustness of a new 3D captcha,” in Pérez-Cabo, “No bot expects the deepCAPTCHA! introducing immutable
Proc. IAPR Int. Workshop Document Anal. Syst., 2014, pp. 319–323. adversarial examples, with applications to CAPTCHA generation,” IEEE
[24] V. D. Nguyen, Y.-W. Chow, and W. Susilo, “Breaking an animated Transactions. on Information. Forensics. and Security., vol. 12, no. 11,
CAPTCHA scheme,” in Proc. Int. Conf. Appl. Cryptography Netw. Se- pp. 2640–2653, Nov. 2017.
cur., 2012, pp. 12–29. [44] Y. Zhang, H. Gao, G. Pei, S. Kang, and X. Zhou, “Effect of adversarial
[25] D. Georgeet al., “A generative vision model that trains with high data examples on the robustness of CAPTCHA,” in Proc. Int. Conf. Cyber-
efficiency and breaks text-based captchas,” Science, vol. 358, no. 6368, Enabled Distrib. Comput. Knowl. Discovery, 2018, pp. 1–109.
pp. 2612–2621, 2017. [45] C. Shi et al., “Text CAPTCHA is dead? a large scale deployment and
[26] G. Ye et al., “Yet another text CAPTCHA solver: A generative adversarial empirical study,” in Proc. ACM Conf. Comput. Commun. Secur., 2020,
network based approach,” in Proc. ACM SIGSAC Conf. Comput. Commun. pp. 1–16.
Secur., 2018, pp. 332–348. [46] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network
[27] Y. Lv, F. Cai, D. Lin, and D. Cao, “Chinese character CAPTCHA recog- for image-based sequence recognition and its application to scene text
nition based on convolution neural network,” in Proc. IEEE Congr. Evol. recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11,
Comput., 2016, pp. 4854–4859. pp. 2298–2304, Nov. 2017.
[28] J. Chen, X. Luo, Y. Liu, J. Wang, and Y. Ma, “Selective learning con- [47] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, “High-
fusion class for text-based captcha recognition,” IEEE Access, vol. 7, performance OCR for printed english and fraktur using LSTM networks,”
pp. 22246–22259, 2019. in Proc. 12th Int. Conf. Document Anal. Recognit., 2013, pp. 683–687.
[29] W. Zhou et al., “Transferable adversarial perturbations,” in Proc. Eur. Conf. [48] F. Liao et al., “Defense against adversarial attacks using high-level rep-
Comput. Vision, 2018, pp. 452–467. resentation guided denoiser,” in Proc. IEEE Conf. Comput. Vis. Pattern
[30] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harness- Recognit., 2018, pp. 1778–1787.
ing adversarial examples,” in Proc. Int. Conf. Learn. Representations, [49] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf.
2015, pp. 1–10. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[31] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in [50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
the physical world,” in Proc. Int. Conf. Learn. Representation Workshop, recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016,
2017. pp. 770–778.
[32] Y. Dong et al., “Boosting adversarial attacks with momentum,” in Proc. [51] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial
IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9185–9193. examples and black-box attacks,” in Proc. Int. Conf. Learn. Representa-
[33] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A simple tions, 2017, pp. 1–10.
and accurate method to fool deep neural networks,” in Proc. IEEE Conf. [52] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing ro-
Comput. Vis. Pattern Recognit., 2016, pp. 2574–2582. bust adversarial examples,” in Proc. Int. Conf. Mach. Learn., 2018,
[34] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural pp. 284–293.
networks,” in Proc. IEEE Symp. Secur. Privacy, 2017, pp. 39–57. [53] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional
[35] A. Athalye, N. Carlini, and D. A. Wagner, “Obfuscated gradients give a networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
false sense of security: Circumventing defenses to adversarial examples,” [54] R. R. Selvaraju et al., “Grad-CAM: Visual explanations from deep net-
in Proc. Int. Conf. Mach. Learn., 2018, pp. 274–283. works via gradient-based localization,” in Proc. IEEE Int. Conf. Comput.
[36] J. Deng et al., “Imagenet: A large-scale hierarchical image database,” in Vis., 2017, pp. 618–626.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255. [55] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards
[37] C. Xie et al., “Adversarial examples for semantic segmentation and object deep learning models resistant to adversarial attacks,” in Proc. Int. Conf.
detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1369–1378. Learn. Representations, 2018, pp. 1–10.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 13,2022 at 15:14:47 UTC from IEEE Xplore. Restrictions apply.

You might also like