design new forms of CAPTCHAs, e.g., developing cognitive [10] and sequentially related [11] questions to challenge the algorithm's lack of commonsense knowledge and poor contextual reasoning ability.

Following this spirit, we are interested in exploring the possibility of improving the robustness of CAPTCHAs towards algorithm cracking without changing the traditional character-based scheme. In other words, is it possible to design character CAPTCHAs that are friendly only to humans, instead of simply increasing content complexity? The key lies in finding an algorithm limitation compatible with the scheme of character images. One candidate is the vulnerability to visual distortions. We have conducted data analysis and observed that humans and algorithms exhibit different vulnerabilities to visual distortions (the observations are detailed in Section III). This inspires us to exploit distortions that are friendly to humans but obstruct algorithms to pollute the original character CAPTCHA. Specifically, adversarial perturbation [12] exactly meets this requirement: adversarial attack¹ and CAPTCHA share the common intention that the same distortion is imperceptible to humans but significantly affects algorithms. The notorious characteristic of adversarial perturbation for visual understanding turns out to be exactly the desired one for CAPTCHA design.

Inspired by this, we employ adversarial perturbation to design robust character-based CAPTCHAs in this study. Current state-of-the-art cracking solutions view CAPTCHA OCR (Optical Character Recognition) as a sequential recognition problem [13]–[17]. To remove potential distortions, further image preprocessing operations are typically added before OCR. Correspondingly, in this study we propose to simultaneously attack multiple targets to address the sequential recognition issue (Section IV-A), and to differentiably approximate image preprocessing operations (Section IV-C) and stochastic image transformations (Section IV-D) in the adversarial example generation process to cancel out their potential influence. Moreover, since we have no knowledge of the detailed algorithm a cracking solution uses (e.g., its neural network structure), the generated adversarial examples are expected to be resistant to unknown OCR algorithms in black-box cracking. This study addresses this issue via ensemble adversarial training, generating adversarial examples that are effective towards multiple algorithms (Section IV-B). In summary, the contributions of this study are two-fold:

• We have discovered the different vulnerabilities of humans and algorithms to visual distortions. Based on these observations, adversarial perturbation is employed to improve the robustness of character-based CAPTCHAs.

• Corresponding to the characteristics of typical OCR cracking solutions, we propose a novel methodology addressing issues including sequential recognition, indifferentiable image preprocessing, stochastic image transformation and black-box cracking.

¹ Adversarial attack refers to the process of adding small but specially crafted perturbations to generate adversarial examples that mislead an algorithm. To avoid confusion with the process of attacking CAPTCHAs, in this study we use "adversarial attack" to indicate the generation of adversarially distorted CAPTCHAs and "CAPTCHA crack" to indicate the attempt to pass CAPTCHAs with algorithms.

II. RELATED WORK

A. Character-Based CAPTCHAs

In online services, character-based CAPTCHAs are the most popular protection to deter character recognition programs. Ever since the initial goal of CAPTCHA was formulated, both friendly design and security of CAPTCHAs have been studied. A fundamental requirement is that CAPTCHAs be designed to be easy for humans but difficult for computers, and in traditional CAPTCHA design this trade-off between usability and security is difficult to balance. Three traditional designs are most common: background confusion, using lines, and collapsing [18]. Some studies instead use automatic generation methods to synthesize CAPTCHA images, e.g., using GANs, rather than manual design [19]. These automatic methods, which are applied to both character-based and image-based CAPTCHAs, are novel approaches for generating CAPTCHAs, but they still attempt to increase the content complexity of CAPTCHAs.

To overcome the limitations of traditional character-based CAPTCHAs, other designs have been proposed, e.g., 3D-based CAPTCHAs and animated CAPTCHAs [18]. 3D approaches to CAPTCHA design involve rendering 3D models to an image [20], [21]. However, it has been demonstrated that this approach is vulnerable to attacks [22], [23]. Animated CAPTCHAs attempt to incorporate a time dimension into the design, which is assumed to increase the security of the resulting CAPTCHA. Nevertheless, techniques that can successfully attack such CAPTCHA designs have also been developed [24].

The last few years have witnessed deep learning playing an important role in the field of artificial intelligence, and the recognition rate of character-based CAPTCHAs increases year by year. George et al. proposed a hierarchical model called the Recursive Cortical Network (RCN) that incorporates neuroscience insights in a structured probabilistic generative model framework, which significantly improved the recognition rate [25]. To remove interference in the background, Ye et al. proposed a GAN-based approach for automatically transforming training data and constructing solvers for character-based CAPTCHAs [26]. Convolutional neural networks show powerful performance in the recognition of various characters, including Chinese characters [27], but suffer low recognition accuracy on confusion classes; to solve this problem, Chen et al. proposed a method of selectively learning confusion classes for character-based CAPTCHA recognition [28]. As the complexity of character-based CAPTCHAs increases, methods combining convolutional and recurrent neural networks achieve state-of-the-art performance [13]–[17]. In this paper, we employ an architecture consisting of convolutional neural network (CNN) layers and long short-term memory (LSTM) as the default OCR algorithm. We also test our CAPTCHAs on the latest method [17] in Section V-D, which is an attention-based model that also consists of CNN layers and LSTM.

B. Adversarial Example

While deep learning has achieved great performance, it also has some security problems. Recent work has discovered that
existing machine learning models, not just deep neural networks, are vulnerable to adversarial examples [12]. Given a trained classifier F with model parameters W, a valid input x, and its corresponding ground-truth prediction y, i.e., y = F(x), it is often possible to find a similar input x' that is close to x according to some distance metric d(x, x') yet causes y ≠ F(x'). An example x' with this property is known as an untargeted adversarial example. A more powerful but more difficult variant, called a targeted adversarial example, requires more than a misclassification: for a chosen target label t with t ≠ y, it must hold that t = F(x').
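Restated as constrained optimization problems in the same notation (a summary of the two definitions above, not a new formulation):

```latex
% Untargeted: any misclassification near x.  Targeted: reach a chosen label t.
\min_{x'} d(x, x') \ \text{s.t.}\ F(x') \neq y
\qquad \text{vs.} \qquad
\min_{x'} d(x, x') \ \text{s.t.}\ F(x') = t,\ t \neq y
```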
Prior work on adversarial examples can be generally classified into two categories: white-box attacks and black-box attacks. A white-box attack has full knowledge of the trained classifier F, including the model architecture and the model parameters W. A black-box attack has no or only limited knowledge of the trained classifier F. The black-box setting is apparently harder for attackers than the white-box setting, since no gradient information is leaked. It may seem that black-box attacks are impossible, but adversarial examples that affect one model can often affect another model, a property called transferability [29]. In this paper, we rely on transferability and deploy ensemble-based approaches to generate adversarial CAPTCHAs.
Szegedy et al. [12] first pointed out adversarial examples and proposed a box-constrained L-BFGS method to find them. To decrease the expensive computation, Goodfellow et al. [30] proposed the fast gradient sign method (FGSM), which generates adversarial examples by performing a single gradient step. Kurakin et al. [31] extended this method to an iterative version and found that adversarial examples can survive in the physical world. Dong et al. [32] further extended the fast gradient sign method family by proposing momentum-based iterative algorithms. In addition, there are some more powerful methods, called optimization-based attack methods. DeepFool [33] is an attack technique optimized for the L2 distance metric; it is based on the assumption that the decision boundary is partly linear, so the distance and direction from the data points to the decision boundary can be calculated approximately. C&W [34] is another targeted optimization-based method, which achieves its goal by increasing the probability of the target label.
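To make the single-step nature of FGSM [30] concrete, the following is a minimal PyTorch sketch; the function and parameter names are ours, `model` is assumed to be a plain image classifier returning per-class logits (not the sequential OCR model used later in this paper), and the default step size mirrors the 0.02 used in Section III-A.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.02):
    """One FGSM step: move x along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # A single gradient-sign step of size eps increases the loss.
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()
```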
To defend against adversarial examples, several adversarial defensive methods have been proposed, and this has become an active field of AI research. Referring to [35], we generally divide adversarial defensive methods into two categories. Athalye et al. [35] identify gradient masking, also called obfuscated gradients, which leads to a false sense of security in defenses against adversarial examples. The authors argue that the reason many adversarial defenses appear to defend against adversarial examples is that fast and optimization-based attack methods cannot succeed without useful gradient information. The most common gradient masking methods include input transformation and stochastic gradients. Input transformation techniques, e.g., image cropping and image binarization, cause the gradients to be non-existent or incorrect; in this paper, image binarization necessarily results in non-differentiability if this form of gradient masking is not overcome. Other defensive methods randomize the network itself or randomly transform the input. These stochastic-gradient methods lead adversarial attack methods that use a single sample of the randomness to incorrectly estimate the true gradient. Goodfellow et al. [30] first proposed the adversarial training method, in which adversarial examples are regarded as training samples and used to fit the model until these samples are classified correctly. The idea is effective and general for all types of adversarial attacks; it makes the network more robust against adversarial examples, but at an expensive computational cost, especially at a large scale, e.g., the ImageNet [36] scale. In general, the existing defensive methods cannot completely eliminate adversarial attacks.

Many researchers have found that adversarial examples can be applied in other tasks, such as semantic segmentation [37], face detection [38], and even speech recognition [39] and translation [40]. The majority of published papers have focused on how to eliminate the impact of adversarial examples in applications. Li et al. [41] evaluated adversarial examples among different detection services, such as violence, politician, and pornography detection. Ling et al. [42] proposed a uniform platform for comprehensive evaluation of adversarial attacks and defenses in applications, which can benefit future adversarial example research. In contrast, studies on employing adversarial examples against malicious algorithms are relatively limited. Osadchy et al. [43] employed adversarial examples to design CAPTCHAs and analyzed their security and usability, but they only considered data types like MNIST and ImageNet instead of CAPTCHA data types. Zhang et al. [44] studied the robustness of adversarial examples on different types of CAPTCHAs and gave suggestions on how to improve the security of CAPTCHAs using adversarial examples. Shi et al. [45] improved the effectiveness of adversarial examples by using the Fourier transform to generate CAPTCHA images in the frequency domain. However, they only considered generating adversarial examples on CNN systems, which is essentially an adversarial attack algorithm based on the classification task; in contrast, the current state-of-the-art CAPTCHA cracking system consists of a feature extraction module and a sequential recognition module (CNN + LSTM). Shi et al. [45] also deployed character-based adversarial CAPTCHAs on a large-scale online platform and tested the proposed CAPTCHAs on convolutional recurrent neural networks [46], but ignored experiments and discussions on adversarial defense technologies, such as image binarization and adversarial training. In Section V-D, we compare our method with the ACs of [45] to show that considering sequential recognition is essential, and in Section V-B we show the necessity of considering image preprocessing.

III. DATA ANALYSIS

To justify the feasibility of employing algorithm limitations for CAPTCHA design and to motivate our detailed solution, this section conducts data analysis to answer two questions: (1) Do humans and algorithms have different vulnerabilities to visual distortion? (2) What characteristics should be considered when employing distortions to design a robust CAPTCHA?

Text-based CAPTCHA is the most widely deployed scheme, requiring subjects to recognize characters from 0-9 and A-Z.
Fig. 2. Human vs. algorithm vulnerability analysis results on Gaussian and adversarial distortions.
Due to its simplicity, character-based CAPTCHA is very effective for examining both robustness towards cracking algorithms and friendliness to humans. Therefore, this study employs character-based CAPTCHA as the example scheme to conduct data analysis, develop the solution and implement experiments. Specifically, during data analysis, we assume that each CAPTCHA question consists of a single character in an RGB image with a uniform resolution of 48 × 64 px. The character font is fixed as DroidSansMono. The remainder of the section reports the observations regarding human and algorithm character recognition performance in different scenarios.

A. Vulnerability Analysis to Visual Distortion

This subsection designs a character recognition competition between human and algorithm to analyze their vulnerability to visual distortions. We employed two types of visual distortions: (1) Gaussian white noise, a common distortion for generating CAPTCHAs. In this study, the added one-time Gaussian white noise follows a normal distribution with mean μ̃ = 0, variance σ̃ = 0.01 and constant power spectral density. (2) Adversarial perturbation, which has been recognized as imperceptible to humans but significantly confusing to algorithms. We employ the widely used FGSM [30] to add adversarial perturbation, where one-time perturbation is constituted with a step size of 0.02. To examine the change of recognition performance with increasing distortion difficulty, we added 8 levels of distortions onto the original character images accumulatively: each level corresponds to 5 one-time Gaussian white noises and adversarial perturbations, respectively. Examples of the derived distorted CAPTCHA images at different levels are illustrated in Fig. 2(a) and (b).
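The accumulation protocol for the Gaussian branch can be sketched as follows; the names and structure are ours (this is our reading of the procedure, not the original generation script), and the adversarial branch is obtained analogously by applying five FGSM steps of size 0.02 per level.

```python
import torch

def gaussian_distortion_levels(x, n_levels=8, per_level=5, var=0.01):
    """Accumulate one-time Gaussian white noise (mean 0, variance 0.01)
    to derive the 8 distortion levels; level i contains 5*i noise draws."""
    levels, noisy = [], x.clone()
    for _ in range(n_levels):
        for _ in range(per_level):
            noisy = (noisy + var ** 0.5 * torch.randn_like(noisy)).clamp(0, 1)
        levels.append(noisy.clone())
    return levels
```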
Regarding the human side, we recruited 77 master workers from Amazon Mechanical Turk (MTurk). Each subject was asked to recognize 450 character CAPTCHAs with Gaussian and adversarial distortions at different levels, respectively. Regarding the algorithm side, we employed a state-of-the-art segmentation-based OCR (Optical Character Recognition) algorithm, which works by segmenting a text line image into individual character images and then recognizing the characters [47]. The resultant average recognition accuracies for Gaussian and adversarially distorted CAPTCHAs are shown in Fig. 2(c) and (d). We can see that, for Gaussian distorted CAPTCHAs, human recognition accuracy consistently declines as the distortion level increases, indicating that Gaussian white noise tends to undermine human vision. On the contrary, the examined OCR algorithm demonstrates good immunity to Gaussian white noise, possibly due to the noise removal effect of multiple convolutional layers [48]. It is easy to imagine that if we design CAPTCHAs by adding Gaussian white noise, then as the noise level increases, the resultant CAPTCHAs will critically confuse humans rather than obstruct the cracking OCR algorithms.

For adversarially distorted CAPTCHAs, we observed quite the opposite recognition results. Fig. 2(d) shows that humans are more robust to the adversarial perturbations, while the OCR algorithm becomes highly vulnerable as the adversarial distortion increases. This is not surprising, since adversarial perturbation is specially crafted to change the algorithm decision under the condition of not confusing humans. This characteristic of adversarial perturbation demonstrates one important limitation of algorithms with regard to human ability, which perfectly satisfies the requirement of robust CAPTCHA: the algorithm tends to fail, while the human remains successful. Therefore, we are motivated to employ adversarial examples to design robust CAPTCHAs that distinguish between algorithm and human.

B. Characteristics Affecting Robust CAPTCHA Design

The previous subsection observed that adversarial perturbation is effective at misleading the state-of-the-art OCR algorithm, which shows its potential for designing robust CAPTCHAs. However, the typical CAPTCHA cracking solution involves more than OCR, e.g., image preprocessing operations like
Fig. 4. The proposed robust CAPTCHA designing framework. The left represents the process of CAPTCHA cracking, including sequential recognition, feature extraction, image binarization (Gaussian filtering) and stochastic transformation. The right represents our solution of CAPTCHA generation, including the corresponding multi-target attack, ensemble adversarial training, differentiable approximation and expectation, respectively.
fixed. Random one-to-one mapping leads to a targeted adversarial attack, and fixed mapping leads to a non-targeted adversarial attack.⁴

⁴ The reported experimental results in Section V are based on random one-to-one mapping.

When the original set Θ contains only one character, the multi-target attack reduces to a single-target attack with the standard adversarial perturbation. In fact, according to the mechanism of output decoding in CAPTCHA cracking, we only need to misclassify any one of the character tokens to invalidate the final recognition result. Eq. (1) above provides the general case of attacking a flexible number of character tokens. In practice, the number of attacked characters is one important parameter to control the model performance: more attacked characters guarantee a higher success rate in resisting cracks, yet lead to more derived distortion and human recognition burden. The quantitative influence of the attacked character number on the image distortion level and algorithm recognition rate is discussed in Section V-C.
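The per-token hinge that this multi-target attack induces (it reappears as the second term of Eq. (4) below) can be sketched in PyTorch as follows. The tensor layout — per-token logits of shape (batch, tokens, classes) and mapped labels θ̄ of shape (batch, tokens) — and all names are our assumptions, not the authors' released code; attacking only a subset Θ of tokens amounts to masking the corresponding positions. Minimizing the hinge drives each attacked token toward its mapped label.

```python
import torch

def multi_target_loss(logits, target_labels):
    """Hinge over attacked character tokens, cf. the second term of Eq. (4):
    sum_i [ max_{j != theta_bar_i} F_j - F_{theta_bar_i} ]^+ .
    logits: (B, T, C) per-token scores; target_labels: (B, T) mapped labels."""
    tgt = logits.gather(2, target_labels.unsqueeze(2)).squeeze(2)      # (B, T)
    others = logits.scatter(2, target_labels.unsqueeze(2), float("-inf"))
    runner_up = others.max(dim=2).values                               # (B, T)
    return (runner_up - tgt).clamp(min=0).sum(dim=1).mean()
```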
B. Ensemble Adversarial Training Towards Black-Box Crack

As mentioned in Section I, CAPTCHA cracking may employ multiple OCR algorithms for character recognition. At the stage of designing the CAPTCHA, it is impractical to target one specific OCR algorithm, which requires designing adversarial CAPTCHA images that are effective against as many OCR algorithms as possible. Fortunately, it is recognized that adversarial perturbation is transferable between models: if an adversarial image remains effective for multiple models, it is more likely to transfer to other models as well [29]. Inspired by this, in order to improve the resistance to unknown cracking models, we propose to generate adversarial images that simultaneously mislead multiple models.

Specifically, given K white-box OCR models with their corresponding second-to-last layer outputs J_1, ..., J_K, we re-formulate the objective function in Eq. (1) by replacing F(x') with F̃(x') defined as follows:

\tilde{F}(x') = \sum_{k=1}^{K} \alpha_k J_k(x')    (2)

where α_k is the ensemble weight with \sum_{k=1}^{K} \alpha_k = 1. In most cases α_k = 1/K, unless one model is more important than the others. Among the three sub-modules of the OCR stage, feature extraction has the most model choices (e.g., various CNN structures such as GoogLeNet [49] and ResNet [50]), which can easily be implemented into different CAPTCHA cracking solutions. Therefore, this study addresses the black-box cracking issue by attacking multiple feature extraction models. Specifically, the training data and basic structure of J_i(x') and F(x') are identical except for the different CNN structures in the feature extraction sub-module. Regarding the number of CNN structures, the larger the value of K, the stronger the generalization capability of the derived adversarial CAPTCHA images; however, an excessive K leads to high computational complexity and trivial weights α_k that underemphasize individual models. Referring to previous studies on ensemble adversarial attack [51], 3~5 models achieve a good balance between transferability and practicality. In this study, we select K = 3 and evenly set α_k = 1/3. The experimental results in [51] show that, under the same training set, adversarial examples achieve stronger transferability when the network structures are more similar, so it is reasonable to choose models with large structural differences for ensemble adversarial training. The performance of employing ensemble adversarial training to resist different OCRs is reported in Section V-D.
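A minimal sketch of the ensemble in Eq. (2): each element of `models` is assumed to expose the second-to-last layer output J_k of one OCR model (the models differ only in their CNN backbone, as described above), and the default weights follow α_k = 1/K.

```python
def ensemble_output(models, x, weights=None):
    """F~(x') = sum_k alpha_k * J_k(x'), Eq. (2).
    models: K OCR models identical except for the CNN feature extractor,
    each returning its second-to-last layer output J_k(x')."""
    k = len(models)
    weights = weights if weights is not None else [1.0 / k] * k
    return sum(a * m(x) for a, m in zip(weights, models))
```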
C. Differentiable Approximation Towards Image Preprocessing

The data observations in Section III-B demonstrate the distortion removal consequences of the binarization operation, requiring us to consider the effect of image preprocessing in adversarial image generation. To address this, we regard the image preprocessing operation as part of the entire end-to-end solution, so that we can generate corresponding adversarial images that effectively mislead the whole cracking solution.

According to their usability for incorporation into the end-to-end solution, image preprocessing operations can be roughly divided into two categories: differentiable and non-differentiable. For each category, we select one representative operation to address in this study, i.e., Gaussian filtering and image binarization. Regarding the differentiable Gaussian filtering operation, g(x') = \frac{1}{\sqrt{2\pi}\sigma} e^{-x'^2/(2\sigma^2)}, we can readily incorporate it into the OCR model (Eq. (1), Eq. (2)) by replacing the input image x' with the preprocessed image g(x'). Both forward and backward propagation are conducted on the replaced function F(g(x')), so the generated adversarial images are expected to eliminate the effect of Gaussian filtering.

Regarding the non-differentiable image binarization, we cannot straightforwardly incorporate it into the objective function. Instead, we find a differentiable approximation s(x') to image binarization and incorporate the approximated function into the end-to-end solution. In this study, s(x') is defined as follows:

s(x') = \frac{1}{1 + e^{-(x' - \tau)/\omega}}    (3)

where τ denotes the threshold of image binarization and ω denotes the degree of lateral expansion of the curve. Note that, to guarantee that the generated adversarial images are resistant to image binarization, we only employ the approximation s(x') at the backward propagation stage to update the generated image, while the forward propagation still uses the actual x' to calculate ∇_x F(x).
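This forward/backward asymmetry can be realized with a custom autograd function. The following PyTorch sketch is our construction, not the paper's released code: the forward pass applies the actual hard binarization, the backward pass substitutes the gradient of s(x') from Eq. (3), and a fixed 3 × 3 depthwise Gaussian convolution plays the role of g(·) (τ = 0.8, ω = 0.05 and σ = 0.8, matching the values given in Section V).

```python
import torch
import torch.nn.functional as F

TAU, OMEGA = 0.8, 0.05  # binarization threshold and sigmoid width (Sec. V)

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > TAU).float()                 # actual (hard) binarization

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid((x - TAU) / OMEGA)
        return grad_out * s * (1 - s) / OMEGA    # ds/dx of Eq. (3)

def gaussian_kernel(channels, sigma=0.8):
    """Fixed 3x3 Gaussian kernel for depthwise convolution (g in the text)."""
    ax = torch.arange(-1.0, 2.0)
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return (k / k.sum()).expand(channels, 1, 3, 3).contiguous()

def phi(x):
    """phi(x') = s(g(x')): Gaussian filtering, then binarization."""
    g = F.conv2d(x, gaussian_kernel(x.shape[1]), padding=1, groups=x.shape[1])
    return BinarizeSTE.apply(g)
```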
To simultaneously resist the effects of Gaussian filtering and image binarization, we concatenate s(·) and g(·) in the final objective function. Therefore, the overall optimization problem incorporating the three proposed modules is as follows:

\min_{x'} \; d(x, x') + \lambda \cdot \sum_{\theta_i \in \Theta} \Big[ \max_{j \neq \bar{\theta}_i} \tilde{F}(\phi(x'))_j^{\theta_i} - \tilde{F}(\phi(x'))_{\bar{\theta}_i}^{\theta_i} \Big]^{+}    (4)

where F̃(·) denotes the ensemble of multiple OCR models defined in Eq. (2), and φ(x') = s(g(x')) denotes the approximated image preprocessing operations defined in Eq. (3).
D. Expectation Towards Stochastic Image Transformation

The above three subsections already suffice for general CAPTCHA generation against an OCR cracking solution. However, a potential cracker could use a number of transformations to render the adversarial perturbations meaningless; e.g., slightly rotating the image can entirely bypass general adversarial examples. Prior work has shown that such adversarial examples fail to remain effective under image transformations, and that the gradient of the expected value can keep the example adversarial even under various image transformations [52]. To integrate the image preprocessing issues with potential transformations, we compute the expectation over stochastic image transformations, including rotations of different angles. The expectation allows the construction of adversarial examples that remain adversarial over a chosen transformation distribution T. More concretely, rather than optimizing the objective function of a single example, we use a chosen distribution T of transformation functions t taking an input x' controlled by the adversary to the true input t(x') perceived by the OCR. We then re-formulate the second term in Eq. (4) by replacing x, x' with t(x), t(x'), as follows:

\mathbb{E}_{t \sim T} \sum_{\theta_i \in \Theta} \Big[ \max_{j \neq \bar{\theta}_i} \tilde{F}(\phi(t(x')))_j^{\theta_i} - \tilde{F}(\phi(t(x')))_{\bar{\theta}_i}^{\theta_i} \Big]^{+}    (5)

Furthermore, rather than simply taking d(·, ·) to constrain the solution space, we instead aim to constrain the expected effective distance between the adversarial and original inputs. The first term in Eq. (4) is replaced by the following:

\mathbb{E}_{t \sim T} \big[ d(t(x), t(x')) \big]    (6)

In practice, the distribution T can model perceptual transformations such as color change, image translation, random rotation, or addition of noise. These transformations amount to a set of random linear combinations, which are more thoroughly described in Section V-E. We can then approximate the gradient of the expected value by sampling transformations independently at each gradient descent step of optimizing the objective function and differentiating through the transformation. Given its ability to generate robust adversarial CAPTCHA images, we use the gradient of the expected value to directly eliminate the effect of stochastic transformation for differentiable image transformations. For non-differentiable image transformations, however, we cannot straightforwardly differentiate through the transformation; instead, we can apply the same strategy as in Section IV-C, finding a differentiable approximation and incorporating the approximated function into the end-to-end solution.
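A minimal sketch of this sampling scheme, restricted to random rotation as one member of T; the loss function, the number of samples per step, and the angle range are our assumptions.

```python
import torch
import torchvision.transforms.functional as TF

def eot_gradient(loss_fn, x_adv, n_samples=8, max_angle=10.0):
    """Approximate the gradient of E_{t~T}[loss(t(x'))] (cf. Eqs. (5)-(6))
    by averaging gradients over sampled transformations at each step."""
    grad = torch.zeros_like(x_adv)
    for _ in range(n_samples):
        angle = float((torch.rand(1) * 2 - 1) * max_angle)
        x = x_adv.clone().detach().requires_grad_(True)
        loss = loss_fn(TF.rotate(x, angle))   # rotation is differentiable
        loss.backward()                       # for tensor inputs
        grad += x.grad
    return grad / n_samples
```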
V. EXPERIMENTS

We examined CAPTCHA images with 4 characters in the experiments. The CAPTCHAs are RGB images with a resolution of 192 × 64 px. Regarding the cracking method, we considered image binarization and Gaussian filtering (kernel size: 3 × 3, σ = 0.8) at the image preprocessing stage. The OCR stage is instantiated with CNN structures for feature extraction and LSTM+softmax for sequential recognition. Regarding our proposed CAPTCHA generation method, image binarization is approximated with τ = 0.8, ω = 0.05, and 4 CNN structures are employed for ensemble adversarial training. All experiments are conducted on an Nvidia GTX 1080Ti GPU with 11 GB of memory.
Fig. 5. Example images (top row) and their attention maps (bottom row). From left to right, we show the original image, the image with Gaussian white noise, the adversarial image generated by our method, and the adversarial image generated by our method but without considering image preprocessing.

Fig. 6. Example CAPTCHAs with different complexity levels (from top to bottom: easy, medium, hard). Each row from left to right shows the different settings of Raw, rCAPTCHA_parallel, rCAPTCHA_w/o preprocessing and rCAPTCHA.
TABLE I
THE RECOGNITION OF DIFFERENT COMPLEXITY LEVELS OF CAPTCHAS IN THE DIFFERENT SETTINGS. THE RESULTS OF ALGORITHMS ARE OBTAINED AFTER GAUSSIAN FILTERING AND IMAGE BINARIZATION

Fig. 7. The influence of λ on derived image distortion and cracking recognition accuracy.

Fig. 8. The influence of |Θ| on derived image distortion and cracking recognition accuracy.
TABLE II
TRANSFERABILITY OF ADVERSARIAL IMAGES GENERATED BETWEEN PAIRS OF MODELS. THE ELEMENT (i, j) REPRESENTS THE ACCURACY OF THE ADVERSARIAL IMAGES GENERATED FOR MODEL i (ROW) TESTED OVER MODEL j (COLUMN)
TABLE III
DISTRIBUTION OF TRANSFORMATIONS

Fig. 10. Example CAPTCHAs with different complexity levels (from top to bottom: easy, medium, hard). Each row from left to right shows the different settings of image transformation.
[17] Y. Zi, H. Gao, Z. Cheng, and Y. Liu, "An end-to-end attack on text CAPTCHAs," IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 753–766, 2020.
[18] Y.-W. Chow, W. Susilo, and P. Thorncharoensri, "CAPTCHA design and security issues," in Proc. Adv. Cyber Secur.: Principles, Techn., Appl., 2019, pp. 69–92.
[19] H. Kwon, Y. Kim, H. Yoon, and D. Choi, "CAPTCHA image generation systems using generative adversarial networks," IEICE Trans. Inf. Syst., vol. 101, no. 2, pp. 543–546, 2018.
[20] M. E. Hoque, D. J. Russomanno, and M. Yeasin, "2D CAPTCHAs from 3D models," in Proc. IEEE SoutheastCon, 2006, pp. 165–170.
[21] M. Imsamai and S. Phimoltares, "3D CAPTCHA: A next generation of the CAPTCHA," in Proc. Int. Conf. Inf. Sci. Appl., 2010, pp. 1–8.
[22] V. D. Nguyen, Y.-W. Chow, and W. Susilo, "On the security of text-based 3D CAPTCHAs," Comput. Secur., vol. 45, pp. 84–99, 2014.
[23] Q. Ye, Y. Chen, and B. Zhu, "The robustness of a new 3D CAPTCHA," in Proc. IAPR Int. Workshop Document Anal. Syst., 2014, pp. 319–323.
[24] V. D. Nguyen, Y.-W. Chow, and W. Susilo, "Breaking an animated CAPTCHA scheme," in Proc. Int. Conf. Appl. Cryptography Netw. Secur., 2012, pp. 12–29.
[25] D. George et al., "A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs," Science, vol. 358, no. 6368, pp. 2612–2621, 2017.
[26] G. Ye et al., "Yet another text CAPTCHA solver: A generative adversarial network based approach," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2018, pp. 332–348.
[27] Y. Lv, F. Cai, D. Lin, and D. Cao, "Chinese character CAPTCHA recognition based on convolution neural network," in Proc. IEEE Congr. Evol. Comput., 2016, pp. 4854–4859.
[28] J. Chen, X. Luo, Y. Liu, J. Wang, and Y. Ma, "Selective learning confusion class for text-based CAPTCHA recognition," IEEE Access, vol. 7, pp. 22246–22259, 2019.
[29] W. Zhou et al., "Transferable adversarial perturbations," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 452–467.
[30] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in Proc. Int. Conf. Learn. Representations, 2015, pp. 1–10.
[31] A. Kurakin, I. J. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," in Proc. Int. Conf. Learn. Representations Workshop, 2017.
[32] Y. Dong et al., "Boosting adversarial attacks with momentum," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9185–9193.
[33] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: A simple and accurate method to fool deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2574–2582.
[34] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in Proc. IEEE Symp. Secur. Privacy, 2017, pp. 39–57.
[35] A. Athalye, N. Carlini, and D. A. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," in Proc. Int. Conf. Mach. Learn., 2018, pp. 274–283.
[36] J. Deng et al., "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[37] C. Xie et al., "Adversarial examples for semantic segmentation and object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1369–1378.
[38] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, "Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 1528–1540.
[39] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in Proc. IEEE Secur. Privacy Workshops, 2018, pp. 1–7.
[40] J. Li, S. Ji, T. Du, B. Li, and T. Wang, "TextBugger: Generating adversarial text against real-world applications," in Proc. Annu. Netw. Distrib. Syst. Secur. Symp., 2019, pp. 1–15.
[41] X. Li, K. Yu, S. Ji, Y. Wang, C. Wu, and H. Xue, "Fighting against deepfake: Patch&pair convolutional neural networks," in Proc. Companion Web Conf., 2020, pp. 88–89.
[42] X. Ling et al., "DEEPSEC: A uniform platform for security analysis of deep learning model," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 673–690.
[43] M. Osadchy, J. Hernandez-Castro, S. Gibson, O. Dunkelman, and D. Pérez-Cabo, "No bot expects the DeepCAPTCHA! Introducing immutable adversarial examples, with applications to CAPTCHA generation," IEEE Trans. Inf. Forensics Secur., vol. 12, no. 11, pp. 2640–2653, Nov. 2017.
[44] Y. Zhang, H. Gao, G. Pei, S. Kang, and X. Zhou, "Effect of adversarial examples on the robustness of CAPTCHA," in Proc. Int. Conf. Cyber-Enabled Distrib. Comput. Knowl. Discovery, 2018, pp. 1–109.
[45] C. Shi et al., "Text CAPTCHA is dead? A large scale deployment and empirical study," in Proc. ACM Conf. Comput. Commun. Secur., 2020, pp. 1–16.
[46] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, Nov. 2017.
[47] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, "High-performance OCR for printed English and Fraktur using LSTM networks," in Proc. 12th Int. Conf. Document Anal. Recognit., 2013, pp. 683–687.
[48] F. Liao et al., "Defense against adversarial attacks using high-level representation guided denoiser," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1778–1787.
[49] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[51] Y. Liu, X. Chen, C. Liu, and D. Song, "Delving into transferable adversarial examples and black-box attacks," in Proc. Int. Conf. Learn. Representations, 2017, pp. 1–10.
[52] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, "Synthesizing robust adversarial examples," in Proc. Int. Conf. Mach. Learn., 2018, pp. 284–293.
[53] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
[54] R. R. Selvaraju et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[55] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–10.