IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 15, 2020

An End-to-End Attack on Text CAPTCHAs


Yang Zi, Haichang Gao, Member, IEEE, Zhouhang Cheng, and Yi Liu

Abstract— Text-based CAPTCHAs are the most widely used CAPTCHA scheme. Most text-based CAPTCHAs have been cracked. However, previous works have mostly relied on a series of preprocessing steps to attack text CAPTCHAs, which was complicated and inefficient. In this paper, we introduce a simple, generic, and effective end-to-end attack on text CAPTCHAs without any preprocessing. Through a convolutional neural network and an attention-based recurrent neural network, our attack broke a wide range of real-world text CAPTCHAs that are deployed by the top 50 most popular websites ranked by Alexa.com. In addition, this paper comprehensively analyzed the security of most resistance mechanisms of text-based CAPTCHAs through experiments. Experimental results prove that, in contrast to commonly held beliefs, the anti-segmentation principle can be completely broken under deep learning attacks without any segmentation or preprocessing steps.

Index Terms— CAPTCHA, text-based, security, CNN, RNN, attention, deep learning.

Manuscript received July 11, 2018; revised April 7, 2019 and May 25, 2019; accepted July 9, 2019. Date of publication July 15, 2019; date of current version September 24, 2019. This work was supported by the National Natural Science Foundation of China under Grant 61472311. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Karen Renaud. (Corresponding author: Haichang Gao.) The authors are with the Institute of Software Engineering, Xidian University, Xi'an 710071, China (e-mail: hchgao@xidian.edu.cn). Digital Object Identifier 10.1109/TIFS.2019.2928622

I. INTRODUCTION

CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) is used to distinguish malicious bots from legitimate users [1] by automatically setting up specified tests that are difficult for computers but easy for humans. This technology has almost become a standard security mechanism used to prevent automatic voting, registration, spam and dictionary attacks on passwords on websites. The most commonly used type of CAPTCHA is a text-based CAPTCHA that requires the user to decipher characters within an image.

To enhance the security of text CAPTCHAs, CAPTCHA designers have made every effort to design CAPTCHAs that are robust to different automated attacks, and their measures have included distorting or rotating characters and applying noise or a complex background. In early studies, the process of mainstream segmentation-based attacks consisted of three main steps: preprocessing, segmentation and recognition. Considering that different CAPTCHA schemes may have distinct features due to their unique designs and generation algorithms, attackers must accordingly design different preprocessing and segmentation methods. It is clear that the whole attack process is very tedious and lacks generalization capacity. Some previous studies claimed to propose generic methods [2]–[4], but these methods still have limitations. Reference [2] only aimed at hollow text CAPTCHAs; [3] used a brute-force approach with a high cost; and [4] utilized a Gabor filter, but the component combination was complex, and the graph search was time-consuming. With the development of deep learning, the application of deep learning to CAPTCHA breaking has become a trend. Reference [5] introduced a convolutional neural network (CNN) in their attack on Microsoft's CAPTCHA in 2015. In 2018, Tang et al. [6] claimed that they proposed a generic method, and they adopted a CNN as well. However, these methods were mostly based on preprocessing and segmentation techniques, which were complicated and inefficient.

In this paper, we propose an end-to-end method that is simple, generic and effective in breaking text CAPTCHAs. This method utilizes an attention-based model that consists of a CNN and a recurrent neural network (RNN). We tested our attack on text CAPTCHAs deployed by the top 50 most popular websites (ranked by Alexa.com). The experimental results demonstrated that our model obtained high success rates without any segmentation or other preprocessing techniques. Additionally, we conducted a systematic analysis of the effectiveness of all common resistance mechanisms under deep learning attacks. Our targeted schemes cover almost all real-world CAPTCHA design features. The analysis proved that the segmentation-resistance principle may no longer be applicable under deep learning attacks. This paper also analyzed the security of a series of uncommon CAPTCHAs. All target CAPTCHAs were broken with high success rates, indicating that our model is generic for various CAPTCHA schemes.

The rest of this paper is organized as follows: Section II first briefly introduces the most commonly used CAPTCHA resistance mechanisms and then summarizes previous methods in text-based CAPTCHA breaking; Section III introduces the network structure of our model and the training process in detail; Section IV evaluates the security of a range of real-world text CAPTCHAs; Section V presents a comprehensive analysis of the effectiveness of existing resistance schemes; Section VI discusses the future development direction of text-based CAPTCHAs; and Section VII concludes the paper.

II. BACKGROUND

A. Summary of Resistance Mechanisms

Text CAPTCHAs are the most commonly used CAPTCHA schemes. The intuitiveness and low cost of generation are the main reasons for the popularity of this type. However, since the development of image processing techniques, early text CAPTCHAs are vulnerable to rogue programs due to their simple structure. A series of resistance mechanisms have been adopted to enhance the security of text CAPTCHAs. We summarize the most widely used schemes as follows:

Fig. 1. Samples of real-world CAPTCHA schemes with different resistance mechanisms: (a) character overlapping, (b) rotation, (c) distortion, (d) hollow, (e) varied-length, (f) multi-fonts, (g) noisy arcs, and (h) background interference.

1) Character Overlapping: The segmentation-resistance principle was a commonly accepted guideline for CAPTCHA designers, and having the characters overlap is a good application of the principle, as shown in Figure 1(a).
2) Rotation: Rotation means to rotate the characters at a certain angle, as shown in Figure 1(b).
3) Distortion: Distortion means that the text of a CAPTCHA is locally or globally warped, as shown in Figure 1(c).
4) Hollow: This design uses contour lines to form characters (see Figure 1(d)) to increase the difficulty of extracting an individual character.
5) Variable Length: The string lengths of most text CAPTCHAs are fixed. However, some CAPTCHAs adopt a variable string length. For the Microsoft CAPTCHA shown in Figure 1(e), the text length varies from 8 to 10.
6) Multi-Font: Multi-font means that different fonts are used in the same CAPTCHA image, as shown in Figure 1(f).
7) Noisy Arcs: Using noisy arcs is an anti-segmentation defense. As shown in Figure 1(g), noisy arcs include thin foreground arcs, thick foreground arcs and thin background arcs.
8) Background Interference: Background interference embeds characters into a complex background to prevent attackers from separating and segmenting the CAPTCHA, as shown in Figure 1(h).

In practice, a CAPTCHA scheme always combines several resistance mechanisms to guarantee its security.

B. Previous Works

1) Ad Hoc Method: This type of method aims to break a specific CAPTCHA with particular features. Early studies including [7], [8] and [9] used shape feature descriptors, distortion estimation techniques or different preprocessing techniques to break a specific CAPTCHA. In contrast to the early sophisticated algorithms, Yan and El Ahmad [10] used the pixel counting method to attack Captchaservice.org with a success rate of nearly 100%. Reference [11] proposed new segmentation techniques and managed to crack several CAPTCHA schemes but failed on CCT CAPTCHAs. In 2015, Karthik and Recasens [5] broke the Microsoft CAPTCHA by template matching with a success rate of 5.6%. They also attempted to use a CNN as the character classifier and achieved a success rate of 57.1%. The CNN they used is a variation of LeNet-5, which consists of three convolutional layers and three fully-connected layers. However, Karthik's segmentation algorithm is based on the assumption that the string length of each CAPTCHA is constant. If the string length and character size are variable, their segmentation method will lose efficacy. In addition, Starostenko et al. [12] broke four CAPTCHAs with success rates ranging from 40.4% to 94.3%, and Gao et al. [13] attacked a Microsoft two-layer CAPTCHA by transforming the CAPTCHA into two single-layer CAPTCHAs and achieved a success rate of 44.6%.

2) Toolbox Method: This kind of method refers to applying corresponding techniques in the toolbox to specific CAPTCHAs. In 2009, PWNtcha [14] broke a large range of CAPTCHAs. The success rates ranged from 49% to 100%. However, all of the methods are cumbersome and result in a slow attack speed. In 2010, Li et al. [15] built a set of CAPTCHA-breaking tools to break 3 e-banking CAPTCHAs and 41 login CAPTCHA schemes. In 2011, Bursztein et al. [16] proposed Decaptcha, which broke 13 out of 15 schemes with success rates ranging from 4.9% to 66.2%. The whole process includes five stages and seven techniques. Various techniques were selected based on the features of each CAPTCHA.

3) Generic Method: Gao et al. [2] broke five hollow CAPTCHA schemes with success rates ranging from 36% to 89%, but their method lost effectiveness when dealing with non-hollow schemes. Their team also proposed another method based on the Gabor filter [4] that effectively broke a variety of text CAPTCHAs with success rates ranging from 5% to 77%. The disadvantage of their method is that the preprocessing step is not efficient enough, especially when dealing with CAPTCHAs adopting complicated background interference. Bursztein et al. [3] applied their approach to multiple CAPTCHAs. The success rates ranged from 5.3% to 55.2%. The drawback to this method is that as the length of the CAPTCHA increases, the computation cost increases exponentially.

Tang et al. [6] used two CNNs in their pipeline method and broke eleven real-world CAPTCHA schemes, with success rates ranging from 10.1% to 90.0%. The CNNs in their method are both based on the LeNet-5 network and consist of four convolutional layers, two max-pooling layers and three fully-connected layers. The first CNN in their method is responsible for predicting the CAPTCHA length, and the other is used for recognizing individual characters. However, the training process of two individual CNNs is time-consuming. Stark et al. [17] managed to break CAPTCHAs in one step using a single CNN. However, the CAPTCHA in their experiment is a very basic, fixed-length scheme. Therefore, the robustness of their method remains to be tested.
In 2017, Le et al. [18] combined a CNN and an RNN and proposed an end-to-end model. The CNN structure consists of six convolutional layers, three max-pooling layers and two fully-connected layers. First, the CNN embeds a CAPTCHA image into a fixed-dimensional embedding vector. At each time step, the LSTM concatenates the image embedding, the hidden state and the one-hot prediction of the prior time step as input to predict the one-hot prediction of each character. However, when solving real-world CAPTCHAs, their model requires an extra fine-tuning process, making their method a two-stage attack. George et al. [19] proposed a recursive cortical network (RCN) that broke four schemes with success rates ranging from 57.1% to 66.6%. For each scheme, nine parameters of this network must be set manually. Additionally, the average attack speed is 94 seconds, which is extremely slow.

III. NETWORK STRUCTURE

Our goal is to recognize the complete character sequence without any segmentation. We use a sequence-to-sequence model similar to that in [20], which was originally designed for the text recognition task in natural scenes, in our experiment. The model is based on a CNN, a long short-term memory (LSTM) network and a new attention mechanism. The architecture is briefly illustrated in Figure 2. The model mainly consists of two parts: an encoder and a decoder.

Fig. 2. Architecture overview.

A. Feature Extraction

The encoder of our model is essentially a CNN. The CNN is used to extract features from the whole CAPTCHA image. Specifically, the function of the CNN is to embed the original CAPTCHA image into an embedding vector that contains the global information of the original CAPTCHA image. The feature vector extracted by the encoder is denoted as f = f_{i,j,c} (i and j index the location in the feature map, and c indexes the channels).

The Inception-v3 network [21], which is the key component of GoogLeNet and achieves excellent performance in the ILSVRC2014 image classification task [22], was used as the CNN feature extractor. Another reason that we chose the Inception-v3 network is that the inception block increases the depth and width of the network. According to the empirical CNN design principles proposed by [21], increasing the network depth and width tends to contribute to better network performance.

B. Character Recognition

The core of our model is an LSTM. It works as a decoder to transfer the feature vector f into a text sequence. Compared to a traditional RNN, which performs poorly in storing information over extended time intervals, the LSTM overcomes the weakness of the RNN in long-term dependencies and explicitly learns when to store information. In a traditional sequence-to-sequence model, at each time step, the decoder takes the whole embedding vector f as the input directly. This paper introduces the attention mechanism into our model. The bottleneck of the traditional encoder-decoder structure is that the input is constant, which limits the representation capability of the model. The attention mechanism allows the decoder to ignore irrelevant information while preserving the most significant information of the feature vector f. In essence, the attention mechanism assigns different weights to different parts of the feature vector, so that the model can focus on a specific part of the feature vector at each time step, making the predictions more accurate. This is the fundamental reason why our method can recognize every individual character without segmentation.

Bahdanau's [23] team adopted the attention mechanism in the RNN encoder-decoder framework to address the machine translation task. They first calculated the relevance of different parts of the input and output. Based on this intermediate result, they assigned different weights to the context vectors when predicting each target word. The attention mechanism helped them achieve good performance.
In [24], Chorowski et al. proposed a location-aware attention mechanism. Their method measured the relevance of each part of the feature map by integrating information from the hidden state at the current time step, the attention distribution at the prior time step and the global feature map. They adopted their attention mechanism in a speech recognition task and achieved good results.

In our task, the model sometimes needs to skip some unwanted characters during recognition, as described in Section V.G later. Therefore, the attention mechanisms above are not applicable to our task. Based on prior works, we propose a new attention mechanism and describe the details as follows.

We denote the hidden state of the LSTM at time t as s_t. At each time step t, the input to the LSTM decoder is computed by

\hat{x}_t = W_e e_{t-1} + W_{u1} u_{t-1}    (1)

where e_{t-1} represents the one-hot encoding of the predicted character at time t-1 (see the red line of the Decoder in Figure 2) and u_t is the masked embedding feature vector (see the green line of the Decoder in Figure 2), computed as

u_t = \sum_{i,j} \alpha_{t,i,j} f_{i,j,c}    (2)

We refer to \alpha_t as the attention masks at time step t. First, we compute the relative importance of each part of the feature vector f, which is obtained by combining the image information in f with a time offset s_t (see the dotted blue line in Figure 2) and the feature vector f:

\alpha_t = \mathrm{softmax}_{i,j}(\tanh(s_t, f_{i,j,c}))    (3)

The previous hidden state s_{t-1} and the current input \hat{x}_t are employed to compute the hidden output o_t and the next hidden state s_t. Finally, in order to obtain the final probability distribution over the letters at time step t, we combine the hidden output information o_t of the LSTM with the attentional feature vector u_t:

P_t = \mathrm{softmax}(W_o o_t + W_{u2} u_t)    (4)

According to the probability distribution, we use the argmax function to choose the most likely character as the final result.
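To make the decoding procedure concrete, the following is a minimal PyTorch sketch of one decoder step in the spirit of Eqs. (1)-(4). The tensor shapes, the use of an LSTMCell, and the additive form used for the score inside Eq. (3) are our own assumptions for illustration; this is not the authors' implementation.

```python
# Sketch of one attention-decoder step (Eqs. 1-4). The hidden output o_t is
# taken to be the LSTM hidden state, and tanh(s_t, f) is read as an additive
# combination of the two terms; both are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, num_classes, feat_channels, hidden=256):
        super().__init__()
        self.W_e = nn.Linear(num_classes, hidden, bias=False)        # embeds e_{t-1}
        self.W_u1 = nn.Linear(feat_channels, hidden, bias=False)     # embeds u_{t-1}
        self.score_s = nn.Linear(hidden, feat_channels, bias=False)  # mixes s_t into the score
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.W_o = nn.Linear(hidden, num_classes, bias=False)
        self.W_u2 = nn.Linear(feat_channels, num_classes, bias=False)

    def forward(self, e_prev, u_prev, state, feats):
        # feats: (B, H*W, C) feature map f flattened over the spatial positions (i, j)
        s_prev, c_prev = state
        x_t = self.W_e(e_prev) + self.W_u1(u_prev)                   # Eq. (1)
        s_t, c_t = self.lstm(x_t, (s_prev, c_prev))                  # o_t taken as s_t here
        scores = torch.tanh(feats + self.score_s(s_t).unsqueeze(1))  # additive reading of tanh(s_t, f)
        alpha = F.softmax(scores.sum(-1), dim=1)                     # Eq. (3): masks over (i, j)
        u_t = (alpha.unsqueeze(-1) * feats).sum(dim=1)               # Eq. (2)
        p_t = F.softmax(self.W_o(s_t) + self.W_u2(u_t), dim=-1)      # Eq. (4)
        char = p_t.argmax(dim=-1)                                    # greedy choice of the character
        return char, p_t, u_t, (s_t, c_t)
```

At test time this step is unrolled T times, feeding each greedy prediction back in as e_{t-1}.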
C. Implementation Details

For the encoder, we did not use the complete network structure of Inception-v3. Instead, we cropped Inception-v3 from the first layer to the "Mixed_6d" block for use as our encoder. The decoder consists of T LSTM cells, each of which contains 256 hidden units. The number of LSTM cells is equal to the maximum string length of the CAPTCHA. For the entire network structure, the number of trainable parameters reaches 8,459,676. Depending on the original size of the different CAPTCHA categories, we rescaled each CAPTCHA image so that the short edge was in the range of 100 to 200 pixels. In the training phase, the model is trained by stochastic gradient descent (SGD) with a momentum of 0.9. The learning rate is set to 0.004. In addition, we trained for 200 epochs with a batch size of 64. After each epoch, the model is tested on the validation set. If the performance of the model is better than that of the current best model, the weights of the best model are updated, and the number of the current epoch is recorded. In the testing phase, the best model is used to run an attack on the test samples.
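The configuration above can be wired together roughly as follows. torchvision's Inception-v3 happens to expose submodules with the same block names used in the paper (e.g., Mixed_6d), which makes the truncation easy to express; the preprocessing value, the checkpointing logic and the omitted decoder/loss are illustrative assumptions rather than the authors' code.

```python
# Sketch of the training setup described above (assumptions noted inline).
import torch
import torch.nn as nn
from torchvision import models, transforms

def build_encoder():
    backbone = models.inception_v3(weights=None)
    kept = []
    for name, module in backbone.named_children():
        kept.append(module)
        if name == "Mixed_6d":                     # keep layers up to and including Mixed_6d
            break
    return nn.Sequential(*kept)

encoder = build_encoder()
preprocess = transforms.Resize(150)                # short edge rescaled into the 100-200 px range
# Decoder parameters (e.g., the AttentionDecoderStep sketched earlier) would be added here as well.
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.004, momentum=0.9)

best = 0.0
for epoch in range(200):                           # batch size 64 is set in the DataLoader
    # ... one training pass over the training set ...
    val_acc = evaluate_on_validation(encoder)      # hypothetical helper: full-sequence accuracy
    if val_acc > best:                             # keep the weights of the best model so far
        best = val_acc
        torch.save(encoder.state_dict(), "best_model.pt")
```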

IV. EXPERIMENTS

A. Sample of a Large-Scale Attack on the Google CAPTCHA

Fig. 3. Samples of the difficult version of the Google CAPTCHA.

We collected a typical crowding-characters-together (CCT) CAPTCHA, a difficult version of the Google CAPTCHA, as a representative subject for our attack, as shown in Figure 3. We summarize the features of the Google CAPTCHA as follows:
• Every challenge is derived from the lowercase alphabet (26 lowercase letters).
• Varied CAPTCHA lengths are used, and the CAPTCHA lengths range from 8 to 10.
• CCT is the main segmentation resistance mechanism.
• Large degrees of rotation and distortion are applied.

This version contains many user-unfriendly challenges. For example, Figure 3(a) might be recognized as "shmomnc", while the answer is "shmoranc", and Figure 3(b) might be recognized as "deremas", while the answer is "cleremas". As a result, the difficulty of recognizing this version of the Google CAPTCHA is extremely high.

Previous studies have tried to attack this version of the Google CAPTCHA. In 2011, [16] successfully broke 13 of 15 text CAPTCHAs but failed (0% success rate) to break the Google CAPTCHA. In 2015, [12] broke the Google CAPTCHA with a success rate of 94.5%. However, the method involves seven steps before the recognition stage and is very complicated. In 2016, [4] broke this CAPTCHA using log-Gabor filters, but their success rate was only 7.8%.

Our attack consists of two phases: the training phase and the testing phase. During the training phase, we collected 10,000 CAPTCHA samples from the Google website in July 2017, which were divided into a training set and a validation set at a ratio of 3:1. Then, the original dataset was expanded 5, 10, 15 and 20 times by including newly collected samples to build another four datasets. The answer of each CAPTCHA sample in these datasets was labeled manually. To avoid human error, we double checked the answers. For each of these datasets, an individual model was trained independently.

In the testing phase, another 3,000 samples were used. We used the full sequence accuracy to benchmark the performance of our model. In other words, for a CAPTCHA sample in the test set, the full predicted character sequence was considered to be correctly resolved only if it was identical to the manually labeled answer. We trained five individual models with different numbers of samples during the training phase. The testing samples were input to each of the five trained models respectively to output the final prediction.
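Full sequence accuracy, as used throughout this paper, is simply the fraction of test CAPTCHAs whose entire predicted string matches the labeled answer. A minimal helper illustrating the metric (not taken from the authors' code):

```python
def full_sequence_accuracy(predictions, labels):
    """Fraction of CAPTCHAs whose whole predicted string equals the ground truth."""
    assert len(predictions) == len(labels)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# e.g. full_sequence_accuracy(["shmoranc", "cleremas"], ["shmoranc", "deremas"]) == 0.5
```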
With 10,000 training samples, a success rate of 42.7% was achieved. As the number of image samples of the training set increased to 50,000, 100,000, 150,000, and 200,000, the success rate increased to 87.9%, 94.5%, 97.4%, and 98.3%, respectively. The average processing time per image was less than 0.13 seconds, which is 2 times faster than [12] and 62 times faster than [4]. Obviously, despite the complexity and user-unfriendliness of the Google CAPTCHA, our method has achieved excellent results in both success rate and speed.

B. Large-Scale Attack

To evaluate the effectiveness of our attack more comprehensively, we chose sixteen CAPTCHA schemes deployed by eleven websites, including Wikipedia, Baidu, QQ, Sina, Weibo, Jd, 360, Apple, Yandex, Alipay and Microsoft, which are ranked in the top 50 according to the Alexa.com traffic rankings as of February 2018. Their design features covered almost all current mainstream resistance mechanisms, as shown in Table I.

TABLE I
RESISTANCE MECHANISMS OF TARGETED CAPTCHA SCHEMES

For each scheme, we collected 10,000 CAPTCHA images from the corresponding website and manually labeled their answers. Of these, 8,500 samples were used for training, 1,000 for validating and the remaining 500 for testing. We used the method described in Section IV(A) to calculate the final success rates. Table II summarizes the attack results for every CAPTCHA scheme. Specifically, Baidu, Jd and 360 deployed more than one kind of CAPTCHA, so we performed our attack on their different schemes separately. The final success rates ranged from 74.8% to 97.3%, proving we broke all target schemes successfully. The average attack speeds ranged from 0.08 to 0.23 seconds. These results not only far exceeded the speed of humans but also surpassed those of most traditional methods that are based on segmentation (6.22 seconds per image for [3], 15 seconds for [4], and 9.05 seconds for [13]).

TABLE II
ATTACK RESULTS ON THE CAPTCHA SCHEMES DEPLOYED BY 11 WEBSITES

TABLE III
COMPARISON WITH PREVIOUS WORKS

C. Comparison With Prior Works

Table III compares the results of our attack with those of previous methods. Obviously, our results outperformed all other CAPTCHA crackers. Moreover, our attack doesn't need any segmentation or preprocessing process, unlike traditional text CAPTCHA breaking techniques, thus making our approach simple and efficient. For any type of target text CAPTCHA scheme, we only need to collect 10,000 samples to train the model. Most importantly, our method is highly robust. No previous work has achieved such great success on such large-scale text CAPTCHA schemes.

Compared to Karthik and Tang's works, the biggest advantage of our approach is that it recognizes a CAPTCHA image without any preprocessing or segmentation. Therefore, the errors introduced by these two steps can be avoided. Through training, our model is able to focus on the font features and filter out redundant visual information such as noisy arcs or rotation angles. Therefore, our method far surpasses prior pipeline methods in both accuracy and speed.

Compared to Le's work, our method performed better for two reasons. First, we used a more advanced CNN structure, Inception-v3, which is wider and deeper in network structure. Therefore, our encoder is better in its representation capability. Second, the attention mechanism is combined with the LSTM during the decoding phase. This allows our model to focus on the most relevant information and makes the output at each time step more explicit. This explains our better success rate than that of Le's work for Wikipedia's CAPTCHA.

D. Attack With Different Number of Training Samples

Despite the good results in both accuracy and speed, our approach has one drawback: manually labeling 10,000 CAPTCHA samples is not an easy task. In theory, deep learning models perform better when using more training samples. To strike a balance between efficiency and accuracy, we conducted a series of supplementary experiments. For each CAPTCHA scheme in Table I, we set the number of samples used for training to 2,000, 4,000, 6,000 and 8,000, respectively. The ratio between the training set and the validation set is 3:1. With these different numbers of training samples, we ran our attack on each scheme respectively.

TABLE IV
ATTACK RESULTS WITH DIFFERENT NUMBER OF TRAINING SAMPLES

The attack results are listed in Table IV. Table IV shows that, with 2,000 real-world samples, our method achieved success rates over 40% on 13 out of 16 forms of CAPTCHAs. With 4,000 real-world samples, our method achieved success rates over 70% on 13 out of 16 CAPTCHAs. The CAPTCHA schemes with lower success rates tend to adopt more resistance mechanisms. For example, each of the two CAPTCHA schemes of the 360 company adopted a total of seven resistance mechanisms, which greatly increased the variations of character features.

To make the trend of the model's success rates more explicit, we present the relationship curve between the success rate and the number of training samples of each CAPTCHA scheme in Figure 4. Overall, the success rate increases as the number of training samples increases. For some schemes, as the number of samples increases, the success rate decreases slightly, which is due to the difference between the sample distributions of the training set and the testing set.

All current CAPTCHA cracking works require the manual labeling of samples. Sample sizes of 2,000 to 4,000 are relatively small-scale for deep learning techniques. If we are pursuing a balance between accuracy and efficiency, the use of 4,000 samples may be a better choice.
Fig. 4. The effect of the number of training samples on the attack success rate.

V. COMPREHENSIVE ANALYSIS

In this section, we first introduce an automatic CAPTCHA generation system. By imitating the features of real-world CAPTCHA schemes, we generated five different CAPTCHA schemes and implemented our attacks on the corresponding real-world CAPTCHAs. Then, we explored whether it is possible to train the model with one CAPTCHA scheme and use the trained model to break another scheme. Next, we comprehensively studied the security of the most commonly used resistance mechanisms and some special mechanisms such as the mix-background scheme, stylization and the two-layer structure. Taking Chinese CAPTCHAs as examples, we also studied the security of large-alphabet CAPTCHAs. Finally, we trained a generic model applicable to multiple CAPTCHA categories.

A. CAPTCHA Generation System

Synthetic data is often used as a replacement for real-world CAPTCHA samples to reduce costs. For example, [13], [17] and [26] achieved good results by using their own generated CAPTCHAs: [13] mimicked the Microsoft CAPTCHA; [17] used the Cool PHP CAPTCHA framework to generate black and white images; and [26] generated single Chinese character images with noisy arcs.

Compared to the above methods, our generation system can imitate most real-world CAPTCHA schemes, including all mechanisms listed in Section II. The automatic production process is shown in Figure 5. The basic working principle of our system is simple. First, generate individual characters. Then, local interference can be added to individual characters, or individual characters can be inserted into the background image. Finally, global interference can be added to the whole CAPTCHA image.

Fig. 5. The framework of our CAPTCHA generation system. The system first generates single characters according to pre-defined parameters; next, it embeds generated characters into a background image; finally, the system selectively adds global interference.

It should be noted that for different CAPTCHA schemes, we used different character sets and character fonts.

Character sets. For CAPTCHAs that are based on Roman characters, the character set contains 62 categories (10 digits, 26 lowercase letters and 26 uppercase letters). For Chinese CAPTCHAs, we used the GB2312 character set, which contains 3755 commonly used Chinese characters.

Fonts. Most of the CAPTCHAs that are based on Roman characters use the bold Microsoft YaHei font as the default font, except those adopting the multi-font mechanism.

For Chinese CAPTCHAs, the font library contains three fonts: KaiTi, SimSun and Microsoft YaHei.
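As a rough illustration of the three stages above, a generator along these lines can be put together with Pillow. The font file name, the parameter ranges and the arc-drawing interference are assumptions made for this sketch, not settings taken from the paper's system.

```python
# Sketch of the three-stage generation pipeline: render single characters with
# local interference, paste them onto a background, then add global noisy arcs.
import random
import string
from PIL import Image, ImageDraw, ImageFont

CHARSET = string.digits + string.ascii_letters            # 62 Roman categories
FONTS = ["msyhbd.ttc"]                                     # assumed path to bold Microsoft YaHei

def render_character(ch, size=48):
    """Stage 1: render one character and apply a random rotation (local interference)."""
    font = ImageFont.truetype(random.choice(FONTS), size)
    img = Image.new("RGBA", (size, size), (0, 0, 0, 0))
    ImageDraw.Draw(img).text((4, 0), ch, font=font, fill=(0, 0, 0, 255))
    return img.rotate(random.uniform(-30, 30), expand=True)

def generate_captcha(length=8, background=None):
    """Stages 2-3: paste characters (with overlap) onto a background, then add arcs."""
    canvas = background.copy() if background else Image.new("RGB", (40 * length, 80), "white")
    x = 5
    for ch in random.choices(CHARSET, k=length):
        glyph = render_character(ch)
        canvas.paste(glyph, (x, random.randint(0, 20)), glyph)   # small step => overlapping glyphs
        x += glyph.width - random.randint(5, 15)
    draw = ImageDraw.Draw(canvas)
    for _ in range(3):                                           # global interference: thin arcs
        xs = sorted(random.sample(range(canvas.width), 2))
        ys = sorted(random.sample(range(canvas.height), 2))
        draw.arc([xs[0], ys[0], xs[1] + 1, ys[1] + 1], 0, 300, fill="gray")
    return canvas
```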

B. Attack With Synthetic CAPTCHA Samples

In this section, we attempted to attack the real-world CAPTCHA schemes with synthetic samples. We chose the five most complicated CAPTCHAs in Table I as our target schemes, including those of QQ, Sina, 360, Apple, and Microsoft. According to the resistance mechanisms of each scheme, our CAPTCHA generation system imitated the features of these CAPTCHAs and generated corresponding samples. To improve the model's generalization ability, for each scheme, we adopted 150,000 synthetic samples (100,000 as the training set and 50,000 as the validation set) to train the model.

TABLE V
ATTACK RESULTS WITH SYNTHETIC CAPTCHA SAMPLES

The attack results of models trained on synthetic samples are listed in the 'Imitation model' column in Table V. For comparison, the attack results of models trained on real-world samples are listed in another column named 'Baseline model'. The success rates of the imitation models ranged from 3.8% to 85.7%. The accuracy on QQ's CAPTCHA was the lowest. Obviously, the similarity between our synthetic QQ scheme and the real-world scheme was not good enough for our attack. Note that for Apple's CAPTCHA, our imitation model's performance even outperformed the baseline model.

The success rates of most imitation models are much lower than those of the corresponding baseline models. The reason behind this result is that the tiny differences between the synthetic samples and the real-world samples are greatly enlarged through multiple activation layers and cause output errors.

To correct the errors caused by the difference between the synthetic and the real-world samples, we used a small number (500) of real-world samples to fine-tune the imitation models. The results are shown in the last column in Table V. The success rates of the fine-tuned models range from 77.6% to 90.9%. The performances of the fine-tuned models are comparable to or even better than those of the baseline models.

Using synthetic samples to break real-world CAPTCHA schemes requires fewer real-world samples than does the baseline method. However, imitating real-world samples is actually very time-consuming. For example, to imitate the 360 CAPTCHA, we had to consider a series of factors, including the angle of rotation, font style, degree of overlapping and background interference. The parameters of the system had to be adjusted manually before generating the synthetic samples. Therefore, the time cost of using synthetic samples to break CAPTCHAs may be even higher.

C. Generalization of the Baseline Models

To further test the generalization ability of our model, we used a model trained on one CAPTCHA scheme to attack another scheme. The model trained on the original scheme is called the baseline model. According to the attack results in Table II, we selected the two schemes with the lowest success rates (QQ and Microsoft) and the two schemes with the highest success rates (Jd and Baidu) as our experimental subjects.

TABLE VI
ATTACK RESULTS OF BASELINE MODELS AND FINE-TUNED MODELS

First, the baseline model of each scheme was used directly to attack the other three targeted schemes. As shown in Table VI, without exception, the success rates were all zero. In fact, the features of these four schemes are very distinct. Therefore, it is challenging for our baseline models to address totally different CAPTCHAs. Next, using the fine-tuning strategy, each baseline model was optimized with 1,000 real-world samples. The attack results of the fine-tuned models are also listed in Table VI.

After fine-tuning, the performance of the model was significantly improved. We found that the growth rates of the simple targeted schemes were much higher than those of the complicated schemes due to the diversity between the different schemes.

In conclusion, only if the target scheme is used to train the model, or the training samples are very similar to the target samples, can our method achieve a good performance.
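The fine-tuning strategy used here, and for the imitation models in the previous subsection, is simply continued training of an existing model on a small set of labeled target-scheme samples, usually with a reduced learning rate. A minimal helper sketch follows; the learning rate, epoch count and the passed-in loss function and data loader are assumptions, not values reported in the paper.

```python
# Generic fine-tuning helper: continue training an already-trained model on a
# small set of real-world target-scheme samples (e.g., 500 or 1,000 images).
import torch

def fine_tune(model, target_loader, loss_fn, epochs=20, lr=0.0004):
    """Fine-tune `model` in place on the target-scheme samples in `target_loader`."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # smaller lr than 0.004
    model.train()
    for _ in range(epochs):
        for images, labels in target_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)   # e.g., per-character cross-entropy
            loss.backward()
            optimizer.step()
    return model
```

The same helper covers both cases above: 500 real-world samples for the imitation models and 1,000 for the baseline models.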
D. Security of Normal Resistance Mechanisms

To further examine the effectiveness of the mainstream resistance mechanisms in text-based CAPTCHAs, we combined different resistance mechanisms to generate seven kinds of CAPTCHA schemes. The character set contained 62 categories (10 digits, 26 lowercase letters and 26 uppercase letters). As shown in Table VII, the number of resistance mechanisms increased from 2 to 8.

TABLE VII
ATTACK RESULTS ON DIFFERENT RESISTANCE MECHANISM COMBINATIONS

For each scheme in Table VII, we generated 150,000 samples as the training set and 50,000 samples as the validation set. By using the samples from different schemes to train the model, we obtained seven trained models. Then, 3,000 testing samples of each scheme were input to the corresponding model to output the prediction answers. The experimental results are listed in Table VII. As the complexity of the CAPTCHA increased, the success rate decreased from 96.1%
to 3.3%, indicating that the use of combinations of multiple resistance mechanisms does enhance the security of the text-based CAPTCHA. According to the Bursztein [16] criterion, all CAPTCHA schemes listed in Table VII were broken by our method successfully.

The first five resistance mechanisms (overlapping, rotation, distortion, hollow and variable length) have little effect on the recognition success rate. However, once the multi-font mechanism was adopted, the success rate dropped sharply from 88.7% to 13.5%. Such a huge reduction indicates that using multiple fonts in CAPTCHA design is an effective strategy. Finally, when the CAPTCHA adopted noisy arcs and background interference, the accuracy dropped only slightly, proving that these mechanisms are almost useless for defending against our deep learning based attack.

Among all the schemes, the adoption of the hollow and variable-length strategies slightly increased the attack success rate, while the adoption of the multi-font scheme led to a sharp drop in attack accuracy. To analyze the reasons behind these anomalies, we performed some further experiments.

1) Effect of Hollow and Variable-Length Mechanism: To eliminate the effects of other resistance mechanisms, we combined at most two resistance mechanisms in our supplementary experiments. First, the CAPTCHA generation system generated four basic CAPTCHA schemes, each of which adopted only one resistance mechanism, namely the overlapping, distortion, multi-font and rotation mechanisms. Then, the hollow or variable-length mechanism was added to each of these basic CAPTCHA schemes, respectively, to obtain another eight CAPTCHA schemes.

TABLE VIII
EFFECT OF HOLLOW AND VARIABLE-LENGTH RESISTANCE MECHANISM

Experimental results concerning the effectiveness of the hollow and variable-length mechanisms are listed in Table VIII. Without exception, for every basic CAPTCHA scheme that used one resistance mechanism, once the hollow mechanism was added, the success rate increased to some extent. The reason for this anomaly is the inner working mechanism of our method. Unlike traditional segmentation-based approaches, our character recognition process relies on the outline features of each character. The hollow mechanism made the outline of each character category clearer, which reduced the difficulty of the recognition task for our method. In other words, the hollow mechanism actually weakens the security of text-based CAPTCHAs against deep learning attacks.

According to the results in Table VIII, the variable-length mechanism also seemed to have a negative effect on the security of CAPTCHAs. However, by comparing the statistics of our variable-length and fixed-length CAPTCHA samples, we realized that when we adopted the variable-length mechanism in our experiment, the average length of the CAPTCHA was
shorter. For fixed-length CAPTCHAs, the length was set to 8, while the variable-length CAPTCHAs had lengths ranging from 4 to 10, with an average of 7. In fact, in the decoding process of our attack, the model recognizes one character at a time and sequences all predicted characters as the final answer. Therefore, the final recognition accuracy is inversely proportional to the CAPTCHA length.

2) Effect of the Multi-Font Mechanism: In Table VII, a total of 230 fonts was used to generate CAPTCHA samples to study the effect of the multi-font mechanism. In the real world, it is unlikely that so many fonts would be adopted in a CAPTCHA. We tested the security of six CAPTCHA schemes that adopted 1, 10, 50, 100, 200 and 230 fonts, respectively. Figure 6 shows the attack results. When the number of fonts increased from 1 to 230, the success rate dropped from 99.2% to 42.9%. The adoption of the multi-font mechanism made the shape of each character category more diverse, greatly increasing the difficulty of character recognition. The experimental results showed that the multi-font mechanism is indeed an effective mechanism to resist deep learning attacks.

Fig. 6. The attack success rates on CAPTCHAs that adopt different numbers of fonts.

3) Usability of Complex CAPTCHA Schemes: Note that the success rates of the last two CAPTCHA schemes in Table VII were only 7.7% and 3.3%, respectively, but these two schemes were also very unfriendly and unrecognizable to human beings. To comprehensively analyze the usability of the last two schemes in Table VII, we did a supplementary experiment and evaluated the usability of these two schemes from two aspects:
• Accuracy. It is measured by the percentage of passed tests in all attempts.
• Response time. It means the time cost for the users to recognize the characters in the CAPTCHA and submit their answers.

TABLE IX
USABILITY ANALYSIS OF THE TWO COMPLEX CAPTCHA SCHEMES

The two schemes were deployed online respectively. For each scheme, 50 volunteers were invited to take part in at least 30 tests. The experimental results are listed in Table IX. According to [27], the user accuracies of most visual perception based CAPTCHAs range from 70% to 98%, and the response times range from 6.8 seconds to 13 seconds. Although the response times of these two schemes are acceptable, user accuracies of 9.9% and 5.2% reflect that the two schemes are awful in usability, which means such CAPTCHAs are unlikely to be used in the real world.

E. CAPTCHAs With Special Resistance Mechanisms

In addition to the common resistance mechanisms summarized above, we also studied three special resistance mechanisms, including multiple background interference, stylization and the two-layer structure. The experimental results are listed in Table X.

TABLE X
ATTACK RESULTS ON SPECIAL RESISTANCE MECHANISMS

1) Multiple Background Interference CAPTCHAs: For each CAPTCHA image, two interferences were adopted in the background image. Our success rate on this scheme was 93.8%, and the average attack speed was 0.14 seconds, demonstrating that the multiple background interference mechanism still cannot resist our attack.

2) Stylized CAPTCHAs: Neural style transfer changes the texture of a whole image while preserving its content. Tang [6] first applied this idea to click-based CAPTCHAs to better fuse foreground objects into the background image. We combined the networks proposed by [28] and [29] as our neural style transfer network. We selected the well-known artwork Composition X as the style image to generate stylized CAPTCHAs. The success rate of 97.8% shows that neural style transfer had little effect on improving the security of text-based CAPTCHAs.

3) Two-Layer CAPTCHAs: The two-layer scheme arranges characters into two rows. This design aims at increasing the difficulty of segmentation. To study the security of the two-layer scheme against our deep learning attack, we imitated the two-layer CAPTCHA scheme of Microsoft and ran our attack on it. The final success rate we achieved was 90.5%, and the
average processing time was 0.14 seconds. Therefore, the two-layer structure is also ineffective against deep learning attacks.

F. Large-Alphabet CAPTCHAs

The character sets of large alphabets, such as Korean, Japanese and Chinese, are much larger than the Roman alphabet. Intuitively, a larger alphabet will increase the solution space, thus making it much more difficult to break the CAPTCHA.

Taking Chinese CAPTCHAs as an example, the total number of Chinese characters exceeds 20,000. Compared with those of Roman characters, the Chinese character strokes are more complicated, and many characters are very similar. This further increases the difficulty of the automatic recognition task. In this subsection, we examined whether larger alphabets are resistant to deep learning attacks.

TABLE XI
ATTACK RESULTS ON CHINESE CAPTCHAS

We used the GB2312 character set, which includes 3755 of the most commonly used characters, as the character library. Three Chinese CAPTCHAs were generated (see the first column of Table XI), which mimicked those of Baidu, BotDetect™-Chess and Tianya, respectively. Their characteristics are listed in the second column of Table XI.

We achieved 96.5%, 99.6% and 99.9% success rates on the three Chinese CAPTCHAs, respectively, with a small number of training epochs. The results showed that although there are many more Chinese characters than Roman characters, Chinese CAPTCHAs are also easy to break. Compared with [30], which required a huge cost to extract individual Chinese characters, our attack had better accuracy and efficiency.

G. Selective Recognition of Text-Based CAPTCHAs

In general, common text-based CAPTCHA schemes require users to recognize all characters to pass the test. In some cases, however, the mechanism requires that only some of the characters in the image be identified.

Fig. 7. Visualization of the attention mask.

Figure 7 displays two different CAPTCHA schemes. The left one is a Roman-character CAPTCHA that requires users to only input uppercase letters. The right one is a Chinese CAPTCHA scheme. Some of the Chinese characters are inverted. The requirement is to recognize all upright characters to pass the test.

For this type of mechanism, we do not need to make any changes to our network. We just ignore the unwanted characters when labeling the training samples. For example, the label of the left sample in Figure 7 was 'QCY'. The success rates for the two schemes were 99.7% and 99.9%, respectively. There are two reasons why we achieved such a high success rate. First, the attention mechanism allows our model to focus on the most relevant feature information and filter out irrelevant information. Second, our model relies on the font features of training samples, so our model can recognize specific characters and ignore other unwanted characters.
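In other words, selective recognition is handled entirely at labeling time. A toy illustration (the uppercase-filtering rule comes from the example above; the exact character order of the sample string is an assumption):

```python
def selective_label(ground_truth):
    """Keep only the characters the scheme asks the user to enter (here: uppercase letters)."""
    return "".join(c for c in ground_truth if c.isupper())

# e.g., an image containing the characters Q, a, C, 3, Y is labeled "QCY"; the model
# then learns to attend only to uppercase glyphs and to treat everything else as background.
assert selective_label("QaC3Y") == "QCY"
```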
To describe how the attention mechanism works in predicting individual characters, we utilized a method similar to that proposed by [31] to visualize the attention masks on the two CAPTCHAs. The attention visualization result is shown in Figure 7, with the highlighted parts representing the most relevant image regions for the current prediction. For the sample in Figure 7(a), the requirement was to output only uppercase letters, and the answer was 'QCY'. In the first step, our model focused attention on the uppercase letter 'Q'. In the second step, the attention was focused on the next uppercase character 'C' rather than the lowercase character 'a'. In the training phase, our model did not learn the font features of the character 'a'. Therefore, in the testing phase, our model treated the character 'a' as background and ignored it directly. In the third step, our model shifted attention to the next uppercase character 'Y' and ignored the numeric character '3'. The recognition process for the scheme in Figure 7(b) is similar. At each step, our model only focused attention on upright characters and bypassed all inverted ones directly.

H. A Generic Model

Our attack has achieved great success on various CAPTCHA schemes. However, for different CAPTCHA schemes, we need to collect the corresponding samples and train the model separately. Considering that there are hundreds of text-based CAPTCHA schemes in the real world, it would be desirable to train a generic model that can break multiple CAPTCHA schemes.

To improve the robustness of the attack model to various text CAPTCHAs, we applied all the resistance mechanisms listed in Section II to generate ten CAPTCHA schemes with distinct features as our training samples (as listed in Table XII). We used 200,000 samples of each CAPTCHA to train the model. The model was trained for 300 epochs. For each scheme, we prepared another 3,000 samples for testing.
TABLE XII
ATTACK RESULTS OF THE GENERIC MODEL ON SYNTHETIC SAMPLES

The success rates on the different schemes ranged from 58.7% to 96.7%, with an average speed of 0.12 seconds. The experimental results demonstrated that it is possible to break multiple kinds of CAPTCHAs with one generic model. However, it is unclear whether the existing generic model is applicable to new CAPTCHAs that have not been included in the training process. To examine this idea, we used the other five CAPTCHA schemes in Table II. For each scheme, we prepared 500 test samples and ran the attack respectively. Unfortunately, the generic model did not perform well on the new CAPTCHAs.

TABLE XIII
ATTACK RESULTS OF GENERIC MODEL, BASELINE MODEL AND FINE-TUNED MODEL

Then, for each scheme, we fine-tuned the generic model with 1,500 corresponding real-world samples. Unsurprisingly, the fine-tuned generic model achieved a higher success rate again, ranging from 65.0% to 86.4%, as shown in Table XIII. This result indicated that a generic model can indeed attack a variety of CAPTCHA schemes, but when dealing with new CAPTCHA schemes, the generic model needs to be fine-tuned with samples of the target scheme first.

VI. CAPTCHAS IN THE FUTURE

Currently, most of the CAPTCHAs in use are still text-based CAPTCHAs. Since its inception, the text CAPTCHA has been continuously updated in the confrontation between CAPTCHA designers and attackers. In 2005, [32] established the segmentation-resistance principle, which has become the cornerstone for designing text CAPTCHAs. For a long time, how to extract individual characters from a CAPTCHA image has been the primary task for attackers. However, using deep learning techniques, we overturned this principle. Under the premise of ensuring usability, the text CAPTCHA has little room for increased security.

Based on the experimental results, we present the possible future directions of text-based CAPTCHAs. According to the experimental results shown in Table VII, the multi-font mechanism dramatically reduced the success rates of our attack. The variation of character fonts increased the difficulty of the recognition process. This indicates that resisting recognition, rather than resisting segmentation, will be the likely development direction in the future. If we can find an effective method to increase the recognition difficulty, then the text CAPTCHA will still be applicable. For example, [33] and [34] described a pixel-level distortion process that produced adversarial examples which are almost visually identical to the original. The difference is not obvious to humans, but all neural networks in their study misclassified these images. Upon adding a slight perturbation to an image, a panda is classified as a gibbon. This process can be applied to text-based CAPTCHAs, as it can mislead a classifier while the image remains easily identifiable to humans.
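One widely used instance of such a pixel-level distortion is the fast gradient sign method from [34]. The sketch below assumes a differentiable PyTorch recognizer and a single cross-entropy loss; it only illustrates the kind of perturbation the text refers to and is not an evaluated CAPTCHA defense.

```python
# Minimal FGSM-style perturbation in the spirit of [34] (sketch; `model` is an
# assumed differentiable recognizer, and the loss is plain cross-entropy).
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, eps=2.0 / 255):
    """Return an adversarially perturbed copy of `image` within an L-inf ball of radius eps."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)      # recognition loss on the clean image
    loss.backward()
    perturbed = image + eps * image.grad.sign()      # step in the direction that increases the loss
    return perturbed.clamp(0.0, 1.0).detach()        # keep pixel values valid
```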
Other studies have shown that synthetic artifacts can be created to confuse a recognition network. Recently, [35] proposed a misclassification attack against state-of-the-art face recognition systems by placing physical eyeglasses on the face in the image, leading a face recognition system that is based on a deep neural network to misclassify the subject. This method can also be applied to text CAPTCHAs by pasting some physical artifacts onto the image to fool the recognition systems.

VII. CONCLUSION
This paper systematically analyzed the security of text CAPTCHAs. Our proposed attack is a generic, effective and end-to-end solution. Using deep learning techniques, we have successfully broken the difficult version of the Google CAPTCHA with a success rate of 98.3%, thereby outperforming all previous attacks. We also cracked a large number of real-world text CAPTCHAs that are deployed by the top 50 websites as ranked by Alexa.com. The success rates ranged from 74.8% to 97.3%. The average speed was less than 0.23 seconds per challenge.

We comprehensively analyzed the security of eight common resistance mechanisms by attacking seven generated CAPTCHA schemes adopting these mechanisms. We also performed our attack on some special resistance mechanisms, including the mixed-background scheme, the stylized scheme and the two-layer scheme. In addition, our method is not only applicable to common text CAPTCHAs that are based on Roman characters but also to CAPTCHAs that are based on large alphabets, such as Chinese. Experiments also proved that our method is able to perform selective recognition. Finally, we examined the feasibility of using one generic model to attack multiple CAPTCHA schemes.

The experimental results presented in this paper proved that most real-world text CAPTCHAs are not secure, and the existing resistance mechanisms are also not as effective as expected. Our experimental results also overturned the segmentation-resistance principle. Instead, the recognition-resistance principle may be the development direction for text-based CAPTCHAs. In addition, new techniques such as adversarial examples could be used to improve the security of text-based CAPTCHAs.

We expect that our work will help to promote the security of text CAPTCHAs. Current text CAPTCHAs are no longer secure. How to design CAPTCHAs that are resistant to deep learning attacks while maintaining user friendliness is an open problem and the subject of our ongoing work.

REFERENCES

[1] L. Von Ahn, M. Blum, and J. Langford, "Telling humans and computers apart automatically," Commun. ACM, vol. 47, no. 2, pp. 56–60, 2004.
[2] H. Gao, W. Wang, J. Qi, X. Wang, X. Liu, and J. Yan, "The robustness of hollow CAPTCHAs," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2013, pp. 1075–1086.
[3] E. Bursztein, J. Aigrain, A. Moscicki, and J. C. Mitchell, "The end is nigh: Generic solving of text-based CAPTCHAs," in Proc. WOOT, 2014, pp. 1–15.
[4] H. Gao et al., "A simple generic attack on text CAPTCHAs," in Proc. NDSS, 2016, pp. 1–26.
[5] C. Hong, B. Lopez-Pineda, K. Rajendran, and A. Recasens, "Breaking Microsoft's CAPTCHA," Comput. Netw. Secur. Term Projects, MIT, Cambridge, MA, USA, Tech. Rep. 6.857, 2015.
[6] M. Tang, H. Gao, Y. Zhang, Y. Liu, P. Zhang, and P. Wang, "Research on deep learning techniques in breaking text-based CAPTCHAs and designing image-based CAPTCHA," IEEE Trans. Inf. Forensics Security, vol. 13, no. 10, pp. 2522–2537, Oct. 2018.
[7] G. Mori and J. Malik, "Recognizing objects in adversarial clutter: Breaking a visual CAPTCHA," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2003, p. 1.
[8] G. Moy, N. Jones, C. Harkless, and R. Potter, "Distortion estimation techniques in solving visual CAPTCHAs," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, Jun./Jul. 2004, p. 2.
[9] K. Chellapilla and P. Y. Simard, "Using machine learning to break visual human interaction proofs (HIPs)," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 265–272.
[10] J. Yan and A. S. El Ahmad, "Breaking visual CAPTCHAs with naive pattern recognition algorithms," in Proc. 23rd Annu. Comput. Secur. Appl. Conf. (ACSAC), Dec. 2007, pp. 279–291.
[11] J. Yan and A. S. El Ahmad, "A low-cost attack on a Microsoft CAPTCHA," in Proc. 15th ACM Conf. Comput. Commun. Secur., 2008, pp. 543–554.
[12] O. Starostenko, C. Cruz-Perez, F. Uceda-Ponga, and V. Alarcon-Aquino, "Breaking text-based CAPTCHAs with variable word and character orientation," Pattern Recognit., vol. 48, no. 4, pp. 1101–1112, 2015.
[13] H. Gao, M. Tang, Y. Liu, P. Zhang, and X. Liu, "Research on the security of Microsoft's two-layer CAPTCHA," IEEE Trans. Inf. Forensics Security, vol. 12, no. 7, pp. 1671–1685, Jul. 2017.
[14] PWNtcha. Pretend We're Not a Turing Computer but a Human Antagonist. Accessed: Dec. 4, 2009. [Online]. Available: http://caca.zoy.org/wiki/PWNtcha
[15] S. Li, S. A. H. Shah, M. A. U. Khan, S. A. Khayam, A.-R. Sadeghi, and R. Schmitz, "Breaking e-banking CAPTCHAs," in Proc. 26th Annu. Comput. Secur. Appl. Conf., 2010, pp. 171–180.
[16] E. Bursztein, M. Martin, and J. Mitchell, "Text-based CAPTCHA strengths and weaknesses," in Proc. 18th ACM Conf. Comput. Commun. Secur., 2011, pp. 125–138.
[17] F. Stark, C. Hazırbaş, R. Triebel, and D. Cremers, "CAPTCHA recognition with active deep learning," in Proc. GCPR Workshop New Challenges Neural Comput., 2015, pp. 1–8.
[18] T. A. Le, A. G. Baydin, R. Zinkov, and F. Wood, "Using synthetic data to train neural networks is model-based reasoning," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3514–3521.
[19] D. George et al., "A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs," Science, vol. 358, no. 6368, p. eaag2612, 2017.
[20] Z. Wojna et al., "Attention-based extraction of structured information from street view imagery," 2017, arXiv:1704.03549. [Online]. Available: https://arxiv.org/abs/1704.03549
[21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 2818–2826.
[22] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[23] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473. [Online]. Available: https://arxiv.org/abs/1409.0473
[24] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 577–585.
[25] H. Gao et al., "Robustness of text-based completely automated public Turing test to tell computers and humans apart," IET Inf. Secur., vol. 10, no. 1, pp. 45–52, 2016.
[26] J. Chen, X. Luo, Y. Guo, Y. Zhang, and D. Gong, "A survey on breaking technique of text-based CAPTCHA," Secur. Commun. Netw., vol. 2017, Dec. 2017, Art. no. 6898617.
[27] E. Bursztein, S. Bethard, C. Fabry, J. C. Mitchell, and D. Jurafsky, "How good are humans at solving CAPTCHAs? A large scale evaluation," in Proc. IEEE Symp. Secur. Privacy, May 2010, pp. 399–413.
[28] L. A. Gatys, A. S. Ecker, and M. Bethge, "A neural algorithm of artistic style," 2015, arXiv:1508.06576. [Online]. Available: https://arxiv.org/abs/1508.06576
[29] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 694–711.
[30] A. Algwil, D. Ciresan, B. Liu, and J. Yan, "A security analysis of automated Chinese Turing tests," in Proc. 32nd Annu. Conf. Comput. Secur. Appl., 2016, pp. 520–532.
[31] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2013, arXiv:1312.6034. [Online]. Available: https://arxiv.org/abs/1312.6034
[32] K. Chellapilla, K. Larson, P. Y. Simard, and M. Czerwinski, "Computers beat humans at single character recognition in reading based human interaction proofs (HIPs)," in Proc. CEAS, 2005, pp. 1–8.
[33] C. Szegedy et al., "Intriguing properties of neural networks," 2013, arXiv:1312.6199. [Online]. Available: https://arxiv.org/abs/1312.6199
[34] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," 2014, arXiv:1412.6572. [Online]. Available: https://arxiv.org/abs/1412.6572
[35] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, "A general framework for adversarial examples with objectives," 2017, arXiv:1801.00349. [Online]. Available: https://arxiv.org/abs/1801.00349


Yang Zi is currently pursuing the master's degree in computer science with Xidian University. His current research interest is CAPTCHA.

Haichang Gao is currently a Professor with the Institute of Software Engineering, Xidian University. He has published over 30 papers. He is currently in charge of a National Natural Science Foundation of China project. His current research interests include CAPTCHA, computer security, and machine learning.

Zhouhang Cheng is currently pursuing the master's degree in computer science with Xidian University. Her current research interest is CAPTCHA.

Yi Liu is currently pursuing the master's degree in computer science with Xidian University. His current research interest is CAPTCHA.