
Breaking Google’s reCAPTCHA:

The Story of the Stiltwalker project and dc949

reCAPTCHA is Google's free CAPTCHA service, a device to prevent bots from filling in forms: website registrations, contests, comment spam, and so on. It is used extensively by Twitter, Facebook, Ticketmaster, and Craigslist. The term CAPTCHA, or Completely Automated Public Turing test to tell Computers and Humans Apart, was coined and first used in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University.

What does that have to do with dc949?

Most hacker security conferences have contests; Shmoocon is no different. One of their recent contests involved amassing the maximum number of Twitter followers. What if you could sign up a very large number of accounts and have them all follow yours? Team dc949 took the challenge, but found that on the third Twitter account creation, the user had to pass a reCAPTCHA test to prove the accounts weren't being requested by a script or bot. Being erudite hackers, dc949 decided that sidestepping the reCAPTCHA roadblock would let them generate more followers, and tried to write a script to do just that. The attempt failed, as did their effort to win the Twitter follower contest.

Some time later, C-P and Adam of dc949 realized the same thing was bothering both of them: it should be possible to defeat audio reCAPTCHA. Project Stiltwalker was born. In the meantime, Adam was presciently taking a machine learning class at Stanford.
Fig. 1. Stars of LayerOne: Jeffball, C-P (deep in thought and motion), and Adam. Note the very cool speaker badges (designed by nullspacelabs) on Jeffball and Adam that resemble Cylons. If you look closely at Jeffball's badge, you can see the static image of the moving "eye." On Adam's badge, that "eye" is just about off the badge edge. C-P's badge is in his pocket.

reCAPTCHA Analysis

The reCAPTCHA lexicon contained 58 words, spoken over a background of NPR speech chopped up and played backwards. The words consisted of numbers, colors, kitchen items, days of the week, vehicles, and six miscellaneous words, and were sometimes combined in weird ways: "Eight Christmas," "Redline Purple," etc.

How do you analyze the words and subtract the background? After initial attempts, it appeared that subtracting the background did not appreciably alter the success rate, so attempts to account for it stopped. dc949 analyzed the frequency of the words over time and found that the background sat at about 3.4 kHz, while the words themselves sat at about 5 kHz.
Fig. 2. Frequency versus time: real words show as peaks at about 5 kHz.
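That band separation can be exploited directly: scan short FFT frames and flag the ones where energy near 5 kHz dominates the 3.4 kHz background. A minimal sketch in Python with numpy (the sample rate, band widths, and 2x threshold are illustrative assumptions, not dc949's actual parameters):

```python
import numpy as np

def word_frames(signal, rate=16000, frame=512, hop=256):
    """Return indices of frames whose ~5 kHz energy dominates the
    ~3.4 kHz background band (frequencies from dc949's analysis;
    frame size, hop, and the 2x threshold are guesses)."""
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / rate)
    word_band = (freqs > 4500) & (freqs < 5500)    # spoken words
    noise_band = (freqs > 2900) & (freqs < 3900)   # NPR background
    hits = []
    for i, start in enumerate(range(0, len(signal) - frame, hop)):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        if spectrum[word_band].mean() > 2.0 * spectrum[noise_band].mean():
            hits.append(i)
    return hits

# Synthetic check: 0.5 s of 3.4 kHz "background", then 0.5 s with a 5 kHz "word" added.
t = np.arange(16000) / 16000.0
bg = 0.2 * np.sin(2 * np.pi * 3400 * t)
sig = bg.copy()
sig[8000:] += np.sin(2 * np.pi * 5000 * t[8000:])
```

On the synthetic signal, only frames in the second half (where the 5 kHz tone lives) are flagged, and the background-only signal produces no hits at all.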

By using a Fast Fourier Transform, they were able to identify target words in the spectrogram. Then dc949 used pHash, a perceptual hashing library that measures how similar two hashes are, to compare hashes of the spectrograms of human-analyzed (known) words against those of the unknown samples.
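The comparison step can be approximated with a much simpler perceptual hash: shrink each spectrogram to an 8x8 grid of block means, threshold at the median to get a 64-bit fingerprint, and pick the library word at the smallest Hamming distance. A rough sketch (this average-hash stand-in is my own simplification for illustration, not the pHash library dc949 actually used):

```python
import numpy as np

def spectro_hash(spec, grid=8):
    # Shrink the 2-D spectrogram to grid x grid block means,
    # then threshold at the median: a 64-bit perceptual fingerprint.
    h, w = spec.shape
    spec = spec[: h - h % grid, : w - w % grid]
    blocks = spec.reshape(grid, spec.shape[0] // grid,
                          grid, spec.shape[1] // grid).mean(axis=(1, 3))
    return (blocks > np.median(blocks)).ravel()

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def best_match(unknown_hash, library):
    # Nearest neighbour in Hamming space over the hand-solved library.
    return min(library, key=lambda word: hamming(library[word], unknown_hash))
```

The point of a perceptual hash is that a slightly perturbed copy of a known spectrogram lands a few bits away, while an unrelated spectrogram lands around half the bits away.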

dc949 listened to 100,000 samples, then solved 50,000 of them by hand. They split those samples into two sets and used pHash to compare the spectrogram-derived hashes. The best match against their now-known library was what Stiltwalker would submit to reCAPTCHA. A recording that dc949 played during their talk showed just how mind-numbingly boring and tedious that task was. It appeared to have driven them to frequent bouts of drinking, extending even to their presence on stage.
Fig. 3. Adam, C-P, and Jeffball drink up onstage at LayerOne: Johnnie Walker Blue Label and Coca-Cola.
Fig. 3 (alternate). C-P, Adam, and Vyrus of dc949 onstage at LayerOne: Johnnie Walker Blue Label and Coke.

Improving the Solution

Using some fairly complex mathematics, Adam built a neural network across dc949's computers, creating up to 5000 nodes and multiple outputs. As he explained, the machine learning involved is very similar to linear regression; he set up neural networks to reduce 2048 inputs down to 58 outputs, one per word in the reCAPTCHA lexicon. By plotting frequency versus amplitude and trying to match that curve, a sample would match only that inflection and that specific background.
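The "very similar to linear regression" setup can be sketched as a single softmax layer: 2048 FFT-derived features in, 58 word scores out, trained by gradient descent on hand-solved samples. A toy numpy version (the layer sizes come from the article; the training details are illustrative, not Adam's actual network):

```python
import numpy as np

N_IN, N_OUT = 2048, 58   # FFT features in, one output per lexicon word

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, epochs=50, lr=0.5):
    # Multinomial logistic regression: cross-entropy loss,
    # full-batch gradient descent from a zero initialization.
    W = np.zeros((X.shape[1], N_OUT))
    b = np.zeros(N_OUT)
    onehot = np.eye(N_OUT)[y]
    for _ in range(epochs):
        p = softmax(X @ W + b)
        grad = (p - onehot) / len(X)
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(W, b, X):
    # argmax of the logits equals argmax of the softmax probabilities.
    return (X @ W + b).argmax(axis=1)
```

On synthetic data with 58 well-separated cluster centers, this single layer classifies its training samples essentially perfectly, which is the sense in which the problem reduces to regression.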
Fig. 4. Red marks represent a broken frequency "map" of the word red, blue marks represent the word blue, and green marks represent the word green. What word is represented by the black marks? No, not green. It's red.
Fig. 5. Amplitude versus frequency of target words, showing a good, but not perfect, curve fit. This curve better fits general cases of the word red, versus a more exacting fit. Stiltwalker measures the distance between an unknown word and the various curves generated by known words to find a match.

By solving many CAPTCHAs beforehand, dc949 coached the neural network into making better choices. Then Adam and his peeps created backup solvers - more neural nets, trained on different combinations of words, inflections, and background noises - to handle varying spoken inflection and background noise. The best combination chained 13 different solvers together. Essentially, they measured the distance from an unknown sample to each known, solved word and assigned that distance a certainty. A given sample might have a 1% certainty of being boat but a 97% certainty of being kettle, so the guess would be kettle. Among the challenges: reCAPTCHA intertwined three simultaneous instances of background noise. Like humans, dc949 simply ignored the background noise - and it worked!
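The chained-solver idea can be sketched like this: each solver reports per-word distances, distances become certainties, and the first solver confident past a threshold wins. A hypothetical sketch (the 1/distance certainty formula and the 0.9 threshold are my guesses; the article doesn't give dc949's exact rules):

```python
def certainties(distances):
    # Turn per-word distances into pseudo-probabilities:
    # closer means more certain. Illustrative, not dc949's formula.
    inv = {w: 1.0 / (1e-9 + d) for w, d in distances.items()}
    total = sum(inv.values())
    return {w: v / total for w, v in inv.items()}

def chained_guess(solvers, sample, threshold=0.9):
    # Try each solver in turn; accept its answer once one is confident
    # enough, otherwise fall back to the last solver's best guess.
    best = None
    for solve in solvers:
        certs = certainties(solve(sample))
        best = max(certs, key=certs.get)
        if certs[best] >= threshold:
            return best
    return best
```

With an unsure first solver (boat and kettle nearly equidistant) and a confident second one (kettle far closer), the chain skips the first opinion and answers kettle, matching the 1%-boat/97%-kettle example above.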

Word Mashing – Get Fuzzy

Something else made this project more interesting, and slightly easier to solve: for audio, reCAPTCHA wasn't always looking for an exact match. This is best explained by example. Remember, the audio tests are designed for the visually impaired, so red is the same as read, and blew is the same as blue. Sometimes vowels don't matter, so spauaoauuoiuaioioaien = spoon = spn. Thorsday is also acceptable. Smart = smrt. Merging is also permissible, sometimes: fork and four become fourk (and no, spork doesn't work; they tried). Why merge? It reduces the key space - less area that has to be searched for a match. Because reCAPTCHA tolerates one miss, only five words need to be submitted; by brute-forcing merges, two words can be combined into one and the merge submitted for both slots. As an example:

Solved: seven oven black six zero nine

Submit: soven soven black six zero - this works!
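The vowel-insensitive matching above can be modeled by collapsing every word to its consonant skeleton before comparing. A guess at the rule in Python (reCAPTCHA's real grader was never published; this just reproduces the article's examples):

```python
VOWELS = set("aeiou")

def skeleton(word):
    # Drop vowels, keep consonant order: spoon -> spn, smart -> smrt.
    return "".join(ch for ch in word.lower() if ch not in VOWELS)

def accepts(submitted, answer):
    # Vowel-insensitive comparison modeling the lenient audio grader.
    return skeleton(submitted) == skeleton(answer)
```

Under this rule spauaoauuoiuaioioaien, spoon, and spn all collapse to spn, and Thorsday matches Thursday, exactly as described.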

Other teams have tried to defeat CAPTCHAs. A team at Stanford (http://cdn.ly.tl/publications/text-based-captcha-strengths-and-weaknesses.pdf) scored 1.52% on reCAPTCHA. Carnegie Mellon's team (http://www.captcha.net/Breaking_Audio_CAPTCHAs.pdf) scored 58%. dc949 achieved 99.1%, at one point solving 846 in a row. Humans take about 8 seconds to solve a challenge; dc949's neural net did it in about 0.3-0.5 seconds. Maybe that speed is what gave away the fact that non-humans were decoding the puzzle?

That was before Google changed the way reCAPTCHA works. Before, there were six words presented over 8 seconds; now there are ten much-harder-to-distinguish words presented over 28 seconds. dc949 claimed about 28% accuracy decoding the new reCAPTCHA by ear; out of 10 tries, I solved only one. The frequencies of the background and the words are now the same.

Google’s Reaction: The Lady or The Tiger?

Two hours before dc949's presentation, reCAPTCHA suddenly became much harder to solve. The lexicon grew from 58 words to many more, the background noise changed, and each session presented ten challenges (versus six) over a much longer time period; dc949's 99.1% solution rate turned to mush. Did Google know about the planned presentation? Is that why they made the change? Or was it merely fortuitous, perhaps prompted by the earlier papers, or by foreign botnet solutions?

Frustratingly, Google demurred. From an unnamed Google spokesperson: "We try to strike the right balance to make things easier for humans but still difficult for machines to break. We measure the effectiveness of CAPTCHAs first by internal testing, then we roll out experiments and observe how real users are responding to the new one compared to the old one. We test and push refinements very frequently, so we expect your experience to improve as we continue to modify our systems."

When I pressed my contact and tried to stir the pot a bit, I received this: "We're going to continue to test our system and make improvements as necessary to provide the best balance of security and usability. Your characterization of 'breaking' reCAPTCHA is incorrect. We took swift action to fix a vulnerability that affected reCAPTCHA, and we aren't aware of any abuse that used the techniques discovered. We're continuing to study the vulnerability to prevent similar issues in the future.

"We believe most CAPTCHA-solving is being done by paid humans. Bots still do poorly at CAPTCHAs compared to humans. They have to make lots of attempts, and in doing so they often leave fingerprints we can analyze to block them. We've seen success in our attempts to shut down bots, and we'll continue to improve our systems to detect new techniques."

Conclusion

Like Frank Stockton's short story The Lady, or the Tiger?, we don't know whether Google's quick change to audio reCAPTCHA was made in response to dc949's engineering. Certainly 800,000 attempts in a short time, even against the practice interface, might have been noticeable to Google. As the unnamed spokesperson observed, no actual abuse resulted. Now, though, reCAPTCHA audio is exceedingly difficult to use; if Stiltwalker did not break this functionality outright, it has at least forced Google to retool reCAPTCHA. As so many postings and questions make clear, at some point the bots and neural nets will be able to solve these challenges - and humans will not.

For more information: http://www.google.com/recaptcha/learnmore

http://www.google.com/recaptcha/captcha - history and purpose
