
WEB CAPTCHA

HUMAN OR SCRIPT?
An AI approach to cryptography
Overview
Vulnerabilities, Threats, Controls
2 Precursors
4 Proposals
6 General Approaches
3 Deployment Options
If time: issues and links
Vulnerabilities
HTTP does not distinguish between
human & machine users.
HTTP & SSL do not guarantee client
software or user is benign.
Malicious bots can be anonymous and
distributed.
Benign bots spider for searches, etc.
Threats to Web
Content Theft-- stealing paid data
Copyright Infringement-- “scraping” content
from one site to display on another, “out of
context”
Unwanted spidering-- search engines may
ignore robots.txt or “nofollow” tags
Poll Stuffing-- MIT vs. CMU on Slashdot (/.) [1]
Web Spam-- unsolicited commenting,
abusing free email, scraping addresses
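The robots.txt point can be made concrete with a small Python sketch. It uses the standard library's `urllib.robotparser`; the rules and the bot name `ExampleBot` are invented for illustration. The check is voluntary: a polite spider runs it, and nothing stops a malicious one from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules above permit `agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)

print(allowed("/public/index.html"))
print(allowed("/private/data.html"))
```

Because the server never enforces these rules, robots.txt is a courtesy rather than a control.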
Web Spam
Web comments, discussions, guest
books, Wikis, many public forms are
open to spam messages.
More eyeballs per message than e-mail
E-mail spam is illegal, but most Web
spam is legal.
Bots collect email addresses on Web.
Motives
Google-- more links, higher ranking
Profit-- ads for real product/service
Phishing-- bait and switch for identity
theft, financial theft
Astroturfing-- promote agenda by
simulating “grassroots” word-of-mouth
Vandalism-- competition, damage,
thrill, revenge, activism, etc.
Cracked Controls
IP tracking/banning-- repurposed DDoS
scripts; IP masking, hijacking
User Authentication-- if not easily
cracked, attackers use a service like bugmenot.com
Moderation (human review)-- script
makes own moderator account in DB
Good start, but may need more.
CAPTCHA™
Acronym for Completely Automated
Public Turing test to tell Computers &
Humans Apart-- Dr. Manuel Blum
Reverse Turing test-- computers finding
humans, not humans finding computers
A category, not a specific solution
Precursors
Unpublished manuscript by Moni Naor
first mentions automated Turing test in
1997, but not proposed or formalized.
AltaVista patent in 1998 is the first practical
example of using slightly distorted
images of text to deter bots, but it only
defeats stock OCR, not custom OCR
Definition
In 2000, formalized by Luis von Ahn, Manuel Blum
& Nicholas J. Hopper of Carnegie Mellon
“A CAPTCHA is a cryptographic
protocol whose underlying hardness
assumption is based on an AI problem.”
[1]
www.captcha.net
Win-Win
If cracked, AI is advanced because a
very difficult (unsolved) AI problem has
been solved;
If not cracked, steganographic
cryptography is advanced [1]
CAPTCHA.net Proposals
Gimpy-- text distortion used by Yahoo!
(routinely cracked & improved)
Bongo-- visual puzzle, like Mensa tests
(with 4 options, a random guess succeeds 25% of the time)
Pix-- photographic recognition (need
large image DB, or Google API)
Sounds-- voice synthesis, distortion
Gimpy
Images of distorted text.
Frequently cracked and
improved.
In current version, 5 pairs of
overlapped words. User
identifies 3 words.
Random placement, font,
distortion, background pattern
Overlapping words need no
noise.
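The pass rule described above (5 pairs of words shown, 3 words required) can be sketched in Python. This is a minimal sketch, not the Gimpy implementation: the word list, function names, and case-insensitive matching are assumptions, and the image rendering and distortion steps are omitted.

```python
import secrets

# Tiny stand-in dictionary; a real deployment would use a much larger list.
WORDS = ["apple", "bridge", "candle", "dragon", "engine", "forest",
         "garden", "harbor", "island", "jacket", "kettle", "ladder"]

def new_challenge(pairs: int = 5) -> list[tuple[str, str]]:
    """Pick 2*pairs distinct words and group them into pairs to overlap."""
    rng = secrets.SystemRandom()
    chosen = rng.sample(WORDS, 2 * pairs)
    return [(chosen[i], chosen[i + 1]) for i in range(0, len(chosen), 2)]

def passes(challenge, answers, required: int = 3) -> bool:
    """Accept if at least `required` distinct answers match displayed words."""
    shown = {word for pair in challenge for word in pair}
    given = {answer.strip().lower() for answer in answers}
    return len(shown & given) >= required
```

Counting distinct matches keeps a script from passing by repeating one recognized word three times.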
Bongo
Visual puzzle
Computer can generate &
display, but not solve.
If too many choices,
humans get it wrong.
If not enough choices,
computers can be effective
with random guess.
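The trade-off between the number of choices and guessing can be worked out directly. The sketch below assumes uniform random guessing and, for the multi-round case, independent puzzles; chaining several rounds is an assumption added here, not something the Bongo proposal specifies.

```python
def guess_success(choices: int, rounds: int = 1) -> float:
    """Chance that uniform random guessing passes every round."""
    return (1 / choices) ** rounds

print(guess_success(4))      # 0.25, the 4-option case
print(guess_success(2, 4))   # 0.0625, four two-option puzzles in a row
```

More rounds cut the guessing rate without adding choices to any single puzzle, at the cost of more work for human users.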
Pix
Photo Recognition
Need large image DB
Images need keywords
Four images with same keyword shown
Random subset of keywords as choices
Poor implementations easy to crack
(color of top left pixel unique, etc.)
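The challenge construction above can be sketched as follows. The database contents, function name, and default of three keyword choices are hypothetical. File names are deliberately opaque, since a name that contains the keyword would leak the answer in the same way as the top-left pixel example.

```python
import secrets

# Hypothetical labeled database: keyword -> opaque image file names.
IMAGE_DB = {
    "cat":   ["a91.jpg", "c07.jpg", "f33.jpg", "k52.jpg", "m18.jpg"],
    "tree":  ["b24.jpg", "d80.jpg", "g15.jpg", "n61.jpg", "q02.jpg"],
    "boat":  ["e46.jpg", "h73.jpg", "j29.jpg", "p95.jpg", "r11.jpg"],
    "clock": ["s38.jpg", "t64.jpg", "u05.jpg", "v87.jpg", "w20.jpg"],
    "shoe":  ["x53.jpg", "y76.jpg", "z09.jpg", "l42.jpg", "o68.jpg"],
}

def new_pix_challenge(db, n_images: int = 4, n_choices: int = 3) -> dict:
    """Pick one keyword, show n_images of it, offer it among decoy keywords."""
    rng = secrets.SystemRandom()
    answer = rng.choice(sorted(db))
    images = rng.sample(db[answer], n_images)
    decoys = rng.sample([k for k in db if k != answer], n_choices - 1)
    choices = decoys + [answer]
    rng.shuffle(choices)  # so the answer's position gives nothing away
    return {"images": images, "choices": choices, "answer": answer}
```

The `answer` field stays on the server; only `images` and `choices` would be sent to the client.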
General Approaches
Text (ASCII/Unicode)
Image
Speech
Animation
3-D
Combinations of all above
ASCII/Unicode ©4Pt¢h4
Change text to look-alike: SPAM is $P4M.
Fools simplest text matching.
Accented or non-English chars: Spám
Chars to words: uce@ftc.gov --> uce at ftc
dot gov
URL/HTML entities: COPY becomes
¢0Ρ¥ or %430P%59
Better than nothing, but easy to crack
It is not technically CAPTCHA
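Why this is easy to crack shows up in a few lines of Python. The sketch covers only the chars-to-words case from the list above; the function names are illustrative.

```python
import re

def obfuscate(address: str) -> str:
    """Spell out '@' and '.' as words."""
    return address.replace("@", " at ").replace(".", " dot ")

def deobfuscate(text: str) -> str:
    """Undo the obfuscation; this is all a harvesting bot needs."""
    text = re.sub(r"\s+at\s+", "@", text)
    return re.sub(r"\s+dot\s+", ".", text)

hidden = obfuscate("uce@ftc.gov")
print(hidden)               # uce at ftc dot gov
print(deobfuscate(hidden))  # uce@ftc.gov
```

The reversal takes two regular expressions, which is why this approach only stops the simplest text matching.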
Image CAPTCHA
Presents one-time-password as an image
humans can read, but not scripts
If image is too simple, OCR can crack; too
complex, human cannot read.
To beat OCR, vary position, warp, noise,
background, colors, overlap, randomness,
font, angles, language, methods used
Show filtered photos as well as words
Can deny accessibility to vision-impaired…
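The one-time-password side of an image CAPTCHA can be sketched without any graphics code. This is a minimal sketch under stated assumptions: the alphabet, the in-memory store, and the function names are hypothetical, and rendering the OTP into a distorted image is left out.

```python
import hmac
import secrets

# Characters chosen to avoid look-alikes such as 0/O and 1/I.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"

# challenge id -> expected OTP (in-memory only, for illustration)
_pending: dict[str, str] = {}

def issue(length: int = 6) -> tuple[str, str]:
    """Create a challenge; the OTP is drawn into the image, never sent as text."""
    challenge_id = secrets.token_urlsafe(16)
    otp = "".join(secrets.choice(ALPHABET) for _ in range(length))
    _pending[challenge_id] = otp
    return challenge_id, otp

def verify(challenge_id: str, answer: str) -> bool:
    """Check an answer; pop() makes every challenge single-use."""
    expected = _pending.pop(challenge_id, None)
    if expected is None:
        return False
    return hmac.compare_digest(expected.encode(), answer.strip().upper().encode())
```

Discarding the challenge after one attempt, right or wrong, stops a script from brute-forcing a single image.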
Considering Accessibility
Government and everyone who does business with
government must meet federal accessibility
standards for disabilities. Serious legal penalties.
Professional ethics requires everyone else to do the
same, with lesser consequences.
Often ignored by amateurs, but at risk of being
considered rude.
Very few CAPTCHAs are “accessible.”
Solution (W3C): use both image & speech, manual
approval; but a chain is only as strong as its weakest link.
Speech CAPTCHA
Usually spells out one-time-password in
synthesized or recorded voices
Voice recognition cracks simple case.
Applied audio filters risk human
misunderstanding.
Used with image CAPTCHA for
increased accessibility.
If both use same OTP, easier to crack.
Animated CAPTCHA
Can use Flash, MPEG, animated GIF
Often combined with speech
Weaknesses of Image CAPTCHA apply
Usually easier to crack due to extra data
for pattern matching to analyze
Much higher processor and traffic load
Not practical in most cases
3D
Renders OTP in 3D space to image
Reputedly the most difficult to crack
Server needs good graphics card to be
practical (rare)
Can be combined with other methods
Not yet common (tEABAG_3D)
Might see more in future
Circumventing CAPTCHA
Social engineering can foil most
CAPTCHAs. How?
Scrape CAPTCHA from target site, pose it to a
human in exchange for free access to other content
(adult, news, search, blogs)
User unaware of helping spammers
Which CAPTCHA?
Even simplest CAPTCHA can beat vast
majority of scripts
Even best CAPTCHA can be cracked
by dedicated, sophisticated coders
Weigh strength vs. cost (compute
cycles, bandwidth, dollars)
Be careful not to violate accessibility
laws or open new holes.
Deploying CAPTCHA
Install existing software (pro or free)
Use remote CAPTCHA service
Develop own CAPTCHA or customize
open source scripts.
Existing Software
Hundreds or thousands of options
Narrow choices by price, server
requirements, standards compliance,
third-party testing results
Big targets-- cracking a popular control
opens hundreds of sites to spammers
Like antivirus, ineffective unless
frequently updated.
CAPTCHA Service Providers
Work even with servers not configured to
generate images or sound.
Server sends encrypted OTP to service,
which sends image to client.
Code is easy to embed (botblock)
Service updates itself automatically.
Saves bandwidth and processor time.
captchaS.net (experimental, but free)
Trust issues when outsourcing security.
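One way a site and a remote service could agree on the OTP without the site generating images is sketched below. This is a hypothetical scheme, not the API of any service named above: instead of sending an encrypted OTP, both sides derive it from a random token with HMAC-SHA256 and a shared secret. The secret value, alphabet, and function names are assumptions.

```python
import hashlib
import hmac
import secrets

SHARED_SECRET = b"example-secret"  # hypothetical; agreed with the service out of band
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def new_token() -> str:
    """Random public token the site embeds in the image URL."""
    return secrets.token_hex(16)

def derive_otp(token: str, length: int = 6) -> str:
    """Both site and service compute the same OTP from token and secret."""
    digest = hmac.new(SHARED_SECRET, token.encode(), hashlib.sha256).digest()
    # Map digest bytes to letters; the slight modulo bias is ignored in this sketch.
    return "".join(LETTERS[b % len(LETTERS)] for b in digest[:length])

def check(token: str, answer: str) -> bool:
    """Site-side check of the user's answer against the derived OTP."""
    expected = derive_otp(token)
    return hmac.compare_digest(expected.encode(), answer.strip().lower().encode())
```

The site would still need to remember used tokens, since a token that can be replayed lets one human answer serve many automated submissions.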
Custom CAPTCHA
Starting from Open Source or public domain
code, not too difficult to customize.
Customizing can make your implementation
resistant to all but direct assaults.
CAPTCHA volunteers may help you test and
improve your algorithm.
Can be stronger than using a service or
preconfigured software.
CAPTCHA Beyond the Web
Prevent dictionary attacks in any
password system (Pinkas & Sander)
Protect e-mail systems from worms,
spam, other malware-- if sender not in
address book or message is suspect,
challenge sender with CAPTCHA.
Deter unwanted macro-scripting of a
standalone application.
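The password-system case can be sketched as a simple gate in front of the login check. This is a simplified illustration, not the Pinkas & Sander protocol as published: the threshold `FREE_ATTEMPTS`, the in-memory counter, and the function names are assumptions.

```python
FREE_ATTEMPTS = 3  # hypothetical threshold before a CAPTCHA is demanded

# username -> consecutive failed logins (in-memory only, for illustration)
_failures: dict[str, int] = {}

def captcha_required(username: str) -> bool:
    """True once an account has used up its free failed attempts."""
    return _failures.get(username, 0) >= FREE_ATTEMPTS

def record_login(username: str, success: bool) -> None:
    """Reset the counter on success, increment it on failure."""
    if success:
        _failures.pop(username, None)
    else:
        _failures[username] = _failures.get(username, 0) + 1
```

A legitimate user who mistypes once or twice never sees a challenge, while a dictionary attack has to solve a CAPTCHA for every guess after the third.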
My Project
Survey CAPTCHA alternatives.
Select and install one.
Test on MAMP (Mac / PHP)
Deploy on LAMP (Linux)
Evaluate and submit to my company for
use with Wiki-based CMS
Project Status
Several false starts
First few selections either did not install,
did not meet requirements or failed
accessibility tests
Best bet now is on the service at
http://www.captchas.net
Asked for two-week extension to finish
installation and paper.