
WEB CAPTCHA

HUMAN OR SCRIPT? An AI approach to cryptography

Overview
Vulnerabilities, Threats, Controls
2 Precursors
4 Proposals
6 General Approaches
3 Deployment Options
If time: issues and links

Vulnerabilities
HTTP does not distinguish between human and machine users.
HTTP and SSL do not guarantee that the client software or user is benign.
Malicious bots can be anonymous and distributed.
Benign bots also exist: they spider for search engines, etc.

Threats to Web
Content theft: stealing paid data
Copyright infringement: scraping content from one site to display on another, out of context
Unwanted spidering: search engines may ignore robots.txt or nofollow tags
Poll stuffing: MIT vs. CMU on Slashdot (/.) [1]
Web spam: unsolicited commenting, abusing free email, scraping addresses

Web Spam
Web comments, discussions, guest books, wikis, and many public forms are open to spam messages.
More eyeballs per message than e-mail.
E-mail spam is illegal, but most Web spam is legal.
Bots collect e-mail addresses on the Web.

Motives
Google: more links, higher ranking
Profit: ads for real product/service
Phishing: bait and switch for identity theft, financial theft
Astroturfing: promote an agenda by simulating grassroots word-of-mouth
Vandalism: competition, damage, thrill, revenge, activism, etc.

Cracked Controls
IP tracking/banning: defeated by repurposed DDoS scripts, IP masking, hijacking
User authentication: if not easily cracked, bots use a service like bugmenot.com
Moderation (human review): script makes its own moderator account in the DB
These controls are a good start, but more may be needed.

CAPTCHA
Acronym for Completely Automated Public Turing test to tell Computers & Humans Apart (Dr. Manuel Blum)
Reverse Turing test: computers finding humans, not humans finding computers
A category, not a specific solution

Precursors
An unpublished 1997 manuscript by Moni Naor first mentions an automated Turing test, but does not propose or formalize it.
A 1998 AltaVista patent is the first practical example of using slightly distorted images of text to deter bots, but it defeats only stock OCR, not custom OCR.

Definition
Formalized in 2000 by Luis von Ahn, Manuel Blum, and Nicholas J. Hopper of Carnegie Mellon and John Langford of IBM.
A CAPTCHA is a cryptographic protocol whose underlying hardness assumption is based on an AI problem.
[1] www.captcha.net

Win-Win
If cracked, AI is advanced, because a very difficult (unsolved) AI problem has been solved.
If not cracked, steganographic cryptography is advanced. [1]

CAPTCHA.net Proposals
Gimpy: text distortion, used by Yahoo! (routinely cracked and improved)
Bongo: visual puzzle, like Mensa tests (with 4 options, a random guess works 25% of the time)
Pix: photographic recognition (needs a large image DB, or the Google API)
Sounds: voice synthesis, distortion

Gimpy
Images of distorted text.
Frequently cracked and improved.
In the current version, 5 pairs of overlapped words; the user identifies 3 words.
Random placement, font, distortion, background pattern.
Overlapping words need no noise.

Bongo
Visual puzzle.
Computer can generate and display it, but not solve it.
With too many choices, humans get it wrong.
With too few choices, computers can succeed by random guessing.
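
The guessing trade-off on this slide can be put in numbers. The following is a minimal sketch, assuming the bot picks uniformly at random and that rounds are independent; the function name `guess_success` and the round counts are illustrative and not part of Bongo.

```python
from fractions import Fraction


def guess_success(choices: int, rounds: int = 1) -> Fraction:
    """Probability that uniform random guessing passes every round."""
    if choices < 1 or rounds < 1:
        raise ValueError("choices and rounds must be positive")
    return Fraction(1, choices) ** rounds


# Asking several puzzles in a row shrinks the bot's odds quickly,
# at the cost of more work (and more mistakes) for human users.
for rounds in (1, 2, 4):
    p = guess_success(4, rounds)
    print(f"4 choices, {rounds} round(s): {float(p):.4%}")
```

With 4 choices, one round gives the 25% figure quoted for Bongo; each extra round divides the bot's odds by 4.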

Pix
Photo recognition.
Needs a large image DB.
Images need keywords.
Four images with the same keyword are shown.
A random subset of keywords is offered as choices.
Poor implementations are easy to crack (e.g., the color of the top-left pixel is unique).
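
The challenge construction described on this slide might look roughly like the sketch below. The tiny `IMAGE_DB`, the file names, and the function `make_pix_challenge` are hypothetical stand-ins; a real deployment would need a large, labeled image collection.

```python
import secrets

# Hypothetical keyword -> image file names database (illustrative only).
IMAGE_DB = {
    kw: [f"{kw}_{i}.jpg" for i in range(6)]
    for kw in ("dog", "tree", "car", "boat", "clock")
}


def make_pix_challenge(db, n_images=4, n_choices=3, rng=None):
    """Pick one keyword, show n_images of it, and offer n_choices keywords."""
    if n_choices > len(db):
        raise ValueError("not enough keywords for the requested choices")
    rng = rng or secrets.SystemRandom()  # OS-backed randomness
    answer = rng.choice(sorted(db))
    images = rng.sample(db[answer], n_images)
    decoys = rng.sample([k for k in db if k != answer], n_choices - 1)
    choices = decoys + [answer]
    rng.shuffle(choices)
    # "answer" must stay on the server; only images and choices go to the client.
    return {"images": images, "choices": choices, "answer": answer}


challenge = make_pix_challenge(IMAGE_DB)
print(challenge["images"], challenge["choices"])
```

Serving the images under their stored file names, as here, would leak the keyword; that is the kind of poor implementation the slide warns about.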

General Approaches
Text (ASCII/Unicode)
Image
Speech
Animation
3-D
Combinations of all the above

ASCII/Unicode 4Pth4
Change text to look-alikes: SPAM becomes $P4M. Fools the simplest text matching.
Accented or non-English chars: Spm
Chars to words: uce@ftc.gov --> uce at ftc dot gov
URL/HTML entities: COPY becomes ¢0Ρ¥ or %430P%59
Better than nothing, but easy to crack.
Not technically a CAPTCHA.
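
To show why the slide calls this approach easy to crack, here is a minimal sketch of the chars-to-words trick and of how a harvesting script could undo it. Both function names are illustrative assumptions.

```python
import re


def obfuscate(address: str) -> str:
    """Spell out '@' and '.' as words: uce@ftc.gov -> uce at ftc dot gov."""
    return address.replace("@", " at ").replace(".", " dot ")


def deobfuscate(text: str) -> str:
    """What a harvesting bot might do: reverse the substitution with two regexes."""
    text = re.sub(r"\s+at\s+", "@", text)
    return re.sub(r"\s+dot\s+", ".", text)


hidden = obfuscate("uce@ftc.gov")
print(hidden)               # uce at ftc dot gov
print(deobfuscate(hidden))  # uce@ftc.gov
```

The reversal is two lines long, so the substitution only stops scripts that look for a literal `@`.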

Image CAPTCHA
Presents a one-time password (OTP) as an image that humans can read but scripts cannot.
If the image is too simple, OCR can crack it; if too complex, humans cannot read it.
To beat OCR, vary position, warp, noise, background, colors, overlap, randomness, font, angles, language, and the methods used.
Show filtered photos as well as words.
Can deny access to vision-impaired users.
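
The server-side bookkeeping behind a one-time password can be sketched as follows. This covers only issuing and single-use checking; rendering the distorted image is left out. The `ChallengeStore` class, the 6-character length, and the in-memory dictionary are assumptions made for illustration.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


class ChallengeStore:
    """In-memory store of pending challenges (a real site would use sessions or a DB)."""

    def __init__(self):
        self._pending = {}

    def issue(self, length=6):
        """Create a random OTP and return (challenge_id, otp)."""
        challenge_id = secrets.token_urlsafe(16)
        otp = "".join(secrets.choice(ALPHABET) for _ in range(length))
        self._pending[challenge_id] = otp
        return challenge_id, otp  # the otp would be drawn into the image

    def verify(self, challenge_id, answer):
        """Check an answer; the challenge is consumed whether or not it matches."""
        otp = self._pending.pop(challenge_id, None)  # pop makes it single use
        if otp is None:
            return False
        given = answer.strip().upper()
        return secrets.compare_digest(otp.encode(), given.encode())


store = ChallengeStore()
cid, otp = store.issue()
print(store.verify(cid, otp))  # first use of a correct answer
print(store.verify(cid, otp))  # replaying the same challenge fails
```

Consuming the challenge on every attempt means a script cannot keep guessing against one image.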

Considering Accessibility
Government, and everyone who does business with government, must meet federal accessibility standards for disabilities. Serious legal penalties.
Professional ethics requires everyone else to do the same, with lesser consequences.
Often ignored by amateurs, at the risk of being considered rude.
Very few CAPTCHAs are accessible.
Solution (W3C): use both image and speech, plus manual approval; but a chain is only as strong as its weakest link.

Speech CAPTCHA
Usually spells out the one-time password (OTP) in synthesized or recorded voices.
Voice recognition cracks the simple case.
Applying audio filters risks human misunderstanding.
Used with image CAPTCHA for increased accessibility.
If both use the same OTP, easier to crack.

Animated CAPTCHA
Can use Flash, MPEG, animated GIF.
Often combined with speech.
Weaknesses of image CAPTCHA apply.
Usually easier to crack, because the extra data gives pattern matching more to analyze.
Much higher processor and traffic load.
Not practical in most cases.

3D
Renders the OTP in 3D space to an image.
Reputedly the most difficult to crack.
Server needs a good graphics card to be practical (rare).
Can be combined with other methods.
Not yet common (tEABAG_3D).
Might see more in future.

Circumventing CAPTCHA
Social engineering can foil most CAPTCHAs. How?
Scrape the CAPTCHA from its origin and pose it to a human in exchange for free access to other content (adult, news, search, blogs).
The user is unaware of helping spammers.

Which CAPTCHA?
Even the simplest CAPTCHA can beat the vast majority of scripts.
Even the best CAPTCHA can be cracked by dedicated, sophisticated coders.
Weigh strength vs. cost (compute cycles, bandwidth, dollars).
Be careful not to violate accessibility laws or open new holes.

Deploying CAPTCHA
Install existing software (pro or free).
Use a remote CAPTCHA service.
Develop your own CAPTCHA or customize open source scripts.

Existing Software
Hundreds or thousands of options.
Narrow choices by price, server requirements, standards compliance, third-party testing results.
Big targets: cracking a popular control opens hundreds of sites to spammers.
Like antivirus, ineffective unless frequently updated.

CAPTCHA Svc Providers
Work even with servers not configured to generate images or sound.
Server sends encrypted OTP to service, which sends image to client.
Code is easy to embed (botblock).
Service updates itself automatically.
Saves bandwidth and processor time.
captchaS.net (experimental, but free)
Trust issues when outsourcing security.
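
One way a site and a remote service could agree on an OTP without the site rendering anything is sketched below. Instead of encrypting the OTP, this sketch swaps in a related technique: both sides derive the OTP from a shared secret and a random nonce using HMAC. It is a hypothetical scheme for illustration, not the actual protocol of captchas.net or any other provider; `derive_otp`, `new_nonce`, and `check_answer` are made-up names.

```python
import hashlib
import hmac
import secrets

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def derive_otp(shared_secret: bytes, nonce: str, length: int = 6) -> str:
    """Derive a short password from the secret and nonce (both sides can do this)."""
    digest = hmac.new(shared_secret, nonce.encode(), hashlib.sha256).digest()
    # Map digest bytes to letters; the slight modulo bias is ignored in this sketch.
    return "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:length])


def new_nonce() -> str:
    """Random value the site embeds in the image request sent via the client."""
    return secrets.token_hex(16)


def check_answer(shared_secret: bytes, nonce: str, answer: str) -> bool:
    """Site-side check: recompute the OTP and compare in constant time."""
    expected = derive_otp(shared_secret, nonce)
    given = answer.strip().lower()
    return hmac.compare_digest(expected.encode(), given.encode())


secret = b"example-shared-secret"  # placeholder; agreed with the service out of band
nonce = new_nonce()
print(check_answer(secret, nonce, derive_otp(secret, nonce)))
```

The site would still need to remember which nonces have been used so that each challenge works only once, and the trust issue noted on the slide remains: the service can compute every answer.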

Custom CAPTCHA
Starting from open source or public domain code, customizing is not too difficult.
Customizing can make your implementation resistant to all but direct assaults.
CAPTCHA volunteers may help you test and improve your algorithm.
Can be stronger than using a service or preconfigured software.

CAPTCHA Beyond the Web
Prevent dictionary attacks in any password system (Pinkas & Sander).
Protect e-mail systems from worms, spam, and other malware: if the sender is not in the address book or the message is suspect, challenge the sender with a CAPTCHA.
Deter unwanted macro-scripting of a standalone application.
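
The dictionary-attack idea can be illustrated with a simplified login gate that starts demanding a CAPTCHA after a few failed attempts. This is a sketch of the general idea only, not the Pinkas & Sander protocol itself; the `LoginGate` class, the threshold of 3, and the callback signatures are assumptions.

```python
MAX_FREE_FAILURES = 3  # illustrative threshold


class LoginGate:
    """Require a solved CAPTCHA once an account has too many failed logins."""

    def __init__(self, check_password, check_captcha):
        self._check_password = check_password  # callable(user, password) -> bool
        self._check_captcha = check_captcha    # callable(answer) -> bool
        self._failures = {}

    def captcha_required(self, user):
        return self._failures.get(user, 0) >= MAX_FREE_FAILURES

    def attempt(self, user, password, captcha_answer=None):
        # Past the threshold, the password is not even tested without a CAPTCHA,
        # so a script cannot keep cycling through a dictionary.
        if self.captcha_required(user):
            if captcha_answer is None or not self._check_captcha(captcha_answer):
                return False
        if self._check_password(user, password):
            self._failures.pop(user, None)
            return True
        self._failures[user] = self._failures.get(user, 0) + 1
        return False


gate = LoginGate(lambda u, p: p == "s3cret", lambda a: a == "human")
for _ in range(3):
    gate.attempt("bob", "wrong-guess")
print(gate.captcha_required("bob"))
```

Legitimate users who mistype once or twice never see a challenge, while each further guess costs an attacker a solved CAPTCHA.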

My Project
Survey CAPTCHA alternatives.
Select and install one.
Test on MAMP (Mac / PHP).
Deploy on LAMP (Linux).
Evaluate and submit to my company for use with a Wiki-based CMS.

Project Status
Several false starts.
The first few selections either did not install, did not meet requirements, or failed accessibility tests.
Best bet now is the service at http://www.captchas.net
Asked for a two-week extension to finish installation and paper.
