Processing Real-world HTML

Edward O'Connor of DjangoSD gives an overview of html5lib, an HTML parser and tokenizer for both Ruby and Python that is compatible with the major desktop browsers.

This talk was part of the DjangoSD/SD Ruby mashup meeting.

Watch a video at http://www.bestechvideos.com/2009/12/21/sd-ruby-episode-70-processing-real-world-html

More info:

Published by: Best Tech Videos on Dec 21, 2009
Copyright: Attribution Non-commercial

Processing real-world HTML
a quick introduction to html5lib
Edward O’Connor
Edward O’Connor, Django San Diego Django San Diego / SD Ruby Joint Meeting, 6 August 2009

So you’ve got some HTML. You found it out in the wild. Or some user typed it into a form in your webapp.

<B>Hi, <I>Joe</b>! <p/> So good to </i><BLINK>finally meet you & stuff.

… You have got to be kidding me.

Image: Soupe de Tags by ~Thanh

Tag Soup
Browsers handle such markup well and mostly uniformly. The browser vendors have spent countless developer-hours reverse-engineering each other’s error recovery methods.

But tool developers have, historically, been screwed. We usually resort to running text through Tidy, Beautiful Soup, or something similar. These tools have their own tag soup error recovery, which often doesn’t match what browsers do.

When your error recovery doesn’t match browsers’ error recovery, users get screwed. Your app is buggy. This was the state of the art in 2004.

HTML 5
Standardizing an HTML parsing algorithm that matches browser behavior.

html5lib
An implementation of the HTML 5 parsing algorithm in Ruby and Python (including Python 3).

Really easy to use.
    import html5lib
    f = open("mydocument.html")
    parser = html5lib.HTMLParser()
    document = parser.parse(f)
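
As a rough sketch of what the parser does with the tag soup from the earlier slide, the snippet below feeds that markup to html5lib and lists the elements it recovers. It assumes html5lib is installed and uses the convenience function html5lib.parse rather than an explicit HTMLParser object; the variable names are illustrative, and namespacing is switched off only so that tag names print plainly.

```python
import html5lib

# The tag soup from the earlier slide.
markup = "<B>Hi, <I>Joe</b>! <p/> So good to </i><BLINK>finally meet you & stuff."

# The default tree builder produces an ElementTree element for <html>.
# namespaceHTMLElements=False keeps tag names as plain "b", "i", ...
root = html5lib.parse(markup, namespaceHTMLElements=False)

# Every element in the repaired tree, in document order.
tags = [element.tag for element in root.iter()]

# All character data, in document order.
text = "".join(root.itertext())

print(tags)
print(text)
```

The misnested and unclosed tags end up in a well-formed tree, and the character data survives in its original order.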

Just as easy in Ruby.
    require 'html5lib/html5parser'
    include HTML5
    f = File.open("mydocument.html")
    document = HTMLParser.parse(f)

Thousands of tests
"Python html5lib implements the spec so well, it even implements an infinite loop." — @gsnedders
fixed in html5lib 8 days ago: revision 21ce65db1e
fixed in HTML5 spec yesterday: r3538

Tree building
Plugs into your favorite DOM or DOM-like API
Python: minidom, ElementTree, lxml, Beautiful Soup
Ruby: REXML, Hpricot
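
A minimal sketch of switching tree builders in Python, assuming an installed html5lib release in which the "dom" (minidom) and "etree" (ElementTree) builders are selectable by name through the treebuilder argument of html5lib.parse. The sample markup and variable names are illustrative; which builders are available depends on the html5lib version.

```python
import html5lib

markup = "<B>Hi, <I>Joe</b>! <p/> So good to </i><BLINK>finally meet you & stuff."

# Same parsing algorithm, two different output trees.
dom_document = html5lib.parse(markup, treebuilder="dom",
                              namespaceHTMLElements=False)
etree_root = html5lib.parse(markup, treebuilder="etree",
                            namespaceHTMLElements=False)

# Query the result with the minidom API ...
blink_nodes = dom_document.getElementsByTagName("blink")

# ... or with the ElementTree API.
blink_element = etree_root.find(".//blink")

print(type(dom_document).__name__, len(blink_nodes))
print(blink_element.text)
```

Only the tree representation changes between the two calls; the error recovery applied to the markup is the same.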

Tree walking
Python: dom, ElementTree, genshi, lxml, pulldom, Beautiful Soup
Ruby: REXML, Hpricot
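
A tree walker turns a finished tree back into a stream of tokens, which can then be inspected or serialized. The sketch below assumes an installed html5lib with the "etree" walker and the HTMLSerializer class from html5lib.serializer; the sample markup and variable names are illustrative.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

markup = "<B>Hi, <I>Joe</b>! <p/> So good to </i><BLINK>finally meet you & stuff."
root = html5lib.parse(markup, namespaceHTMLElements=False)

# Pick the walker that matches the tree type produced above.
walker = html5lib.getTreeWalker("etree")

# Walking the tree yields dict tokens such as StartTag, Characters, EndTag.
start_tags = [token["name"] for token in walker(root)
              if token["type"] == "StartTag"]

# The serializer consumes the same kind of token stream.
# omit_optional_tags=False keeps <html>, <head> and <body> in the output.
html_out = HTMLSerializer(omit_optional_tags=False).render(walker(root))

print(start_tags)
print(html_out)
```

Because the walker returns a generator, it is called once per consumer here instead of reusing a single stream.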

Filters
Sanitizer (whitelists)
Conformance checker (validator)
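
To illustrate the idea of a whitelist filter sitting on a token stream, here is a standard-library-only sketch. It is not html5lib's sanitizer: the whitelist_filter function, the ALLOWED_ELEMENTS set, and the hand-written token list are all hypothetical, with the token shape modeled loosely on the dict tokens a tree walker produces. A real sanitizer also has to deal with attributes, URLs, and CSS, so treat this as an illustration of the filtering pattern only.

```python
# Hypothetical whitelist, chosen for the example.
ALLOWED_ELEMENTS = {"b", "i", "p"}


def whitelist_filter(tokens, allowed=ALLOWED_ELEMENTS):
    """Yield tokens, dropping start/end tags whose name is not whitelisted."""
    for token in tokens:
        is_tag = token["type"] in ("StartTag", "EndTag")
        if is_tag and token["name"] not in allowed:
            continue  # drop the tag itself, keep the text inside it
        yield token


# Hand-written token stream standing in for a tree walker's output.
tokens = [
    {"type": "StartTag", "name": "p"},
    {"type": "StartTag", "name": "blink"},
    {"type": "Characters", "data": "finally meet you"},
    {"type": "EndTag", "name": "blink"},
    {"type": "EndTag", "name": "p"},
]

kept = list(whitelist_filter(tokens))
print([token.get("name", token.get("data")) for token in kept])
```

Writing the filter as a generator means several filters can be chained between a tree walker and a serializer without building intermediate lists.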

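Liberal character set detection (chardet)
• My skeertuig is vol palings
• [sample in a non-Latin script; unrecoverable from the extracted text]
• [sample in a non-Latin script; unrecoverable from the extracted text]
• 我的氣 船裝滿了 魚
• Mia kusenveturilo estas plena je angiloj
• [sample in a non-Latin script; unrecoverable from the extracted text]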
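
The general shape of liberal encoding detection can be sketched with the standard library alone. This is a deliberately simplified stand-in, not the algorithm html5lib or chardet actually uses: the guess_encoding function and its windows-1252 fallback are assumptions made for the example, and it only checks for a byte order mark and then tests whether the bytes decode as UTF-8.

```python
import codecs

# Byte order marks checked first, since a BOM is the strongest hint.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data, fallback="windows-1252"):
    """Return a best-guess encoding name for a byte string."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a permissive legacy encoding.
        return fallback
    return "utf-8"


esperanto = "Mia kusenveturilo estas plena je angiloj".encode("utf-8")
legacy = "palings \xe9".encode("latin-1")

print(guess_encoding(esperanto))
print(guess_encoding(legacy))
```

A statistical detector such as chardet goes further by scoring byte patterns against per-language models, which is what makes guesses possible for text with no BOM and no declared charset.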

Infoset coercion (ihatexml.py)
Can happily take in real-world HTML as input into an XML toolchain
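
The problem being solved is that HTML allows element and attribute names that XML forbids. The sketch below shows one way such names could be coerced, using only the standard library. The coerce_xml_name function, its ASCII-only notion of a legal name, and the "U" plus five hex digits escape are assumptions made for illustration, not a description of what ihatexml.py does.

```python
import re

# Simplified, ASCII-only approximation of XML name rules (an assumption;
# real XML names allow many more characters).
NAME_START = re.compile(r"[A-Za-z_]")
NAME_CHAR = re.compile(r"[A-Za-z0-9_.\-]")


def coerce_xml_name(name):
    """Replace characters that are illegal in an XML name with an escape."""
    pieces = []
    for index, char in enumerate(name):
        pattern = NAME_START if index == 0 else NAME_CHAR
        if pattern.fullmatch(char):
            pieces.append(char)
        else:
            # Escape as "U" plus the code point in five hex digits.
            pieces.append("U%05X" % ord(char))
    return "".join(pieces)


print(coerce_xml_name("blink"))
print(coerce_xml_name('on"click'))
print(coerce_xml_name("1up"))
```

An escape that encodes the code point keeps the mapping reversible, so the original names could be restored when converting back to HTML.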

Liberal XML parser
Think the Universal Feed Parser, but for any XML.

Questions?
http://edward.oconnor.cx/2009/08/djangosdhtml5lib

CC BY-SA 3.0