
Processing real-world HTML: a quick introduction to html5lib

Processing real-world
HTML
a quick introduction to html5lib
Edward O’Connor

Edward O’Connor, Django San Diego / SD Ruby Joint Meeting, 6 August 2009

So you’ve got some HTML.

You found it out in the wild.

Or some user typed it into a form in
your webapp.


<B>Hi, <I>Joe</b>!
<p/>
So good to </i><BLINK>finally
meet you & stuff.

You have got to be kidding me.


Image: Soupe de Tags by ~Thanh


Tag Soup
Browsers handle such markup well and
mostly uniformly.

The browser vendors have spent
countless developer-hours reverse-engineering
each other’s error recovery methods.


But tool developers have, historically,
been screwed.

We usually resort to running text
through Tidy, Beautiful Soup, or
something similar.

These tools have their own tag soup
error recovery, which often doesn’t
match what browsers do.


When your error recovery doesn’t
match browsers’ error recovery, users
get screwed. Your app is buggy.

This was the state of the art in 2004.


HTML 5

Standardizing an HTML parsing
algorithm that matches browser
behavior.


html5lib

An implementation of the HTML 5
parsing algorithm in Ruby and Python
(including Python 3).


Really easy to use.

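import html5lib

# Open in binary mode so html5lib can detect the character encoding itself,
# and use "with" so the file is closed once parsing is done.
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser()
    document = parser.parse(f)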
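
The same idea works on a string. Below is a minimal sketch that feeds the tag soup from the earlier slide to html5lib.parse and lists the elements the parser recovered. The names SOUP, parse_soup and tag_names are made up for this illustration, and the choice of the "etree" tree builder with namespaceHTMLElements=False is an assumption that just keeps tag names short.

```python
import html5lib

# The tag soup from the earlier slide, as a string rather than a file.
SOUP = "<B>Hi, <I>Joe</b>!\n<p/>\nSo good to </i><BLINK>finally\nmeet you & stuff."


def parse_soup(markup):
    """Parse markup into an ElementTree element rooted at <html>."""
    # namespaceHTMLElements=False keeps tags plain ("body", not "{namespace}body").
    return html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)


def tag_names(root):
    """Return the tag of every element under root, in document order."""
    return [element.tag for element in root.iter()]


root = parse_soup(SOUP)
names = tag_names(root)
print(names)
```

No exception is raised for the mismatched and unclosed tags; the parser applies the spec's error recovery and always hands back a complete html/head/body tree.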


Just as easy in Ruby.

require 'html5lib/html5parser'
include HTML5
f = File.open("mydocument.html")
document = HTMLParser.parse(f)


Thousands of tests
"Python html5lib implements the spec
so well, it even implements an infinite
loop." — @gsnedders

fixed in html5lib 8 days ago: revision 21ce65db1e
fixed in HTML5 spec yesterday: r3538


Tree building
Plugs into your favorite DOM or
DOM-like API

Python: minidom, ElementTree, lxml, Beautiful Soup
Ruby: REXML, Hpricot
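
As a rough sketch of choosing a tree builder in Python: html5lib.getTreeBuilder("dom") selects the minidom-backed builder, and the parser then returns an xml.dom.minidom document. The helper name parse_to_minidom and the sample markup are invented for this example.

```python
import html5lib


def parse_to_minidom(markup):
    """Parse markup and build the result as an xml.dom.minidom Document."""
    builder = html5lib.getTreeBuilder("dom")
    parser = html5lib.HTMLParser(tree=builder)
    return parser.parse(markup)


# The second <p> implicitly closes the first, as it does in browsers.
doc = parse_to_minidom("<title>Hi</title><p>One<p>Two")
paragraphs = doc.getElementsByTagName("p")
texts = [p.firstChild.data for p in paragraphs]
print(texts)
```

From here on the document is an ordinary minidom tree, so existing DOM code can work with it unchanged.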


Tree walking
Python: dom, ElementTree, genshi, lxml, pulldom,
Beautiful Soup
Ruby: REXML, Hpricot
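
A tree walker turns a parsed tree back into a flat stream of tokens (start tags, end tags, characters and so on), which is what filters and the serializer consume. A small sketch, assuming the "etree" walker; the function name start_tags is hypothetical.

```python
import html5lib


def start_tags(markup):
    """Parse markup, walk the tree, and collect the names of all start tags."""
    tree = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
    walker = html5lib.getTreeWalker("etree")
    # Each token is a dict; "type" says what kind of token it is.
    return [token["name"] for token in walker(tree) if token["type"] == "StartTag"]


names = start_tags("<p>Hi <b>there</b>")
print(names)
```

The html, head and body entries show up even though the input never mentioned them, because the walker sees the tree the parser built, not the original text.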


Filters
Sanitizer (whitelists)
Conformance checker (validator)
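
Filters sit between a tree walker and the serializer. The sketch below wires the sanitizer filter into that pipeline for an untrusted fragment. It assumes an html5lib release that still ships html5lib.filters.sanitizer (recent releases mark it as deprecated), and the helper name sanitize is made up here.

```python
import html5lib
from html5lib.filters import sanitizer
from html5lib.serializer import HTMLSerializer


def sanitize(markup):
    """Parse an HTML fragment and serialize it through the sanitizer filter."""
    fragment = html5lib.parseFragment(markup, treebuilder="etree")
    walker = html5lib.getTreeWalker("etree")
    # The filter wraps the token stream; tags that are not on its
    # allowed list are turned into escaped text instead of markup.
    stream = sanitizer.Filter(walker(fragment))
    return HTMLSerializer().render(stream)


clean = sanitize("<b>hi</b><script>alert(1)</script>")
print(clean)
```

The allowed element survives as markup, while the script tag comes out as escaped text that a browser will display rather than execute.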


Liberal character set detection
(chardet)
My skeertuig is vol palings •
我的氣 船裝滿了 魚 • Mia
kusenveturilo estas plena je angiloj


Infoset coercion (ihatexml.py)

Can happily take in real-world HTML
as input into an XML toolchain
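
For contrast, here is what a strict XML parser does with the tag soup from the earlier slide. This sketch uses only xml.etree.ElementTree from the Python standard library, not html5lib or ihatexml.py, and the names SOUP and strict_xml_accepts are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The tag soup from the earlier slide.
SOUP = "<B>Hi, <I>Joe</b>!\n<p/>\nSo good to </i><BLINK>finally\nmeet you & stuff."


def strict_xml_accepts(markup):
    """Return True if a strict XML parser can parse markup, else False."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(strict_xml_accepts(SOUP))
print(strict_xml_accepts("<p>ok</p>"))
```

An XML toolchain gives up at the first mismatched tag, which is why the HTML has to go through an HTML parser first and be handed over as an already well-formed tree.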


Liberal XML parser

Think the Universal Feed Parser, but
for any XML.


Questions?
http://edward.oconnor.cx/2009/08/djangosd-html5lib

CC BY-SA 3.0
