Beautiful Soup Documentation (Beautiful Soup 4.4.0)
Getting help
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Quick Start
Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
One common task is extracting all the URLs found within a page's <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
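If a page uses relative links, the same loop can feed into the standard library's urljoin to build absolute URLs. A minimal sketch (the base URL and href values here are made up for illustration, not taken from the example document):

```python
from urllib.parse import urljoin  # Python 3; on Python 2 it lives in urlparse

base = "http://example.com/index.html"
hrefs = ["elsie", "/lacie", "http://other.example/tillie"]

# Relative hrefs are resolved against the page URL; absolute ones pass through.
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
```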
Another common task is extracting all the text from a page:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
Does this look like what you need? If so, read on.
Installing Beautiful Soup

Beautiful Soup 4 is published through PyPi, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4
(The BeautifulSoup package is probably not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code you should install beautifulsoup4.)

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.
Installing a parser
Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib
This table summarizes the advantages and disadvantages of each parser library:

Python's html.parser
  Typical usage: BeautifulSoup(markup, "html.parser")
  Advantages: batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2)
  Disadvantages: not very lenient (before Python 2.7.3 or 3.2.2)

lxml's HTML parser
  Typical usage: BeautifulSoup(markup, "lxml")
  Advantages: very fast; lenient
  Disadvantages: external C dependency

lxml's XML parser
  Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
  Advantages: very fast; the only currently supported XML parser
  Disadvantages: external C dependency

html5lib
  Typical usage: BeautifulSoup(markup, "html5lib")
  Advantages: extremely lenient; parses pages the same way a web browser does; creates valid HTML5
  Disadvantages: very slow; external Python dependency
If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib; Python's built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
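As a quick, hedged illustration of those differences (output shown only for the standard library parser; what the others build depends on which parsers you have installed):

```python
from bs4 import BeautifulSoup

# Invalid markup: a stray </p> with no matching <p>.
broken = "<a></p>"

# html.parser ignores the stray end tag and closes the dangling <a>:
tree = str(BeautifulSoup(broken, "html.parser")
)
print(tree)

# lxml or html5lib, if installed, would each build a different tree for the
# same markup (html5lib, for instance, adds the <html>/<head>/<body> skeleton
# a browser would).
```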
Making the soup

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to an XML or HTML tag in the original document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.
Name
Every tag has a name, accessible as .name:

tag.name
# u'b'

If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
Attributes
A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

tag['class']
# u'boldest'

You can access that dictionary directly as .attrs:

tag.attrs
# {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
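Because a tag behaves like a dictionary, .get() also accepts a default value, which is handy when an attribute may be missing. A small sketch (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="/index.html">home</a>', "html.parser").a

print(tag.get("href"))         # the attribute exists, so its value is returned
print(tag.get("id", "no-id"))  # a missing attribute falls back to the default
```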
Multivalued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'
NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode():

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>
You can't edit a string in place, but you can replace one string with another, using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.
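For example, using str(), the Python 3 spelling of unicode():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

navigable = soup.b.string  # a NavigableString, tied to the whole parse tree
plain = str(navigable)     # a plain string copy with no reference to the tree

print(type(plain).__name__)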
BeautifulSoup
The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

Since the BeautifulSoup object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its .name, so it's been given the special .name "[document]":

soup.name
# u'[document]'
Comment

Tag, NavigableString, and BeautifulSoup cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The only one you'll probably ever need to worry about is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here's an example that replaces the comment with a CDATA block:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
Navigating the tree

I'll use the "three sisters" document as an example to show you how to move from one part of a document to another.
Going down
Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children. Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.
Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:
soup.body.b
# <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag's children are available in a list called .contents:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

len(soup.contents)
# 1
soup.contents[0].name
# u'html'

Instead of getting them as a list, you can iterate over a tag's children using the .children generator:

for child in title_tag.children:
    print(child)
# The Dormouse's story
.descendants
The .contents and .children attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:

head_tag.contents
# [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". There's a sense in which that string is also a child of the <head> tag. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on:

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
.string
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.string
# u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None
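When .string is None because a tag has several children, get_text() is one way to collect all the text anyway. A small sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")

print(soup.p.string)      # None: the <p> tag has more than one child
print(soup.p.get_text())  # concatenates every string inside the tag
```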
.strings and .stripped_strings

If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'

Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
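A common pattern is to join the stripped strings back into a single normalized line. A minimal sketch, using a small made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>two</b>   three</p>", "html.parser")

# Each stripped string has its surrounding whitespace removed, so joining
# with a single space normalizes the text.
text = " ".join(soup.stripped_strings)
print(text)
```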
Going up
Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.
.parent
You can access an element's parent with the .parent attribute. In the example "three sisters" document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:

title_tag.string.parent
# <title>The Dormouse's story</title>
The parent of a top-level tag like <html> is the BeautifulSoup object itself:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
.parents
You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
# None
Going sideways
Consider a simple document like this:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

The <b> tag and the <c> tag are at the same level: they're both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.
.next_sibling and .previous_sibling

The <b> tag has a .next_sibling, but no .previous_sibling, because there's nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
The strings "text1" and "text2" are not siblings, because they don't have the same parent:

sibling_soup.b.string
# u'text1'

print(sibling_soup.b.string.next_sibling)
# None
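You can verify both claims directly. This sketch rebuilds the sibling document (without the stray </b> of the example above) and checks parents and siblings:

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

# "text1" and "text2" live under different parents, so they are not siblings...
print(sibling_soup.b.string.parent.name)
print(sibling_soup.c.string.parent.name)

# ...but the <b> and <c> tags themselves are siblings:
print(sibling_soup.b.next_sibling.name)
```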
.next_siblings and .previous_siblings

You can iterate over a tag's siblings with the .next_siblings or .previous_siblings generators.

Going back and forth

An HTML parser takes a string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.
.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling
# u';\nand they lived at the bottom of a well.'
But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it's the word "Tillie":

last_a_tag.next_element
# u'Tillie'

That's because in the original markup, the word "Tillie" appeared before that semicolon. The parser encountered an <a> tag, then the word "Tillie", then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word "Tillie" was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None
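Because .next_elements yields strings as well as tags, you can filter it with isinstance when you only want one kind of element. A small sketch on a made-up document:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p><b>one</b>two</p>", "html.parser")

# Walk everything parsed after the <p> tag, keeping only the tags.
tag_names = [el.name for el in soup.p.next_elements if isinstance(el, Tag)]
print(tag_names)
```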
Searching the tree

Once again, I'll be using the "three sisters" document as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
By passing in a filter to an argument like find_all(), you can zoom in on the parts of the document you're interested in.
Kinds of filters
Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.
A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
A regular expression
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
This code finds all the tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title
A list
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
True
The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
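In other words, find_all(True) is a quick way to count or list every tag. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hi</b></p>", "html.parser")

# Two tags (<p> and <b>); the text string "hi" is not included.
all_tags = soup.find_all(True)
print(len(all_tags))
```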
A function
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.

Here's a function that returns True if a tag defines the "class" attribute but doesn't define the "id" attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into find_all() and you'll pick up all the <p> tags:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a> tags, because those tags define both "class" and "id". It doesn't pick up tags like <html> and <title>, because those tags don't define "class".
If you pass in a function to filter on a specific attribute like href, the argument passed into the function will be the attribute value, not the whole tag. Here's a function that finds all a tags whose href attribute does not match a regular expression:

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The function can be as complicated as you need it to be. Here's a function that returns True if a tag is surrounded by string objects:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# p
# a
# a
# a
# p
Now we're ready to look at the search methods in detail.
find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more:
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a value for string, or id? Why does find_all("p", "title") find a <p> tag with the CSS class "title"? Let's look at the arguments to find_all().
The name argument

Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.

This is the simplest usage:

soup.find_all("title")
# [<title>The Dormouse's story</title>]
The keyword arguments

Any argument that's not recognized will be turned into a filter on one of a tag's attributes. You can filter multiple attributes at once by passing in more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Some attributes, like the data-* attributes in HTML 5, have names that can't be used as the names of keyword arguments:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Searching by CSS class

It's very useful to search for a tag that defines a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. You can search by CSS class using the keyword argument class_ instead. As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')

css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won't work:

css_soup.find_all("p", class_="strikeout body")
# []

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
The string argument

With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
soup.find_all(string="Elsie")
# [u'Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the <a> tags whose .string is "Elsie":

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
The limit argument

find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don't need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL: it tells Beautiful Soup to stop gathering results after it's found a certain number.

There are three links in the "three sisters" document, but this code only finds the first two:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
The recursive argument

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
Here's that part of the document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...
The <title> tag is beneath the <html> tag, but it's not directly beneath the <html> tag: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it's allowed to look at all descendants of the <html> tag, but when recursive=False restricts it to the <html> tag's immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as find_all(): name, attrs, string, limit, and the keyword arguments. But the recursive argument is different: find_all() and find() are the only methods that support it. Passing recursive=False into a method like find_parents() wouldn't be very useful.
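The recursive=False behavior is easy to check on a tiny document; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

# <title> is a descendant of <html> but not a direct child, so:
print(soup.html.find_all("title", recursive=False))  # nothing found
print(soup.head.find_all("title", recursive=False))  # found: it is a direct child of <head>
```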
Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it: if you treat the BeautifulSoup object or a Tag object as though it were a function, it's the same as calling find_all() on that object. These two lines of code are equivalent:

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)
find()
Signature: find(name, attrs, recursive, string, **kwargs)

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it's a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all(), you can use the find() method. These two lines of code are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
The only difference is that find_all() returns a list containing the single result, and find() just returns the result. If find_all() can't find anything, it returns an empty list; if find() can't find anything, it returns None.
Remember the soup.head.title trick from Navigating using tag names? That trick works by repeatedly calling find():

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>
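Since find() returns None on a miss, a chain of find() calls can raise AttributeError when an intermediate tag is missing. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

print(soup.find("title"))      # None: nothing matched
print(soup.find_all("title"))  # an empty list

# Chaining through a missing tag raises AttributeError,
# because soup.find("head") is None here:
try:
    soup.find("head").find("title")
except AttributeError as exc:
    print("chained find failed:", exc)
```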
find_parents() and find_parent()

Signature: find_parents(name, attrs, string, limit, **kwargs)

Signature: find_parent(name, attrs, string, **kwargs)

I spent a lot of time above covering find_all() and find(). The Beautiful Soup API defines ten other methods for searching the tree, but don't be afraid. Five of these methods are basically the same as find_all(), and the other five are basically the same as find(). The only differences are in what parts of the tree they search.

First let's consider find_parents() and find_parent(). Remember that find_all() and find() work their way down the tree, looking at tags' descendants. These methods do the opposite: they work their way up the tree, looking at a tag's (or a string's) parents. Let's try them out, starting from a string buried deep in the "three daughters" document:
a_string=soup.find(string="Lacie")
a_string
#u'Lacie'
a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
a_string.find_parents("p", class_="title")
# []
One of the three <a> tags is the direct parent of the string in question, so our search finds it.
One of the three <p> tags is an indirect parent of the string, and our search finds that as well.
find_next_siblings()
and
find_next_sibling()
Signature: find_next_siblings(name, attrs, string, limit, **kwargs)
Signature: find_next_sibling(name, attrs, string, **kwargs)
These methods use .next_siblings to iterate over the rest of an element's siblings in the tree.
The find_next_siblings() method returns all the siblings that match, and find_next_sibling()
only returns the first one:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>
find_previous_siblings()
and
find_previous_sibling()
Signature: find_previous_siblings(name, attrs, string, limit, **kwargs)
Signature: find_previous_sibling(name, attrs, string, **kwargs)
These methods use .previous_siblings to iterate over an element's siblings that precede it in
the tree. The find_previous_siblings() method returns all the siblings that match, and
find_previous_sibling() only returns the first one:
last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>
find_all_next()
and
find_next()
Signature: find_all_next(name, attrs, string, limit, **kwargs)
Signature: find_next(name, attrs, string, **kwargs)
These methods use .next_elements to iterate over whatever tags and strings come
after an element in the document. The find_all_next() method returns all matches, and find_next()
only returns the first match:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_next(string=True)
# [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
first_link.find_next("p")
# <p class="story">...</p>
In the first example, the string "Elsie" showed up, even though it was contained within the
<a> tag we started from. In the second example, the last <p> tag in the document showed
up, even though it's not in the same part of the tree as the <a> tag we started from. For
these methods, all that matters is that an element match the filter and show up later in the
document than the starting element.
find_all_previous()
and
find_previous()
Signature: find_all_previous(name, attrs, string, limit, **kwargs)
Signature: find_previous(name, attrs, string, **kwargs)
These methods use .previous_elements to iterate over the tags and strings that came
before an element in the document. The find_all_previous() method returns all matches, and
find_previous() only returns the first match:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]
first_link.find_previous("title")
# <title>The Dormouse's story</title>
The call to find_all_previous("p") found the first paragraph in the document (the one with
class="title"), but it also finds the second paragraph, the <p> tag that contains the <a> tag
we started with. This shouldn't be too surprising: we're looking at all the tags that show up
earlier in the document than the one we started with. A <p> tag that contains an <a> tag
must have shown up before the <a> tag it contains.
CSS selectors
Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the
.select() method of a Tag object or the BeautifulSoup object itself.
You can find tags:
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
Find tags beneath other tags:
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
Find tags directly beneath other tags:
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
Find the siblings of tags:
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Find tags by CSS class:
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Find tags by ID:
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Find tags that match any selector from a list of selectors:
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Test for the existence of an attribute:
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Find tags by attribute value:
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Match language codes:
multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]
Find only the first tag that matches a selector:
soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
This is all a convenience for users who know the CSS selector syntax. You can do all this
stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well
use lxml directly: it's a lot faster, and it supports more CSS selectors. But this lets you
combine simple CSS selectors with the Beautiful Soup API.
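For instance (a small sketch with minimal markup of my own), you can locate a tag with select_one() and then continue with the regular tree-navigation API:

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a id="link1" href="http://example.com/elsie">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Narrow down with a CSS selector...
link = soup.select_one("p.story a#link1")

# ...then keep going with the ordinary Beautiful Soup API.
link.get_text()
# 'Elsie'
link.parent.name
# 'p'
```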
Modifying the tree
.string
If you set a tag's .string attribute, the tag's contents are replaced with that string.
Be careful: if the tag contained other tags, they and all their contents will be destroyed.
append()
You can add to a tag's contents with Tag.append(). It works just like calling .append() on a
Python list:
soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")
soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']
NavigableString()
and
.new_tag()
If you need to add a string to a document, no problem: you can pass a Python string into
append(), or you can call the NavigableString constructor:
from bs4 import NavigableString
soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# [u'Hello', u' there']
If you want to create a comment or some other subclass of NavigableString, just call the
constructor:
from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']
(This is a new feature in Beautiful Soup 4.4.0.)
What if you need to create a whole new tag? The best solution is to call the factory method
BeautifulSoup.new_tag():
soup = BeautifulSoup("<b></b>")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
Only the first argument, the tag name, is required.
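One wrinkle worth sketching (my own example, not from the original text): an attribute named class can't be written as a literal keyword argument, because class is a reserved word in Python. You can pass it to new_tag() by unpacking a dictionary instead:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# class is a Python keyword, so unpack it from a dict rather than
# writing new_tag("a", class="sister"), which is a SyntaxError.
new_tag = soup.new_tag("a", href="http://www.example.com", **{"class": "sister"})
soup.b.append(new_tag)
soup.b
# <b><a class="sister" href="http://www.example.com"></a></b>
```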
insert()
Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the
end of its parent's .contents. It'll be inserted at whatever numeric position you say, just like
.insert() on a Python list.
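A short sketch of the numbered-position behavior (the position argument is an index into the tag's .contents):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Insert at index 1 of .contents, between the string
# "I linked to " and the <i> tag.
tag.insert(1, "but did not endorse ")
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
```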
insert_before()
and
insert_after()
The insert_before() method places a tag or string immediately before something else in the
parse tree, and the insert_after() method moves a tag or string so that it immediately follows
something else in the parse tree:
soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']
clear()
Tag.clear() removes the contents of a tag:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.clear()
tag
# <a href="http://example.com/"></a>
extract()
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that
was extracted:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
i_tag = soup.i.extract()
a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>
print(i_tag.parent)
# None
decompose()
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>
replace_with()
PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag
or string of your choice:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
replace_with() returns the tag or string that was replaced, so that you can examine it or add
it back to another part of the tree.
wrap()
PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:
soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
This method is new in Beautiful Soup 4.0.5.
unwrap()
Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever's inside that tag. It's
good for stripping out markup:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>
Output
Pretty-printing
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode
string, with each HTML/XML tag on its own line:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
Non-pretty printing
If you just want a string, with no fancy formatting, you can call unicode() or str() on a
BeautifulSoup object, or a Tag within it:
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
Output formatters
If you give Beautiful Soup a document that contains HTML entities like "&ldquo;", they'll be
converted to Unicode characters:
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
If you then convert the document to a string, the Unicode characters will be encoded as
UTF-8. You won't get the HTML entities back:
str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
By default, the only characters that are escaped upon output are bare ampersands and
angle brackets. These get turned into "&amp;", "&lt;", and "&gt;", so that Beautiful Soup
doesn't inadvertently generate invalid HTML or XML:
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
You can change this behavior by providing a value for the formatter argument. The default
is formatter="minimal", in which strings are processed just enough to ensure that Beautiful
Soup generates valid HTML/XML:
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML
entities whenever possible:
print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
get_text()
If you only want the text part of a document or tag, you can use the get_text() method. It
returns all the text in a document or beneath a tag, as a single Unicode string:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
You can specify a string to be used to join the bits of text together:
soup.get_text("|")
u'\nI linked to|example.com|\n'
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of
text:
soup.get_text("|", strip=True)
u'I linked to|example.com'
But at that point you might want to use the .stripped_strings generator instead, and process
the text yourself:
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
Here's a short document, parsed as HTML:
BeautifulSoup("<a><b/></a>")
# <html><head></head><body><a><b></b></a></body></html>
Since an empty <b/> tag is not valid HTML, the parser turns it into a <b></b> tag pair.
Here's the same document parsed as XML (running this requires that you have lxml
installed). Note that the empty <b/> tag is left alone, and that the document is given an XML
declaration instead of being put into an <html> tag:
BeautifulSoup("<a><b/></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-
formed HTML document, these differences won't matter. One parser will be faster than
another, but they'll all give you a data structure that looks exactly like the original HTML
document.
But if the document is not perfectly-formed, different parsers will give different results.
Here's a short, invalid document parsed using lxml's HTML parser. Note that the dangling
</p> tag is simply ignored:
BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
Here's the same document parsed using html5lib:
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. This
parser also adds an empty <head> tag to the document.
Here's the same document parsed with Python's built-in HTML parser:
BeautifulSoup("<a></p>", "html.parser")
# <a></a>
Like html5lib, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no
attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it
doesn't even bother to add an <html> tag.
Since the document "<a></p>" is invalid, none of these techniques is the "correct" way to
handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it
has the best claim on being the "correct" way, but all three techniques are legitimate.
Differences between parsers can affect your script. If you're planning on distributing your
script to other people, or running it on multiple machines, you should specify a parser in the
BeautifulSoup constructor. That will reduce the chances that your users parse a document
differently from the way you parse it.
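A minimal sketch of this, reusing the invalid "<a></p>" document from above: pinning the parser makes the result predictable on every machine, regardless of which optional parsers happen to be installed.

```python
from bs4 import BeautifulSoup

# Naming the parser explicitly means every machine builds the
# same tree, even if some users have lxml or html5lib installed.
soup = BeautifulSoup("<a></p>", "html.parser")
str(soup)
# '<a></a>'
```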
Encodings
Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when
you load that document into Beautiful Soup, you'll discover it's been converted to Unicode:
markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# u'Sacr\xe9 bleu!'
It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called Unicode,
Dammit to detect a document's encoding and convert it to Unicode. The autodetected
encoding is available as the .original_encoding attribute of the BeautifulSoup object:
soup.original_encoding
# 'utf-8'
Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes.
Sometimes it guesses correctly, but only after a byte-by-byte search of the document that
takes a very long time. If you happen to know a document's encoding ahead of time, you
can avoid mistakes and delays by passing it to the BeautifulSoup constructor as
from_encoding.
Here's a document written in ISO-8859-8. The document is so short that Unicode, Dammit
can't get a good lock on it, and misidentifies it as ISO-8859-7:
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>νεμω</h1>
soup.original_encoding
# 'ISO-8859-7'
We can fix this by passing in the correct from_encoding:
soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'iso-8859-8'
If you don't know what the correct encoding is, but you know that Unicode, Dammit is
guessing wrong, you can pass the wrong guesses in as exclude_encodings:
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'WINDOWS-1255'
Windows-1255 isn't 100% correct, but that encoding is a compatible superset of ISO-8859-8,
so it's close enough. (exclude_encodings is a new feature in Beautiful Soup 4.4.0.)
In rare cases (usually when a UTF-8 document contains text written in a completely different
encoding), the only way to get Unicode may be to replace some characters with the special
Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit
needs to do this, it will set the .contains_replacement_characters attribute to True on the
UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is
not an exact representation of the original; some data was lost. If a document contains �,
but .contains_replacement_characters is False, you'll know that the � was there originally (as it
is in this paragraph) and doesn't stand in for missing data.
Output encoding
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the
document wasn't in UTF-8 to begin with. Here's a document written in the Latin-1 encoding:
markup = b'''
<html>
 <head>
  <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type"/>
 </head>
 <body>
  <p>Sacr\xe9 bleu!</p>
 </body>
</html>
'''
soup = BeautifulSoup(markup)
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>
Note that the <meta> tag has been rewritten to reflect the fact that the document is now in
UTF-8.
If you don't want UTF-8, you can pass an encoding into prettify():
print(soup.prettify("latin-1"))
# <html>
#  <head>
#   <meta content="text/html; charset=latin-1" http-equiv="Content-type"/>
# ...
Any characters that can't be represented in your chosen encoding will be converted into
numeric XML entity references. Here's a document that includes the Unicode character
SNOWMAN:
markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b
The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there's no
representation for that character in ISO-Latin-1 or ASCII, so it's converted into "&#9731;" for
those encodings:
print(tag.encode("utf-8"))
# <b>☃</b>
print(tag.encode("latin-1"))
# <b>&#9731;</b>
print(tag.encode("ascii"))
# <b>&#9731;</b>
Unicode, Dammit
You can use Unicode, Dammit without using Beautiful Soup. It's useful whenever you have
data in an unknown encoding and you just want it to become Unicode:
from bs4 import UnicodeDammit
dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'
Unicode, Dammit has two special features that Beautiful Soup doesn't use.
Smart quotes
You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
# u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
# u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
You can also convert Microsoft smart quotes to ASCII quotes:
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
Hopefully you'll find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup
prefers the default behavior, which is to convert Microsoft smart quotes to Unicode
characters along with everything else:
UnicodeDammit(markup, ["windows-1252"]).unicode_markup
# u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
Inconsistent encodings
Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as
(again) Microsoft smart quotes. This can happen when a website includes data from multiple
sources. You can use UnicodeDammit.detwingle() to turn such a document into pure UTF-8.
Here's a simple example:
snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")
This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252.
You can display the snowmen or the quotes, but not both:
print(doc)
# ☃☃☃“I like snowmen!”
print(doc.decode("windows-1252"))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”
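Running the mixed document through UnicodeDammit.detwingle() converts the Windows-1252 stretches so the whole document becomes valid UTF-8 (a sketch continuing the example above):

```python
from bs4 import UnicodeDammit

snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() sniffs out the Windows-1252 byte sequences and
# re-encodes them, leaving a document that is pure UTF-8.
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”
```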
UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or
vice versa, I suppose), but this is the most common case.
Note that you must know to call UnicodeDammit.detwingle() on your data before passing it into
BeautifulSoup or the UnicodeDammit constructor. Beautiful Soup assumes that a document has
a single encoding, whatever it might be. If you pass it a document that contains both UTF-8
and Windows-1252, it's likely to think the whole document is Windows-1252, and the
document will come out looking like â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.
UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.
If you want to see whether two variables refer to exactly the same object, use is:
print(first_b is second_b)
#False
The copy is considered equal to the original, since it represents the same markup as the
original, but it's not the same object:
print(soup.p == p_copy)
# True
print(soup.p is p_copy)
# False
The only real difference is that the copy is completely detached from the original Beautiful
Soup object tree, just as if extract() had been called on it:
print(p_copy.parent)
# None
SoupStrainer
The SoupStrainer class takes the same arguments as a typical method from Searching the
tree: name, attrs, string, and **kwargs. Here are three SoupStrainer objects:
from bs4 import SoupStrainer
only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")
def is_short_string(string):
    return len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
I'm going to bring back the "three sisters" document one more time, and we'll see what the
document looks like when it's parsed with these three SoupStrainer objects:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#
You can also pass a SoupStrainer into any of the methods covered in Searching the tree.
This probably isn't terribly useful, but I thought I'd mention it:
soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u'\n\n', u'...', u'\n']
Troubleshooting
diagnose()
If you're having trouble understanding what Beautiful Soup does to a document, pass the
document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will
print out a report showing you how different parsers handle the document, and tell you if
you're missing a parser that Beautiful Soup could be using:
from bs4.diagnose import diagnose
data = open("bad.html").read()
diagnose(data)
# Diagnostic running on Beautiful Soup 4.2.0
# Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
# I noticed that html5lib is not installed. Installing it may help.
# Found lxml version 2.3.2.0
#
# Trying to parse your data with html.parser
# Here's what html.parser did with the document:
# ...
Just looking at the output of diagnose() may show you how to solve the problem. Even if
not, you can paste the output of diagnose() when asking for help.
Common parse-time errors are HTMLParser.HTMLParseError: malformed start tag and
HTMLParser.HTMLParseError: bad end tag. These are both generated by Python's built-in HTML
parser library, and the solution is to install lxml or html5lib.
The most common type of unexpected behavior is that you can't find a tag that you know is
in the document. You saw it going in, but find_all() returns [] or find() returns None. This is
another common problem with Python's built-in HTML parser, which sometimes skips tags it
doesn't understand. Again, the solution is to install lxml or html5lib.
Parsing XML
By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass
in "xml" as the second argument to the BeautifulSoup constructor:
soup = BeautifulSoup(markup, "xml")
You'll need to have lxml installed.
Because HTML parsers treat tag and attribute names as case-insensitive, markup like
<TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and
attributes, you'll need to parse the document as XML.
Miscellaneous
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or
just about any other UnicodeEncodeError): This is not a problem with Beautiful Soup.
This problem shows up in two main situations. First, when you try to print a Unicode
character that your console doesn't know how to display. (See this page on the Python
wiki for help.) Second, when you're writing to a file and you pass in a Unicode
character that's not supported by your default encoding. In this case, the simplest
solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
KeyError: [attr]: Caused by accessing tag['attr'] when the tag in question doesn't
define the attr attribute. The most common errors are KeyError: 'href' and KeyError:
'class'. Use tag.get('attr') if you're not sure attr is defined, just as you would with a
Python dictionary.
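A minimal sketch of the safe pattern (my own example):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="http://example.com/">A link</a>', "html.parser").a

tag.get("href")          # the attribute exists
# 'http://example.com/'
print(tag.get("title"))  # the attribute is missing: None, not a KeyError
# None
```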
AttributeError: 'ResultSet' object has no attribute 'foo': This usually happens
because you expected find_all() to return a single tag or string. But find_all() returns
a _list_ of tags and strings, a ResultSet object. You need to iterate over the list and
look at the .foo of each one. Or, if you really only want one result, you need to use
find() instead of find_all().
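For instance (a small sketch), iterate over the ResultSet instead of asking the ResultSet itself for an attribute:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/one">one</a>'
    '<a href="http://example.com/two">two</a>', "html.parser")

# soup.find_all("a").get("href") would raise AttributeError;
# look at each tag in the list instead.
hrefs = [a.get("href") for a in soup.find_all("a")]
hrefs
# ['http://example.com/one', 'http://example.com/two']
```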
AttributeError: 'NoneType' object has no attribute 'foo': This usually happens because
you called find() and then tried to access the .foo attribute of the result. But in your
case, find() didn't find anything, so it returned None instead of returning a tag or a
string. You need to figure out why your find() call isn't returning anything.
Improving Performance
Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you're paying for computer time by the hour, or if there's any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.
That said, there are things you can do to speed up Beautiful Soup. If you're not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.
You can speed up encoding detection significantly by installing the cchardet library.
Parsing only part of a document won't save you much time parsing the document, but it can save a lot of memory, and it'll make searching the document much faster.
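Parsing only part of a document is done with a SoupStrainer passed as parse_only; a minimal sketch:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Only <a> tags are parsed into the tree; everything else is discarded.
only_a_tags = SoupStrainer('a')
soup = BeautifulSoup(
    '<p>ignored</p><a href="link.html">kept</a>',
    'html.parser', parse_only=only_a_tags)
```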
Beautiful Soup 3
Beautiful Soup 3 is the previous release series, and is no longer being actively developed. It's currently packaged with all major Linux distributions:
$ apt-get install python-beautifulsoup
It's also published through PyPi as BeautifulSoup:
$ easy_install BeautifulSoup
$ pip install BeautifulSoup
You can also download a tarball of Beautiful Soup 3.2.0.
If you ran easy_install beautifulsoup or easy_install BeautifulSoup, but your code doesn't work, you installed Beautiful Soup 3 by mistake. You need to run easy_install beautifulsoup4.
The documentation for Beautiful Soup 3 is archived online.
The first step in porting your code is to change the package import. This:

from BeautifulSoup import BeautifulSoup

becomes this:

from bs4 import BeautifulSoup
If you get the ImportError "No module named BeautifulSoup", your problem is that you're trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed.
If you get the ImportError "No module named bs4", your problem is that you're trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.
Although BS4 is mostly backwards-compatible with BS3, most of its methods have been deprecated and given new names for PEP 8 compliance. There are numerous other renames and changes, and a few of them break backwards compatibility.
Here's what you'll need to know to convert your BS3 code and habits to BS4:
You need a parser. Beautiful Soup 3 used Python's SGMLParser, a module that was deprecated and removed in Python 3.0. Beautiful Soup 4 uses html.parser by default, but you can plug in lxml or html5lib and use that instead. See Installing a parser for a comparison.
Since html.parser is not the same parser as SGMLParser, you may find that Beautiful Soup 4 gives you a different parse tree than Beautiful Soup 3 for the same markup. If you swap out html.parser for lxml or html5lib, you may find that the parse tree changes yet again. If this happens, you'll need to update your scraping code to deal with the new tree.
Method names
renderContents -> encode_contents
replaceWith -> replace_with
replaceWithChildren -> unwrap
findAll -> find_all
findAllNext -> find_all_next
findAllPrevious -> find_all_previous
findNext -> find_next
findNextSibling -> find_next_sibling
findNextSiblings -> find_next_siblings
findParent -> find_parent
findParents -> find_parents
findPrevious -> find_previous
findPreviousSibling -> find_previous_sibling
findPreviousSiblings -> find_previous_siblings
nextSibling -> next_sibling
previousSibling -> previous_sibling
Some arguments to the BeautifulSoup constructor were renamed for the same reasons:
BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
I renamed one method for compatibility with Python 3:
Tag.has_key() -> Tag.has_attr()
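For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/">link</a>', 'html.parser')
tag = soup.a
# Tag.has_key() is gone; use has_attr() instead.
has_href = tag.has_attr('href')
has_id = tag.has_attr('id')
```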
I renamed one attribute to use more accurate terminology:
Tag.isSelfClosing -> Tag.is_empty_element
I renamed three attributes to avoid using words that have special meaning to Python. Unlike the others, these changes are not backwards compatible. If you used these attributes in BS3, your code will break on BS4 until you change them.
UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
Tag.next -> Tag.next_element
Tag.previous -> Tag.previous_element
Generators
I gave the generators PEP 8-compliant names, and transformed them into properties:
childGenerator() -> children
nextGenerator() -> next_elements
nextSiblingGenerator() -> next_siblings
previousGenerator() -> previous_elements
previousSiblingGenerator() -> previous_siblings
recursiveChildGenerator() -> descendants
parentGenerator() -> parents
So instead of this:

for parent in tag.parentGenerator():
    ...
You can write this:

for parent in tag.parents:
    ...
(But the old code will still work.)
Some of the generators used to yield None after they were done, and then stop. That was a bug. Now the generators just stop.
There are two new generators, .strings and .stripped_strings. .strings yields NavigableString objects, and .stripped_strings yields Python strings that have had whitespace stripped.
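A quick illustration of the difference between the two:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p> One </p><p> Two </p>', 'html.parser')
raw = list(soup.strings)             # NavigableStrings, whitespace intact
clean = list(soup.stripped_strings)  # plain strings, whitespace stripped
```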
XML
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in "xml" as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
Beautiful Soup's handling of empty-element XML tags has been improved. Previously when you parsed XML you had to explicitly say which tags were considered empty-element tags. The selfClosingTags argument to the constructor is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.
Entities
An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed. The BeautifulSoup constructor no longer recognizes the smartQuotesTo or convertEntities arguments. (Unicode, Dammit still has smart_quotes_to, but its default is now to turn smart quotes into Unicode.) The constants HTML_ENTITIES, XML_ENTITIES, and XHTML_ENTITIES have been removed, since they configure a feature (transforming some but not all entities into Unicode characters) that no longer exists.
If you want to turn Unicode characters back into HTML entities on output, rather than turning them into UTF-8 characters, you need to use an output formatter.
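For instance, the "html" formatter substitutes named HTML entities on output:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(u'<p>caf\u00e9</p>', 'html.parser')
default_out = soup.decode()                  # Unicode character preserved
entity_out = soup.decode(formatter="html")   # character becomes &eacute;
```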
Miscellaneous
Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. (Previously, it was None.)
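For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>hello</b></p>', 'html.parser')
# <p> contains only <b>, so p.string recurses into it.
result = soup.p.string
```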
Multi-valued attributes like class have lists of strings as their values, not strings. This may affect the way you search by CSS class.
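For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
classes = soup.p['class']   # a list, not the string "body strikeout"
```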
If you pass one of the find* methods both string and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for string. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.
The BeautifulSoup constructor no longer recognizes the markupMassage argument. It's now the parser's responsibility to handle markup correctly.
The rarely-used alternate parser classes like ICantBelieveItsBeautifulSoup and BeautifulSOAP have been removed. It's now the parser's decision how to handle ambiguous markup.
The prettify() method now returns a Unicode string, not a bytestring.