
7/14/2015

Beautiful Soup 4.4.0 documentation

Beautiful Soup Documentation


Beautiful Soup is a Python library for pulling data out of
HTML and XML files. It works with your favorite parser to
provide idiomatic ways of navigating, searching, and
modifying the parse tree. It commonly saves programmers
hours or days of work.
These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.
The examples in this documentation should work the same way in Python 2.7 and Python 3.2.
You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed, and that Beautiful Soup 4 is recommended for all new projects. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.
This documentation has been translated into other languages by Beautiful Soup users.

Getting help
If you have questions about Beautiful Soup, or run into problems, send mail to the
discussion group. If your problem involves parsing an HTML document, be sure to mention
what the diagnose() function says about that document.
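For reference, diagnose() lives in the bs4.diagnose module and takes the markup as its only argument. A minimal sketch of invoking it (the exact report text varies with which parsers you have installed):

```python
from bs4.diagnose import diagnose

# diagnose() prints a report showing how each installed parser handles
# the document; paste that output into your message to the group.
data = "<p>Some bad <a href='oops'>markup"
diagnose(data)
```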

Quick Start
Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Another common task is extracting all the text from a page:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup


If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4

Beautiful Soup 4 is published through PyPi, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4,
and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

(The BeautifulSoup package is probably not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code you should install beautifulsoup4.)
If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.
I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

Problems after installation


Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it's automatically converted to Python 3 code. If you don't install the package, the code won't be converted. There have also been reports on Windows machines of the wrong version being installed.
If you get the ImportError "No module named HTMLParser", your problem is that you're running the Python 2 version of the code under Python 3.
If you get the ImportError "No module named html.parser", your problem is that you're running the Python 3 version of the code under Python 2.
In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.
If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

$ python3 setup.py install

or by manually running Python's 2to3 conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4
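Once the installation (or conversion) succeeds, a quick sanity check is to import the library and parse something trivial; if either ImportError above still applies, the import line is where it will fail. (This check is my own suggestion, not part of the official instructions.)

```python
from bs4 import BeautifulSoup

# If the wrong version of the code is installed, the import above is
# what fails; a successful parse confirms the install is usable.
soup = BeautifulSoup("<b>ok</b>", "html.parser")
print(soup.b.string)  # ok
```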
Installing a parser
Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

Parser: Python's html.parser
  Typical usage:  BeautifulSoup(markup, "html.parser")
  Advantages:     Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)
  Disadvantages:  Not very lenient (before Python 2.7.3 or 3.2.2)

Parser: lxml's HTML parser
  Typical usage:  BeautifulSoup(markup, "lxml")
  Advantages:     Very fast; lenient
  Disadvantages:  External C dependency

Parser: lxml's XML parser
  Typical usage:  BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
  Advantages:     Very fast; the only currently supported XML parser
  Disadvantages:  External C dependency

Parser: html5lib
  Typical usage:  BeautifulSoup(markup, "html5lib")
  Advantages:     Extremely lenient; parses pages the same way a web browser does; creates valid HTML5
  Disadvantages:  Very slow; external Python dependency
If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib: Python's built-in HTML parser is just not very good in older versions.
Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.

Making the soup


To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
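Because the "best available parser" depends on what happens to be installed, results can differ between machines. Naming the parser explicitly as the second argument makes the choice deterministic; a minimal sketch using the standard-library parser:

```python
from bs4 import BeautifulSoup

# Passing the parser name as the second argument pins the choice,
# instead of letting Beautiful Soup pick whichever library it finds.
soup = BeautifulSoup("<html>data</html>", "html.parser")
print(soup.get_text())  # data
```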

Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Tag
A Tag object corresponds to an XML or HTML tag in the original document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Name
Every tag has a name, accessible as .name:

tag.name
# u'b'

If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

Attributes
A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

tag['class']
# u'boldest'

You can access that dictionary directly as .attrs:

tag.attrs
# {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
Multivalued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'

NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a
NavigableString to a Unicode string with unicode():

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>

You can't edit a string in place, but you can replace one string with another, using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.
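On Python 3, where unicode() does not exist, the equivalent conversion is str(); here is a short sketch (assuming Beautiful Soup 4 under Python 3):

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
navigable = soup.b.string   # NavigableString: keeps a reference to the tree
plain = str(navigable)      # plain str: safe to keep after discarding the soup

print(type(plain).__name__)  # str
```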

BeautifulSoup
The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
Since the BeautifulSoup object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its .name, so it's been given the special .name "[document]":

soup.name
# u'[document]'

Comments and other special strings


Tag, NavigableString, and BeautifulSoup cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The only one you'll probably ever need to worry about is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

The Comment object is just a special type of NavigableString:

comment
# u'Hey, buddy. Want to buy a used parser'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>

Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here's an example that replaces the comment with a CDATA block:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>

Navigating the tree


Here's the "Three sisters" HTML document again:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

I'll use this as an example to show you how to move from one part of a document to another.

Going down
Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.
Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.

Navigating using tag names


The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say soup.head:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:

soup.body.b
# <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag's children are available in a list called .contents:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

len(soup.contents)
# 1
soup.contents[0].name
# u'html'

A string does not have .contents, because it can't contain anything:

text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag's children using the .children generator:

for child in title_tag.children:
    print(child)
# The Dormouse's story

.descendants
The .contents and .children attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child: the <title> tag:

head_tag.contents
# [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". There's a sense in which that string is also a child of the <head> tag. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on:

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25

.string
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

head_tag.contents
# [<title>The Dormouse's story</title>]
head_tag.string
# u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None

.strings and .stripped_strings

If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'

Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
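When you just want the concatenated text rather than the individual strings, the get_text() method seen in the Quick Start accepts a separator and a strip flag that perform the same whitespace cleanup; a short sketch (the sample markup is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once upon a time there were\n <b>three</b>\n sisters</p>",
                     "html.parser")
# strip=True trims whitespace from each string, like .stripped_strings;
# the first argument joins the surviving pieces.
print(soup.get_text(" ", strip=True))  # Once upon a time there were three sisters
```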

Going up
Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

.parent
You can access an element's parent with the .parent attribute. In the example "three sisters" document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:

title_tag.string.parent
# <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is defined as None:

print(soup.parent)
# None

.parents
You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
# None

Going sideways
Consider a simple document like this:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

The <b> tag and the <c> tag are at the same level: they're both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>

The <b> tag has a .next_sibling, but no .previous_sibling, because there's nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None

The strings "text1" and "text2" are not siblings, because they don't have the same parent:

sibling_soup.b.string
# u'text1'
print(sibling_soup.b.string.next_sibling)
# None

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the "three sisters" document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it's a string: the comma and newline that separate the first <a> tag from the second:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
# u',\n'

The second <a> tag is actually the .next_sibling of the comma:

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings and .previous_siblings

You can iterate over a tag's siblings with .next_siblings or .previous_siblings:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# u';\nand they lived at the bottom of a well.'
# None

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# u'Once upon a time there were three little sisters; and their names were\n'
# None

Going back and forth

Take a look at the beginning of the "three sisters" document:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it's usually drastically different.
Here's the final <a> tag in the "three sisters" document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_a_tag.next_sibling
# '; and they lived at the bottom of a well.'

But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it's the word "Tillie":

last_a_tag.next_element
# u'Tillie'

That's because in the original markup, the word "Tillie" appeared before that semicolon. The parser encountered an <a> tag, then the word "Tillie", then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word "Tillie" was encountered first.
The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None

Searching the tree


Beautiful Soup defines a lot of methods for searching the parse tree, but they're all very similar. I'm going to spend a lot of time explaining the two most popular methods: find() and find_all(). The other methods take almost exactly the same arguments, so I'll just cover them briefly.
Once again, I'll be using the "three sisters" document as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

By passing in a filter to an argument like find_all(), you can zoom in on the parts of the document you're interested in.

Kinds of filters
Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.

A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

A regular expression
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its match() method. This code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

This code finds all the tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

A list
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True
The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

A function
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.
Here's a function that returns True if a tag defines the "class" attribute but doesn't define the "id" attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into find_all() and you'll pick up all the <p> tags:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a> tags, because those tags define both "class" and "id". It doesn't pick up tags like <html> and <title>, because those tags don't define "class".
If you pass in a function to filter on a specific attribute like href, the argument passed into the function will be the attribute value, not the whole tag. Here's a function that finds all a tags whose href attribute does not match a regular expression:

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The function can be as complicated as you need it to be. Here's a function that returns True if a tag is surrounded by string objects:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print tag.name
# p
# a
# a
# a
# p

Now we're ready to look at the search methods in detail.

find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a
value for string, or id? Why does find_all("p", "title") find a <p> tag with the CSS class
"title"? Let's look at the arguments to find_all().

The name argument

Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain
names. Text strings will be ignored, as will tags whose names don't match.
This is the simplest usage:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

Recall from Kinds of filters that the value to name can be a string, a regular expression, a list,
a function, or the value True.

The keyword arguments

Any argument that's not recognized will be turned into a filter on one of a tag's attributes. If
you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id'
attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag's 'href' attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on a string, a regular expression, a list, a function, or the
value True.
This code finds all tags whose id attribute has a value, regardless of what the value is:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML5, have names that can't be used as the
names of keyword arguments:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a dictionary and passing the
dictionary into find_all() as the attrs argument:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
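A related gotcha worth a quick sketch (the <input> markup below is invented for illustration, and html.parser is assumed only so the snippet is self-contained): HTML's own "name" attribute can't be searched with the name keyword argument either, because find_all() uses name for the tag name. The attrs dictionary works here too:

```python
from bs4 import BeautifulSoup

name_soup = BeautifulSoup('<input name="email"/>', "html.parser")

# The name keyword matches tag names, so this looks for <email> tags:
print(name_soup.find_all(name="email"))
# []

# Passing the attribute through attrs searches the "name" attribute instead:
print(name_soup.find_all(attrs={"name": "email"}))
# [<input name="email"/>]
```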

Searching by CSS class

It's very useful to search for a tag that has a certain CSS class, but the name of the CSS
attribute, "class", is a reserved word in Python. Using class as a keyword argument will give
you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the
keyword argument class_:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass class_ a string, a regular expression, a
function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Remember that a single tag can have multiple values for its "class" attribute. When you
search for a tag that matches a certain CSS class, you're matching against any of its CSS
classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')

css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

But searching for variants of the string value won't work:

css_soup.find_all("p", class_="strikeout body")
# []

If you want to search for tags that match two or more CSS classes, you should use a CSS
selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

In older versions of Beautiful Soup, which don't have the class_ shortcut, you can use the
attrs trick mentioned above. Create a dictionary whose value for "class" is the string (or
regular expression, or whatever) you want to search for:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The string argument

With string you can search for strings instead of tags. As with name and the keyword
arguments, you can pass in a string, a regular expression, a list, a function, or the value
True. Here are some examples:

soup.find_all(string="Elsie")
# [u'Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)

# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

Although string is for finding strings, you can combine it with arguments that find tags:
Beautiful Soup will find all tags whose .string matches your value for string. This code finds
the <a> tags whose .string is "Elsie":

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text:

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The limit argument

find_all() returns all the tags and strings that match your filters. This can take a while if the
document is large. If you don't need all the results, you can pass in a number for limit. This
works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results
after it's found a certain number.
There are three links in the "three sisters" document, but this code only finds the first two:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its
children, its children's children, and so on. If you only want Beautiful Soup to consider direct
children, you can pass in recursive=False. See the difference here:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

Here's that part of the document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>

...

The <title> tag is beneath the <html> tag, but it's not directly beneath the <html> tag: the
<head> tag is in the way. Beautiful Soup finds the <title> tag when it's allowed to look at all
descendants of the <html> tag, but when recursive=False restricts it to the <html> tag's
immediate children, it finds nothing.
Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take
the same arguments as find_all(): name, attrs, string, limit, and the keyword arguments.
But the recursive argument is different: find_all() and find() are the only methods that
support it. Passing recursive=False into a method like find_parents() wouldn't be very useful.
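To make the distinction concrete, here's a small sketch (the markup is invented for illustration, and html.parser is assumed as the parser): with recursive=False, a search from <body> sees its direct children but not tags nested inside them:

```python
from bs4 import BeautifulSoup

# <b> is a direct child of <body>; <i> is nested inside a <p>.
soup = BeautifulSoup("<body><b>bold</b><p><i>italic</i></p></body>", "html.parser")

print(soup.body.find_all("b", recursive=False))  # direct child: found
# [<b>bold</b>]
print(soup.body.find_all("i", recursive=False))  # nested inside <p>: not found
# []
print(soup.body.find_all("i"))                   # default search finds it
# [<i>italic</i>]
```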

Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can
use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a
function, then it's the same as calling find_all() on that object. These two lines of code are
equivalent:

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)

find()
Signature: find(name, attrs, recursive, string, **kwargs)
The find_all() method scans the entire document looking for results, but sometimes you
only want to find one result. If you know a document only has one <body> tag, it's a waste
of time to scan the entire document looking for more. Rather than passing in limit=1 every
time you call find_all(), you can use the find() method. These two lines of code are nearly
equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, and find()
just returns the result.

If find_all() can't find anything, it returns an empty list. If find() can't find anything, it returns
None:

print(soup.find("nosuchtag"))
# None
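This difference matters when you chain calls: calling another method on find()'s None result raises an AttributeError instead of quietly returning another empty result. A minimal sketch (the tag names are invented; html.parser is assumed so the example is self-contained):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>text</p></body></html>", "html.parser")

# find() returned None, so chaining a second find() fails loudly:
try:
    soup.find("nosuchtag").find("title")
except AttributeError as e:
    print("chained call failed:", e)

# Checking the intermediate result avoids the error:
tag = soup.find("nosuchtag")
title = tag.find("title") if tag is not None else None
print(title)
# None
```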

Remember the soup.head.title trick from Navigating using tag names? That trick works by
repeatedly calling find():

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>

find_parents() and find_parent()

Signature: find_parents(name, attrs, string, limit, **kwargs)
Signature: find_parent(name, attrs, string, **kwargs)
I spent a lot of time above covering find_all() and find(). The Beautiful Soup API defines
ten other methods for searching the tree, but don't be afraid. Five of these methods are
basically the same as find_all(), and the other five are basically the same as find(). The
only differences are in what parts of the tree they search.
First let's consider find_parents() and find_parent(). Remember that find_all() and find()
work their way down the tree, looking at tags' descendants. These methods do the opposite:
they work their way up the tree, looking at a tag's (or a string's) parents. Let's try them out,
starting from a string buried deep in the "three sisters" document:
a_string = soup.find(string="Lacie")
a_string
# u'Lacie'

a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

a_string.find_parents("p", class_="title")
# []

One of the three <a> tags is the direct parent of the string in question, so our search finds it.
One of the three <p> tags is an indirect parent of the string, and our search finds that as

well. There's a <p> tag with the CSS class "title" somewhere in the document, but it's not
one of this string's parents, so we can't find it with find_parents().
You may have made the connection between find_parent() and find_parents(), and the
.parent and .parents attributes mentioned earlier. The connection is very strong. These
search methods actually use .parents to iterate over all the parents, and check each one
against the provided filter to see if it matches.

find_next_siblings() and find_next_sibling()

Signature: find_next_siblings(name, attrs, string, limit, **kwargs)
Signature: find_next_sibling(name, attrs, string, **kwargs)
These methods use .next_siblings to iterate over the rest of an element's siblings in the tree.
The find_next_siblings() method returns all the siblings that match, and find_next_sibling()
only returns the first one:

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>

find_previous_siblings() and find_previous_sibling()

Signature: find_previous_siblings(name, attrs, string, limit, **kwargs)
Signature: find_previous_sibling(name, attrs, string, **kwargs)
These methods use .previous_siblings to iterate over an element's siblings that precede it in
the tree. The find_previous_siblings() method returns all the siblings that match, and
find_previous_sibling() only returns the first one:

last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")

# <p class="title"><b>The Dormouse's story</b></p>

find_all_next() and find_next()

Signature: find_all_next(name, attrs, string, limit, **kwargs)
Signature: find_next(name, attrs, string, **kwargs)
These methods use .next_elements to iterate over whatever tags and strings come after
an element in the document. The find_all_next() method returns all matches, and find_next()
only returns the first match:

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_next(string=True)
# [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

first_link.find_next("p")
# <p class="story">...</p>

In the first example, the string "Elsie" showed up, even though it was contained within the
<a> tag we started from. In the second example, the last <p> tag in the document showed
up, even though it's not in the same part of the tree as the <a> tag we started from. For
these methods, all that matters is that an element match the filter, and show up later in the
document than the starting element.

find_all_previous() and find_previous()

Signature: find_all_previous(name, attrs, string, limit, **kwargs)
Signature: find_previous(name, attrs, string, **kwargs)
These methods use .previous_elements to iterate over the tags and strings that came
before it in the document. The find_all_previous() method returns all matches, and
find_previous() only returns the first match:

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]

first_link.find_previous("title")
# <title>The Dormouse's story</title>

The call to find_all_previous("p") found the first paragraph in the document (the one with
class="title"), but it also finds the second paragraph, the <p> tag that contains the <a> tag
we started with. This shouldn't be too surprising: we're looking at all the tags that show up
earlier in the document than the one we started with. A <p> tag that contains an <a> tag
must have shown up before the <a> tag it contains.

CSS selectors
Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the
.select() method of a Tag object or the BeautifulSoup object itself.
You can find tags:

soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags beneath other tags:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags directly beneath other tags:

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Find the siblings of tags:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags that match any selector from a list of selectors:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]


Match language codes:

multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]

Find only the first tag that matches a selector:

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

This is all a convenience for users who know the CSS selector syntax. You can do all this
stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well
use lxml directly: it's a lot faster, and it supports more CSS selectors. But this lets you
combine simple CSS selectors with the Beautiful Soup API.
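For instance, many select() calls have a direct find_all() equivalent. A small sketch (the markup is a trimmed-down version of the "three sisters" document; html.parser is assumed only so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# These two queries return the same tags:
by_selector = soup.select("p.title")
by_api = soup.find_all("p", class_="title")
print(by_selector == by_api)
# True
```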

Modifying the tree

Beautiful Soup's main strength is in searching the parse tree, but you can also modify the
tree and write your changes as a new HTML or XML document.

Changing tag names and attributes

I covered this earlier, in Attributes, but it bears repeating. You can rename a tag, change the
values of its attributes, add new attributes, and delete attributes:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

Modifying .string

If you set a tag's .string attribute, the tag's contents are replaced with the string you give:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)

tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their contents will be destroyed.

append()
You can add to a tag's contents with Tag.append(). It works just like calling .append() on a
Python list:

soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")

soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']

NavigableString() and .new_tag()

If you need to add a string to a document, no problem--you can pass a Python string into
append(), or you can call the NavigableString constructor:

soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# [u'Hello', u' there']

If you want to create a comment or some other subclass of NavigableString, just call the
constructor:

from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searchingthetree

33/56

7/14/2015

BeautifulSoupDocumentationBeautifulSoup4.4.0documentation

(This is a new feature in Beautiful Soup 4.4.0.)
What if you need to create a whole new tag? The best solution is to call the factory method
BeautifulSoup.new_tag():

soup = BeautifulSoup("<b></b>")
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.
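One wrinkle worth sketching here (the markup is invented, and html.parser is assumed): since class is a reserved word in Python, you can't pass it to new_tag() as a plain keyword argument, but you can unpack it from a dictionary:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body></body>", "html.parser")

# soup.new_tag("p", class="title") would be a SyntaxError, because
# class is a reserved word. Unpacking a dict sidesteps the problem:
new_tag = soup.new_tag("p", **{"class": "title"})
new_tag.string = "A new paragraph."
soup.body.append(new_tag)
print(soup.body)
# <body><p class="title">A new paragraph.</p></body>
```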

insert()
Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the
end of its parent's .contents. It'll be inserted at whatever numeric position you say. It works
just like .insert() on a Python list:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]

insert_before() and insert_after()

The insert_before() method inserts a tag or string immediately before something else in the
parse tree:

soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>

The insert_after() method moves a tag or string so that it immediately follows something
else in the parse tree:

soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']

clear()
Tag.clear() removes the contents of a tag:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.clear()
tag
# <a href="http://example.com/"></a>

extract()
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that
was extracted:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>

print(i_tag.parent)
# None

At this point you effectively have two parse trees: one rooted at the BeautifulSoup object you
used to parse the document, and one rooted at the tag that was extracted. You can go on to
call extract on a child of the element you extracted:

my_string = i_tag.string.extract()
my_string
# u'example.com'

print(my_string.parent)
# None
i_tag
# <i></i>


decompose()
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>

replace_with()
PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag
or string of your choice:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or string that was replaced, so that you can examine it or add
it back to another part of the tree.
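For example, the return value makes it easy to swap an element out and re-attach it elsewhere. A sketch with invented markup (html.parser is assumed so the snippet is self-contained):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>old</b></p><div></div>", "html.parser")

# Replace the <b> tag and keep a handle on what was removed...
old_tag = soup.b.replace_with(soup.new_string("new"))
print(soup.p)
# <p>new</p>

# ...then re-attach the replaced tag somewhere else in the tree.
soup.div.append(old_tag)
print(soup.div)
# <div><b>old</b></div>
```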

wrap()
PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:

soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

unwrap()
Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever's inside that tag. It's
good for stripping out markup:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was replaced.

Output
Pretty-printing
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode
string, with each HTML/XML tag on its own line:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:

print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>

Non-pretty printing
If you just want a string, with no fancy formatting, you can call unicode() or str() on a
BeautifulSoup object, or a Tag within it:

str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

The str() function returns a string encoded in UTF-8. See Encodings for other options.
You can also call encode() to get a bytestring, and decode() to get Unicode.

Output formatters
If you give Beautiful Soup a document that contains HTML entities like "&ldquo;", they'll be
converted to Unicode characters:

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a string, the Unicode characters will be encoded as
UTF-8. You won't get the HTML entities back:

str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'

By default, the only characters that are escaped upon output are bare ampersands and
angle brackets. These get turned into "&amp;", "&lt;", and "&gt;", so that Beautiful Soup
doesn't inadvertently generate invalid HTML or XML:

soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the formatter argument to prettify(),
encode(), or decode(). Beautiful Soup recognizes four possible values for formatter.
The default is formatter="minimal". Strings will only be processed enough to ensure that
Beautiful Soup generates valid HTML/XML:

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>

#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>

If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML
entities whenever possible:

print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is
the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in
these examples:

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

Finally, if you pass in a function for formatter, Beautiful Soup will call that function once for
every string and attribute value in the document. You can do whatever you want in this
function. Here's a formatter that converts strings to uppercase and does absolutely nothing
else:

def uppercase(str):
    return str.upper()

print(soup.prettify(formatter=uppercase))
# <html>
#  <body>
#   <p>
#    IL A DIT <<SACRÉ BLEU!>>
#   </p>
#  </body>
# </html>

print(link_soup.a.prettify(formatter=uppercase))
# <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
#  A LINK
# </a>

If you're writing your own function, you should know about the EntitySubstitution class in the
bs4.dammit module. This class implements Beautiful Soup's standard formatters as class
methods: the "html" formatter is EntitySubstitution.substitute_html, and the "minimal"
formatter is EntitySubstitution.substitute_xml. You can use these functions to simulate
formatter="html" or formatter="minimal", but then do something extra.
Here's an example that replaces Unicode characters with HTML entities whenever possible,
but also converts all strings to uppercase:

from bs4.dammit import EntitySubstitution
def uppercase_and_substitute_html_entities(str):
    return EntitySubstitution.substitute_html(str.upper())

print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
# <html>
#  <body>
#   <p>
#    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
#   </p>
#  </body>
# </html>

One last caveat: if you create a CData object, the text inside that object is always presented
exactly as it appears, with no formatting. Beautiful Soup will call the formatter method, just in
case you've written a custom method that counts all the strings in the document or
something, but it will ignore the return value:

from bs4.element import CData
soup = BeautifulSoup("<a></a>")
soup.a.string = CData("one < three")
print(soup.a.prettify(formatter="xml"))
# <a>
#  <![CDATA[one < three]]>
# </a>

get_text()
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
u'\nI linked to example.com\n'

soup.i.get_text()
u'example.com'

You can specify a string to be used to join the bits of text together:
soup.get_text("|")
u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
u'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']

Specifying the parser to use


If you just need to parse some HTML, you can dump the markup into the BeautifulSoup constructor, and it'll probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used.

The first argument to the BeautifulSoup constructor is a string or an open filehandle: the markup you want parsed. The second argument is how you'd like the markup parsed.

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser. You can override this by specifying one of the following:

What type of markup you want to parse. Currently supported are "html", "xml", and "html5".

The name of the parser library you want to use. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).
The section Installing a parser contrasts the supported parsers.

If you don't have an appropriate parser installed, Beautiful Soup will ignore your request and pick a different parser. Right now, the only supported XML parser is lxml. If you don't have lxml installed, asking for an XML parser won't give you one, and asking for "lxml" won't work either.
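As a quick sketch of the two styles (the markup here is made up for illustration, and the second call assumes only the always-available built-in parser is installed):

```python
from bs4 import BeautifulSoup

markup = "<a><b /></a>"

# Ask for a parser library by name.
soup = BeautifulSoup(markup, "html.parser")
print(soup)
# <a><b></b></a>

# Or name a type of markup, and let Beautiful Soup pick the best
# installed parser for it ("html" resolves to html.parser when
# neither lxml nor html5lib is installed).
soup = BeautifulSoup(markup, "html")
```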

Differences between parsers


Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers.
Here's a short document, parsed as HTML:

BeautifulSoup("<a><b /></a>")
# <html><head></head><body><a><b></b></a></body></html>

Since an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.

Here's the same document parsed as XML (running this requires that you have lxml installed). Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag:

BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results. Here's a short, invalid document parsed using lxml's HTML parser. Note that the dangling </p> tag is simply ignored:

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

Here's the same document parsed using html5lib:

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.

Here's the same document parsed with Python's built-in HTML parser:

BeautifulSoup("<a></p>", "html.parser")
# <a></a>

Like lxml, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is the "correct" way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the "correct" way, but all three techniques are legitimate.
Differences between parsers can affect your script. If you're planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the BeautifulSoup constructor. That will reduce the chances that your users parse a document differently from the way you parse it.

Encodings
Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you'll discover it's been converted to Unicode:

markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# u'Sacr\xe9 bleu!'

It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object:

soup.original_encoding
'utf-8'

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding.

Here's a document written in ISO-8859-8. The document is so short that Unicode, Dammit can't get a good lock on it, and misidentifies it as ISO-8859-7:

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>νεμω</h1>
soup.original_encoding
# 'ISO-8859-7'

We can fix this by passing in the correct from_encoding:

soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'iso-8859-8'

If you don't know what the correct encoding is, but you know that Unicode, Dammit is guessing wrong, you can pass the wrong guesses in as exclude_encodings:

soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'WINDOWS-1255'

Windows-1255 isn't 100% correct, but that encoding is a compatible superset of ISO-8859-8, so it's close enough. (exclude_encodings is a new feature in Beautiful Soup 4.4.0.)

In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character REPLACEMENT CHARACTER (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is not an exact representation of the original: some data was lost. If a document contains �, but .contains_replacement_characters is False, you'll know that the � was there originally (as it is in this paragraph) and doesn't stand in for missing data.
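The substitution itself is plain Python behavior. This standard-library sketch (not Unicode, Dammit, just the built-in codec machinery it relies on) shows a stray byte becoming U+FFFD:

```python
# Valid UTF-8 text followed by one byte that can't be UTF-8.
data = b"Sacr\xc3\xa9 bleu!\xff"

text = data.decode("utf-8", errors="replace")
print(text)
# Sacré bleu!�

# The stray byte was replaced by REPLACEMENT CHARACTER.
print(u"\N{REPLACEMENT CHARACTER}" in text)
# True
```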

Output encoding
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn't in UTF-8 to begin with. Here's a document written in the Latin-1 encoding:

markup = b'''
<html>
 <head>
  <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
 </head>
 <body>
  <p>Sacr\xe9 bleu!</p>
 </body>
</html>
'''

soup = BeautifulSoup(markup)
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>

Note that the <meta> tag has been rewritten to reflect the fact that the document is now in UTF-8.
If you don't want UTF-8, you can pass an encoding into prettify():

print(soup.prettify("latin-1"))
# <html>
#  <head>
#   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
# ...

You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string:

soup.p.encode("latin-1")
# '<p>Sacr\xe9 bleu!</p>'
soup.p.encode("utf-8")
# '<p>Sacr\xc3\xa9 bleu!</p>'

Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Here's a document that includes the Unicode character SNOWMAN:

markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there's no representation for that character in ISO-Latin-1 or ASCII, so it's converted into "&#9731;" for those encodings:

print(tag.encode("utf-8"))
# <b>☃</b>
print(tag.encode("latin-1"))
# <b>&#9731;</b>
print(tag.encode("ascii"))
# <b>&#9731;</b>

Unicode, Dammit
You can use Unicode, Dammit without using Beautiful Soup. It's useful whenever you have data in an unknown encoding and you just want it to become Unicode:

from bs4 import UnicodeDammit
dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install the chardet or cchardet Python libraries. The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't use.

Smart quotes
You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
# u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
# u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes:

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# u'<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:

UnicodeDammit(markup, ["windows-1252"]).unicode_markup
# u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'

Inconsistent encodings
Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources. You can use UnicodeDammit.detwingle() to turn such a document into pure UTF-8. Here's a simple example:

snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:

print(doc)
# ☃☃☃�I like snowmen!�
print(doc.decode("windows-1252"))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a UnicodeDecodeError, and decoding it as Windows-1252 gives you gibberish. Fortunately, UnicodeDammit.detwingle() will convert the string to pure UTF-8, allowing you to decode it to Unicode and display the snowmen and quote marks simultaneously:

new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”

UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, I suppose), but this is the most common case.

Note that you must know to call UnicodeDammit.detwingle() on your data before passing it into BeautifulSoup or the UnicodeDammit constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. If you pass it a document that contains both UTF-8 and Windows-1252, it's likely to think the whole document is Windows-1252, and the document will come out looking like "â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”".

UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.

Comparing objects for equality


Beautiful Soup says that two NavigableString or Tag objects are equal when they represent the same HTML or XML markup. In this example, the two <b> tags are treated as equal, even though they live in different parts of the object tree, because they both look like "<b>pizza</b>":

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b, second_b = soup.find_all('b')
print(first_b == second_b)
# True
print(first_b.previous_element == second_b.previous_element)
# False

If you want to see whether two variables refer to exactly the same object, use is:

print(first_b is second_b)
# False

Copying Beautiful Soup objects


You can use copy.copy() to create a copy of any Tag or NavigableString:

import copy
p_copy = copy.copy(soup.p)
print(p_copy)
# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>

The copy is considered equal to the original, since it represents the same markup as the original, but it's not the same object:

print(soup.p == p_copy)
# True
print(soup.p is p_copy)
# False

The only real difference is that the copy is completely detached from the original Beautiful Soup object tree, just as if extract() had been called on it:

print(p_copy.parent)
# None

This is because two different Tag objects can't occupy the same space at the same time.

Parsing only part of a document


Let's say you want to use Beautiful Soup to look at a document's <a> tags. It's a waste of time and memory to parse the entire document and then go over it again looking for <a> tags. It would be much faster to ignore everything that wasn't an <a> tag in the first place. The SoupStrainer class allows you to choose which parts of an incoming document are parsed. You just create a SoupStrainer and pass it in to the BeautifulSoup constructor as the parse_only argument.

(Note that this feature won't work if you're using the html5lib parser. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn't actually make it into the parse tree, it'll crash. To avoid confusion, in the examples below I'll be forcing Beautiful Soup to use Python's built-in parser.)

SoupStrainer
The SoupStrainer class takes the same arguments as a typical method from Searching the tree: name, attrs, string, and **kwargs. Here are three SoupStrainer objects:

from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)

I'm going to bring back the "three sisters" document one more time, and we'll see what the document looks like when it's parsed with these three SoupStrainer objects:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#

You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn't terribly useful, but I thought I'd mention it:

soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u'\n\n', u'...', u'\n']

Troubleshooting
diagnose()
If you're having trouble understanding what Beautiful Soup does to a document, pass the document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you're missing a parser that Beautiful Soup could be using:

from bs4.diagnose import diagnose
data = open("bad.html").read()
diagnose(data)
# Diagnostic running on Beautiful Soup 4.2.0
# Python version 2.7.3 (default, Aug  1 2012, 05:16:07)
# I noticed that html5lib is not installed. Installing it may help.
# Found lxml version 2.3.2.0
#
# Trying to parse your data with html.parser
# Here's what html.parser did with the document:
# ...

Just looking at the output of diagnose() may show you how to solve the problem. Even if not, you can paste the output of diagnose() when asking for help.

Errors when parsing a document


There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an HTMLParser.HTMLParseError. And there is unexpected behavior, where a Beautiful Soup parse tree looks a lot different than the document used to create it.

Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It's because Beautiful Soup doesn't include any parsing code. Instead, it relies on external parsers. If one parser isn't working on a certain document, the best solution is to try a different parser. See Installing a parser for details and a parser comparison.
The most common parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag. These are both generated by Python's built-in HTML parser library, and the solution is to install lxml or html5lib.
The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or html5lib.

Version mismatch problems


SyntaxError: invalid syntax (on the line ROOT_TAG_NAME = u'[document]'): Caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.

ImportError: No module named HTMLParser: Caused by running the Python 2 version of Beautiful Soup under Python 3.

ImportError: No module named html.parser: Caused by running the Python 3 version of Beautiful Soup under Python 2.

ImportError: No module named BeautifulSoup: Caused by running Beautiful Soup 3 code on a system that doesn't have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.

ImportError: No module named bs4: Caused by running Beautiful Soup 4 code on a system that doesn't have BS4 installed.

Parsing XML
By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in "xml" as the second argument to the BeautifulSoup constructor:

soup = BeautifulSoup(markup, "xml")

You'll need to have lxml installed.

Other parser problems


If your script works on one computer but not another, or in one virtual environment but not another, or outside the virtual environment but not inside, it's probably because the two environments have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See Differences between parsers for why this matters, and fix the problem by mentioning a specific parser library in the BeautifulSoup constructor.
Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you'll need to parse the document as XML.

Miscellaneous
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError): This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn't know how to display. (See this page on the Python wiki for help.) Second, when you're writing to a file and you pass in a Unicode character that's not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").

KeyError: [attr]: Caused by accessing tag['attr'] when the tag in question doesn't define the attr attribute. The most common errors are KeyError: 'href' and KeyError: 'class'. Use tag.get('attr') if you're not sure attr is defined, just as you would with a Python dictionary.

AttributeError: 'ResultSet' object has no attribute 'foo': This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a _list_ of tags and strings, a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().

AttributeError: 'NoneType' object has no attribute 'foo': This usually happens because you called find() and then tried to access the .foo attribute of the result. But in your case, find() didn't find anything, so it returned None, instead of returning a tag or a string. You need to figure out why your find() call isn't returning anything.
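Here's a sketch of those fixes in one place (using the built-in parser and a made-up one-link document):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/">Link</a>', "html.parser")

# KeyError: tag["class"] would raise, because this tag has no class
# attribute; tag.get() returns None instead, like a dictionary.
print(soup.a.get("class"))
# None
print(soup.a.get("href"))
# http://example.com/

# AttributeError on NoneType: find() returns None when nothing
# matches, so check the result before touching its attributes.
b_tag = soup.find("b")
print(b_tag)
# None

# UnicodeEncodeError when writing to a file: explicitly encode first.
encoded = u"Sacr\xe9 bleu!".encode("utf8")
```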

Improving Performance
Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you're paying for computer time by the hour, or if there's any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.

That said, there are things you can do to speed up Beautiful Soup. If you're not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.

You can speed up encoding detection significantly by installing the cchardet library.

Parsing only part of a document won't save you much time parsing the document, but it can save a lot of memory, and it'll make searching the document much faster.

Beautiful Soup 3
Beautiful Soup 3 is the previous release series, and is no longer being actively developed. It's currently packaged with all major Linux distributions:

$ apt-get install python-beautifulsoup

It's also published through PyPi as BeautifulSoup:

$ easy_install BeautifulSoup
$ pip install BeautifulSoup

You can also download a tarball of Beautiful Soup 3.2.0.

If you ran easy_install beautifulsoup or easy_install BeautifulSoup, but your code doesn't work, you installed Beautiful Soup 3 by mistake. You need to run easy_install beautifulsoup4.

The documentation for Beautiful Soup 3 is archived online.

Porting code to BS4


Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one simple change. All you should have to do is change the package name from BeautifulSoup to bs4. So this:

from BeautifulSoup import BeautifulSoup

becomes this:

from bs4 import BeautifulSoup

If you get the ImportError "No module named BeautifulSoup", your problem is that you're trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed.

If you get the ImportError "No module named bs4", your problem is that you're trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.

Although BS4 is mostly backwards-compatible with BS3, most of its methods have been deprecated and given new names for PEP 8 compliance. There are numerous other renames and changes, and a few of them break backwards compatibility.

Here's what you'll need to know to convert your BS3 code and habits to BS4:

You need a parser


Beautiful Soup 3 used Python's SGMLParser, a module that was deprecated and removed in Python 3.0. Beautiful Soup 4 uses html.parser by default, but you can plug in lxml or html5lib and use that instead. See Installing a parser for a comparison.

Since html.parser is not the same parser as SGMLParser, you may find that Beautiful Soup 4 gives you a different parse tree than Beautiful Soup 3 for the same markup. If you swap out html.parser for lxml or html5lib, you may find that the parse tree changes yet again. If this happens, you'll need to update your scraping code to deal with the new tree.

Method names
renderContents -> encode_contents
replaceWith -> replace_with
replaceWithChildren -> unwrap
findAll -> find_all
findAllNext -> find_all_next
findAllPrevious -> find_all_previous
findNext -> find_next
findNextSibling -> find_next_sibling
findNextSiblings -> find_next_siblings
findParent -> find_parent
findParents -> find_parents
findPrevious -> find_previous
findPreviousSibling -> find_previous_sibling
findPreviousSiblings -> find_previous_siblings
nextSibling -> next_sibling
previousSibling -> previous_sibling
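The old camelCase names still exist in BS4 as deprecated aliases, so both spellings return the same results (a sketch with a made-up two-tag document):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>one</b><b>two</b>", "html.parser")

# The BS3 spelling (deprecated in BS4, but still works)...
old_style = soup.findAll("b")
# ...and the PEP 8 spelling that new code should use.
new_style = soup.find_all("b")

print(old_style == new_style)
# True
```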

Some arguments to the BeautifulSoup constructor were renamed for the same reasons:

BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)

I renamed one method for compatibility with Python 3:

Tag.has_key() -> Tag.has_attr()

I renamed one attribute to use more accurate terminology:

Tag.isSelfClosing -> Tag.is_empty_element

I renamed three attributes to avoid using words that have special meaning to Python. Unlike the others, these changes are not backwards compatible. If you used these attributes in BS3, your code will break on BS4 until you change them.

UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
Tag.next -> Tag.next_element
Tag.previous -> Tag.previous_element

Generators
I gave the generators PEP 8-compliant names, and transformed them into properties:

childGenerator() -> children
nextGenerator() -> next_elements
nextSiblingGenerator() -> next_siblings
previousGenerator() -> previous_elements
previousSiblingGenerator() -> previous_siblings
recursiveChildGenerator() -> descendants
parentGenerator() -> parents

So instead of this:

for parent in tag.parentGenerator():
    ...

You can write this:

for parent in tag.parents:
    ...

(But the old code will still work.)

Some of the generators used to yield None after they were done, and then stop. That was a bug. Now the generators just stop.

There are two new generators, .strings and .stripped_strings. .strings yields NavigableString objects, and .stripped_strings yields Python strings that have had whitespace stripped.
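For example (a sketch with a made-up document; note the untouched whitespace in .strings versus .stripped_strings):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> one <b> two </b></p>", "html.parser")

# .strings yields NavigableString objects, whitespace and all.
print([s for s in soup.strings])
# [' one ', ' two ']

# .stripped_strings strips each bit of text first.
print([s for s in soup.stripped_strings])
# ['one', 'two']
```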

XML
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in "xml" as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.

Beautiful Soup's handling of empty-element XML tags has been improved. Previously when you parsed XML you had to explicitly say which tags were considered empty-element tags. The selfClosingTags argument to the constructor is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.

Entities
An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed. The BeautifulSoup constructor no longer recognizes the smartQuotesTo or convertEntities arguments. (Unicode, Dammit still has smart_quotes_to, but its default is now to turn smart quotes into Unicode.) The constants HTML_ENTITIES, XML_ENTITIES, and XHTML_ENTITIES have been removed, since they configure a feature (transforming some but not all entities into Unicode characters) that no longer exists.

If you want to turn Unicode characters back into HTML entities on output, rather than turning them into UTF-8 characters, you need to use an output formatter.
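For instance, the "html" formatter described under Output formatters re-entitizes on the way out (a sketch with a made-up document):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(u"<p>Sacr\xe9 bleu!</p>", "html.parser")

# The default formatter leaves the é as a raw UTF-8 character...
print(soup.p.encode("utf-8"))

# ...while formatter="html" turns it back into an entity.
print(soup.p.encode(formatter="html"))
# b'<p>Sacr&eacute; bleu!</p>'
```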

Miscellaneous
Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. (Previously, it was None.)

Multi-valued attributes like class have lists of strings as their values, not strings. This may affect the way you search by CSS class.
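For example (a sketch using the built-in parser):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# class is multi-valued: you get a list of strings, not one string.
print(soup.p["class"])
# ['body', 'strikeout']

# A single-valued attribute like id is still a plain string.
soup = BeautifulSoup('<p id="first"></p>', "html.parser")
print(soup.p["id"])
# first
```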
If you pass one of the find* methods both string and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for string. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.

The BeautifulSoup constructor no longer recognizes the markupMassage argument. It's now the parser's responsibility to handle markup correctly.

The rarely-used alternate parser classes like ICantBelieveItsBeautifulSoup and BeautifulSOAP have been removed. It's now the parser's decision how to handle ambiguous markup.

The prettify() method now returns a Unicode string, not a bytestring.
