Beautiful Soup Documentation (Beautiful Soup 4.4.0)
Getting help
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Quick Start
Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
One common task is extracting all the URLs found within a page's <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
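If a page uses relative links, the same loop can feed into the standard library's urljoin to build absolute URLs. A minimal sketch (the base URL and href values here are made up for illustration, not taken from the example document):

```python
from urllib.parse import urljoin  # Python 3; on Python 2 it lives in urlparse

base = "http://example.com/index.html"
hrefs = ["elsie", "/lacie", "http://other.example/tillie"]

# Relative hrefs are resolved against the page URL; absolute ones pass through.
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
```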
Another common task is extracting all the text from a page:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
Does this look like what you need? If so, read on.
Installing Beautiful Soup

Beautiful Soup 4 is published through PyPi, so if you can't install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4
(The BeautifulSoup package is probably not what you want. That's the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it's still available, but if you're writing new code you should install beautifulsoup4.)

If you don't have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

$ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application's codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.
Installing a parser
Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib
This table summarizes the advantages and disadvantages of each parser library:

Python's html.parser
  Typical usage: BeautifulSoup(markup, "html.parser")
  Advantages: batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2)
  Disadvantages: not very lenient (before Python 2.7.3 or 3.2.2)

lxml's HTML parser
  Typical usage: BeautifulSoup(markup, "lxml")
  Advantages: very fast; lenient
  Disadvantages: external C dependency

lxml's XML parser
  Typical usage: BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")
  Advantages: very fast; the only currently supported XML parser
  Disadvantages: external C dependency

html5lib
  Typical usage: BeautifulSoup(markup, "html5lib")
  Advantages: extremely lenient; parses pages the same way a web browser does; creates valid HTML5
  Disadvantages: very slow; external Python dependency
If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib; Python's built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.
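As a quick, hedged illustration of those differences (output shown only for the standard library parser; what the others build depends on which parsers you have installed):

```python
from bs4 import BeautifulSoup

# Invalid markup: a stray </p> with no matching <p>.
broken = "<a></p>"

# html.parser ignores the stray end tag and closes the dangling <a>:
tree = str(BeautifulSoup(broken, "html.parser")
)
print(tree)

# lxml or html5lib, if installed, would each build a different tree for the
# same markup (html5lib, for instance, adds the <html>/<head>/<body> skeleton
# a browser would).
```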
Making the soup

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to an XML or HTML tag in the original document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.
Name
Every tag has a name, accessible as .name:

tag.name
# u'b'

If you change a tag's name, the change will be reflected in any HTML markup generated by Beautiful Soup:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
Attributes
A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

tag['class']
# u'boldest'

You can access that dictionary directly as .attrs:

tag.attrs
# {u'class': u'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
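Because a tag behaves like a dictionary, .get() also accepts a default value, which is handy when an attribute may be missing. A small sketch (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="/index.html">home</a>', "html.parser").a

print(tag.get("href"))         # the attribute exists, so its value is returned
print(tag.get("id", "no-id"))  # a missing attribute falls back to the default
```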
Multivalued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'
NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode():

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>
You can't edit a string in place, but you can replace one string with another, using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.
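For example, using str(), the Python 3 spelling of unicode():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

navigable = soup.b.string  # a NavigableString, tied to the whole parse tree
plain = str(navigable)     # a plain string copy with no reference to the tree

print(type(plain).__name__)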
BeautifulSoup
The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

Since the BeautifulSoup object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its .name, so it's been given the special .name "[document]":

soup.name
# u'[document]'
Comment

Tag, NavigableString, and BeautifulSoup cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The only one you'll probably ever need to worry about is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here's an example that replaces the comment with a CDATA block:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
Navigating the tree

I'll use the "three sisters" document as an example to show you how to move from one part of a document to another.
Going down
Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children. Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.
Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:
soup.body.b
# <b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag's children are available in a list called .contents:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

len(soup.contents)
# 1
soup.contents[0].name
# u'html'

Instead of getting them as a list, you can iterate over a tag's children using the .children generator:

for child in title_tag.children:
    print(child)
# The Dormouse's story
.descendants
The .contents and .children attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:

head_tag.contents
# [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". There's a sense in which that string is also a child of the <head> tag. The .descendants attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on:

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object only has one direct child (the <html> tag), but it has a whole lot of descendants:

len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
.string
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.string
# u'The Dormouse's story'

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None
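When .string is None because a tag has several children, get_text() is one way to collect all the text anyway. A small sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")

print(soup.p.string)      # None: the <p> tag has more than one child
print(soup.p.get_text())  # concatenates every string inside the tag
```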
.strings and .stripped_strings

If there's more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'

Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
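A common pattern is to join the stripped strings back into a single normalized line. A minimal sketch, using a small made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>two</b>   three</p>", "html.parser")

# Each stripped string has its surrounding whitespace removed, so joining
# with a single space normalizes the text.
text = " ".join(soup.stripped_strings)
print(text)
```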
Going up
Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.
.parent
You can access an element's parent with the .parent attribute. In the example "three sisters" document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

The title string itself has a parent: the <title> tag that contains it:

title_tag.string.parent
# <title>The Dormouse's story</title>
The parent of a top-level tag like <html> is the BeautifulSoup object itself:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
.parents
You can iterate over all of an element's parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
# None
Going sideways
Consider a simple document like this:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

The <b> tag and the <c> tag are at the same level: they're both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.
.next_sibling and .previous_sibling

The <b> tag has a .next_sibling, but no .previous_sibling, because there's nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
The strings "text1" and "text2" are not siblings, because they don't have the same parent:

sibling_soup.b.string
# u'text1'

print(sibling_soup.b.string.next_sibling)
# None
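You can verify both claims directly. This sketch rebuilds the sibling document (without the stray </b> of the example above) and checks parents and siblings:

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

# "text1" and "text2" live under different parents, so they are not siblings...
print(sibling_soup.b.string.parent.name)
print(sibling_soup.c.string.parent.name)

# ...but the <b> and <c> tags themselves are siblings:
print(sibling_soup.b.next_sibling.name)
```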
.next_siblings and .previous_siblings

You can iterate over a tag's siblings with the .next_siblings or .previous_siblings generators.

Going back and forth

An HTML parser takes a string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.
.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it's usually drastically different.

Here's the final <a> tag in the "three sisters" document. Its .next_sibling is a string: the conclusion of the sentence that was interrupted by the start of the <a> tag:

last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling
# u';\nand they lived at the bottom of a well.'
But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it's the word "Tillie":

last_a_tag.next_element
# u'Tillie'

That's because in the original markup, the word "Tillie" appeared before that semicolon. The parser encountered an <a> tag, then the word "Tillie", then the closing </a> tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the <a> tag, but the word "Tillie" was encountered first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one:

last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
.next_elements and .previous_elements

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None
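Because .next_elements yields strings as well as tags, you can filter it with isinstance when you only want one kind of element. A small sketch on a made-up document:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p><b>one</b>two</p>", "html.parser")

# Walk everything parsed after the <p> tag, keeping only the tags.
tag_names = [el.name for el in soup.p.next_elements if isinstance(el, Tag)]
print(tag_names)
```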
Searching the tree

Once again, I'll be using the "three sisters" document as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
By passing in a filter to an argument like find_all(), you can zoom in on the parts of the document you're interested in.
Kinds of filters
Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.
A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
A regular expression
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
This code finds all the tags whose names contain the letter "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title
A list
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
True
The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
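In other words, find_all(True) is a quick way to count or list every tag. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hi</b></p>", "html.parser")

# Two tags (<p> and <b>); the text string "hi" is not included.
all_tags = soup.find_all(True)
print(len(all_tags))
```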
A function
If none of the other matches work for you, define a function that takes an element as its only argument. The function should return True if the argument matches, and False otherwise.

Here's a function that returns True if a tag defines the "class" attribute but doesn't define the "id" attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into find_all() and you'll pick up all the <p> tags:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

This function only picks up the <p> tags. It doesn't pick up the <a> tags, because those tags define both "class" and "id". It doesn't pick up tags like <html> and <title>, because those tags don't define "class".
If you pass in a function to filter on a specific attribute like href, the argument passed into the function will be the attribute value, not the whole tag. Here's a function that finds all a tags whose href attribute does not match a regular expression:

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The function can be as complicated as you need it to be. Here's a function that returns True if a tag is surrounded by string objects:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# p
# a
# a
# a
# p
Now we're ready to look at the search methods in detail.
find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more:
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a value for string, or id? Why does find_all("p", "title") find a <p> tag with the CSS class "title"? Let's look at the arguments to find_all().
The name argument

Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.

This is the simplest usage:

soup.find_all("title")
# [<title>The Dormouse's story</title>]
The keyword arguments

Any argument that's not recognized will be turned into a filter on one of a tag's attributes. You can filter multiple attributes at once by passing in more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Some attributes, like the data-* attributes in HTML 5, have names that can't be used as the names of keyword arguments:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Searching by CSS class

It's very useful to search for a tag that defines a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. You can search by CSS class using the keyword argument class_ instead. As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Remember that a single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you're matching against any of its CSS classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')

css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won't work:

css_soup.find_all("p", class_="strikeout body")
# []

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
The string argument

With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
soup.find_all(string="Elsie")
# [u'Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the <a> tags whose .string is "Elsie":

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
The limit argument

find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don't need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL: it tells Beautiful Soup to stop gathering results after it's found a certain number.

There are three links in the "three sisters" document, but this code only finds the first two:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
The recursive argument

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children's children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
Here's that part of the document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...
The <title> tag is beneath the <html> tag, but it's not directly beneath the <html> tag: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it's allowed to look at all descendants of the <html> tag, but when recursive=False restricts it to the <html> tag's immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as find_all(): name, attrs, string, limit, and the keyword arguments. But the recursive argument is different: find_all() and find() are the only methods that support it. Passing recursive=False into a method like find_parents() wouldn't be very useful.
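The recursive=False behavior is easy to check on a tiny document; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

# <title> is a descendant of <html> but not a direct child, so:
print(soup.html.find_all("title", recursive=False))  # nothing found
print(soup.head.find_all("title", recursive=False))  # found: it is a direct child of <head>
```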
Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it: if you treat the BeautifulSoup object or a Tag object as though it were a function, it's the same as calling find_all() on that object. These two lines of code are equivalent:

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)
find()
Signature: find(name, attrs, recursive, string, **kwargs)

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it's a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all(), you can use the find() method. These two lines of code are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
The only difference is that find_all() returns a list containing the single result, and find() just returns the result. If find_all() can't find anything, it returns an empty list; if find() can't find anything, it returns None.
Remember the soup.head.title trick from Navigating using tag names? That trick works by repeatedly calling find():

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>
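Since find() returns None on a miss, a chain of find() calls can raise AttributeError when an intermediate tag is missing. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

print(soup.find("title"))      # None: nothing matched
print(soup.find_all("title"))  # an empty list

# Chaining through a missing tag raises AttributeError,
# because soup.find("head") is None here:
try:
    soup.find("head").find("title")
except AttributeError as exc:
    print("chained find failed:", exc)
```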
find_parents() and find_parent()

Signature: find_parents(name, attrs, string, limit, **kwargs)

Signature: find_parent(name, attrs, string, **kwargs)

I spent a lot of time above covering find_all() and find(). The Beautiful Soup API defines ten other methods for searching the tree, but don't be afraid. Five of these methods are basically the same as find_all(), and the other five are basically the same as find(). The only differences are in what parts of the tree they search.

First let's consider find_parents() and find_parent(). Remember that find_all() and find() work their way down the tree, looking at tags' descendants. These methods do the opposite: they work their way up the tree, looking at a tag's (or a string's) parents. Let's try them out, starting from a string buried deep in the "three daughters" document:
a_string=soup.find(string="Lacie")
a_string
#u'Lacie'
a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
a_string.find_parents("p", class_="title")
# []
One of the three <a> tags is the direct parent of the string in question, so our search finds it.
One of the three <p> tags is an indirect parent of the string, and our search finds that as well.
find_next_siblings()
and
find_next_sibling()
Signature: find_next_siblings(name, attrs, string, limit, **kwargs)
Signature: find_next_sibling(name, attrs, string, **kwargs)
These methods use .next_siblings to iterate over the rest of an element's siblings in the tree.
The find_next_siblings() method returns all the siblings that match, and find_next_sibling()
only returns the first one:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>
find_previous_siblings()
and
find_previous_sibling()
Signature: find_previous_siblings(name, attrs, string, limit, **kwargs)
Signature: find_previous_sibling(name, attrs, string, **kwargs)
These methods use .previous_siblings to iterate over an element's siblings that precede it in
the tree. The find_previous_siblings() method returns all the siblings that match, and
find_previous_sibling() only returns the first one:
last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>
find_all_next()
and
find_next()
Signature: find_all_next(name, attrs, string, limit, **kwargs)
Signature: find_next(name, attrs, string, **kwargs)
These methods use .next_elements to iterate over whatever tags and strings come
after an element in the document. The find_all_next() method returns all matches, and find_next()
only returns the first match:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_next(string=True)
# [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
first_link.find_next("p")
# <p class="story">...</p>
In the first example, the string "Elsie" showed up, even though it was contained within the
<a> tag we started from. In the second example, the last <p> tag in the document showed
up, even though it's not in the same part of the tree as the <a> tag we started from. For
these methods, all that matters is that an element match the filter and show up later in the
document than the starting element.
find_all_previous()
and
find_previous()
Signature: find_all_previous(name, attrs, string, limit, **kwargs)
Signature: find_previous(name, attrs, string, **kwargs)
These methods use .previous_elements to iterate over the tags and strings that came
before an element in the document. The find_all_previous() method returns all matches, and
find_previous() only returns the first match:
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]
first_link.find_previous("title")
# <title>The Dormouse's story</title>
The call to find_all_previous("p") found the first paragraph in the document (the one with
class="title"), but it also finds the second paragraph, the <p> tag that contains the <a> tag
we started with. This shouldn't be too surprising: we're looking at all the tags that show up
earlier in the document than the one we started with. A <p> tag that contains an <a> tag
must have shown up before the <a> tag it contains.
CSS selectors
Beautiful Soup supports the most commonly-used CSS selectors. Just pass a string into the
.select() method of a Tag object or the BeautifulSoup object itself.
You can find tags:
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
Find tags beneath other tags:
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
Find tags directly beneath other tags:
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
Find the siblings of tags:
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Find tags by CSS class:
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Find tags by ID:
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Find tags that match any selector from a list of selectors:
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
Test for the existence of an attribute:
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Find tags by attribute value:
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Match language codes:
multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]
Find only the first tag that matches a selector:
soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
This is all a convenience for users who know the CSS selector syntax. You can do all this
stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well
use lxml directly: it's a lot faster, and it supports more CSS selectors. But this lets you
combine simple CSS selectors with the Beautiful Soup API.
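For instance (a small sketch with minimal markup of my own), you can locate a tag with select_one() and then continue with the regular tree-navigation API:

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a id="link1" href="http://example.com/elsie">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Narrow down with a CSS selector...
link = soup.select_one("p.story a#link1")

# ...then keep going with the ordinary Beautiful Soup API.
link.get_text()
# 'Elsie'
link.parent.name
# 'p'
```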
Modifying the tree
.string
If you set a tag's .string attribute, the tag's contents are replaced with that string.
Be careful: if the tag contained other tags, they and all their contents will be destroyed.
append()
You can add to a tag's contents with Tag.append(). It works just like calling .append() on a
Python list:
soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")
soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']
NavigableString()
and
.new_tag()
If you need to add a string to a document, no problem: you can pass a Python string into
append(), or you can call the NavigableString constructor:
from bs4 import NavigableString
soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# [u'Hello', u' there']
If you want to create a comment or some other subclass of NavigableString, just call the
constructor:
from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']
(This is a new feature in Beautiful Soup 4.4.0.)
What if you need to create a whole new tag? The best solution is to call the factory method
BeautifulSoup.new_tag():
soup = BeautifulSoup("<b></b>")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
Only the first argument, the tag name, is required.
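One wrinkle worth sketching (my own example, not from the original text): an attribute named class can't be written as a literal keyword argument, because class is a reserved word in Python. You can pass it to new_tag() by unpacking a dictionary instead:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# class is a Python keyword, so unpack it from a dict rather than
# writing new_tag("a", class="sister"), which is a SyntaxError.
new_tag = soup.new_tag("a", href="http://www.example.com", **{"class": "sister"})
soup.b.append(new_tag)
soup.b
# <b><a class="sister" href="http://www.example.com"></a></b>
```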
insert()
Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the
end of its parent's .contents. It'll be inserted at whatever numeric position you say, just like
.insert() on a Python list.
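A short sketch of the numbered-position behavior (the position argument is an index into the tag's .contents):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Insert at index 1 of .contents, between the string
# "I linked to " and the <i> tag.
tag.insert(1, "but did not endorse ")
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
```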
insert_before()
and
insert_after()
The insert_before() method places a tag or string immediately before something else in the
parse tree, and the insert_after() method moves a tag or string so that it immediately follows
something else in the parse tree:
soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']
clear()
Tag.clear() removes the contents of a tag:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.clear()
tag
# <a href="http://example.com/"></a>
extract()
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that
was extracted:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
i_tag = soup.i.extract()
a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>
print(i_tag.parent)
# None
decompose()
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>
replace_with()
PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag
or string of your choice:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
replace_with() returns the tag or string that was replaced, so that you can examine it or add
it back to another part of the tree.
wrap()
PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper:
soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
This method is new in Beautiful Soup 4.0.5.
unwrap()
Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever's inside that tag. It's
good for stripping out markup:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>
Output
Pretty-printing
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode
string, with each HTML/XML tag on its own line:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
Non-pretty printing
If you just want a string, with no fancy formatting, you can call unicode() or str() on a
BeautifulSoup object, or a Tag within it:
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
Output formatters
If you give Beautiful Soup a document that contains HTML entities like "&ldquo;", they'll be
converted to Unicode characters:
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
If you then convert the document to a string, the Unicode characters will be encoded as
UTF-8. You won't get the HTML entities back:
str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
By default, the only characters that are escaped upon output are bare ampersands and
angle brackets. These get turned into "&amp;", "&lt;", and "&gt;", so that Beautiful Soup
doesn't inadvertently generate invalid HTML or XML:
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
You can change this behavior by providing a value for the formatter argument. The default
is formatter="minimal", in which strings are processed just enough to ensure that Beautiful
Soup generates valid HTML/XML:
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
If you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML
entities whenever possible:
print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
get_text()
If you only want the text part of a document or tag, you can use the get_text() method. It
returns all the text in a document or beneath a tag, as a single Unicode string:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
u'\nI linked to example.com\n'
soup.i.get_text()
u'example.com'
You can specify a string to be used to join the bits of text together:
soup.get_text("|")
u'\nI linked to|example.com|\n'
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of
text:
soup.get_text("|", strip=True)
u'I linked to|example.com'
But at that point you might want to use the .stripped_strings generator instead, and process
the text yourself:
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
Here's a short document, parsed as HTML:
BeautifulSoup("<a><b/></a>")
# <html><head></head><body><a><b></b></a></body></html>
Since an empty <b/> tag is not valid HTML, the parser turns it into a <b></b> tag pair.
Here's the same document parsed as XML (running this requires that you have lxml
installed). Note that the empty <b/> tag is left alone, and that the document is given an XML
declaration instead of being put into an <html> tag:
BeautifulSoup("<a><b/></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-
formed HTML document, these differences won't matter. One parser will be faster than
another, but they'll all give you a data structure that looks exactly like the original HTML
document.
But if the document is not perfectly-formed, different parsers will give different results.
Here's a short, invalid document parsed using lxml's HTML parser. Note that the dangling
</p> tag is simply ignored:
BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
Here's the same document parsed using html5lib:
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. This
parser also adds an empty <head> tag to the document.
Here's the same document parsed with Python's built-in HTML parser:
BeautifulSoup("<a></p>", "html.parser")
# <a></a>
Like html5lib, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no
attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it
doesn't even bother to add an <html> tag.
Since the document "<a></p>" is invalid, none of these techniques is the "correct" way to
handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it
has the best claim on being the "correct" way, but all three techniques are legitimate.
Differences between parsers can affect your script. If you're planning on distributing your
script to other people, or running it on multiple machines, you should specify a parser in the
BeautifulSoup constructor. That will reduce the chances that your users parse a document
differently from the way you parse it.
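A minimal sketch of this, reusing the invalid "<a></p>" document from above: pinning the parser makes the result predictable on every machine, regardless of which optional parsers happen to be installed.

```python
from bs4 import BeautifulSoup

# Naming the parser explicitly means every machine builds the
# same tree, even if some users have lxml or html5lib installed.
soup = BeautifulSoup("<a></p>", "html.parser")
str(soup)
# '<a></a>'
```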
Encodings
Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when
you load that document into Beautiful Soup, you'll discover it's been converted to Unicode:
markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# u'Sacr\xe9 bleu!'
It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called Unicode,
Dammit to detect a document's encoding and convert it to Unicode. The autodetected
encoding is available as the .original_encoding attribute of the BeautifulSoup object:
soup.original_encoding
# 'utf-8'
Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes.
Sometimes it guesses correctly, but only after a byte-by-byte search of the document that
takes a very long time. If you happen to know a document's encoding ahead of time, you
can avoid mistakes and delays by passing it to the BeautifulSoup constructor as
from_encoding.
Here's a document written in ISO-8859-8. The document is so short that Unicode, Dammit
can't get a good lock on it, and misidentifies it as ISO-8859-7:
markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>νεμω</h1>
soup.original_encoding
# 'ISO-8859-7'
We can fix this by passing in the correct from_encoding:
soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'iso-8859-8'
If you don't know what the correct encoding is, but you know that Unicode, Dammit is
guessing wrong, you can pass the wrong guesses in as exclude_encodings:
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'WINDOWS-1255'
Windows-1255 isn't 100% correct, but that encoding is a compatible superset of ISO-8859-8,
so it's close enough. (exclude_encodings is a new feature in Beautiful Soup 4.4.0.)
In rare cases (usually when a UTF-8 document contains text written in a completely different
encoding), the only way to get Unicode may be to replace some characters with the special
Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit
needs to do this, it will set the .contains_replacement_characters attribute to True on the
UnicodeDammit or BeautifulSoup object. This lets you know that the Unicode representation is
not an exact representation of the original; some data was lost. If a document contains �,
but .contains_replacement_characters is False, you'll know that the � was there originally (as it
is in this paragraph) and doesn't stand in for missing data.
Output encoding
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the
document wasn't in UTF-8 to begin with. Here's a document written in the Latin-1 encoding:
markup = b'''
<html>
 <head>
  <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type"/>
 </head>
 <body>
  <p>Sacr\xe9 bleu!</p>
 </body>
</html>
'''
soup = BeautifulSoup(markup)
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>
Note that the <meta> tag has been rewritten to reflect the fact that the document is now in
UTF-8.
If you don't want UTF-8, you can pass an encoding into prettify():
print(soup.prettify("latin-1"))
# <html>
#  <head>
#   <meta content="text/html; charset=latin-1" http-equiv="Content-type"/>
# ...
Any characters that can't be represented in your chosen encoding will be converted into
numeric XML entity references. Here's a document that includes the Unicode character
SNOWMAN:
markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b
The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there's no
representation for that character in ISO-Latin-1 or ASCII, so it's converted into "&#9731;" for
those encodings:
print(tag.encode("utf-8"))
# <b>☃</b>
print(tag.encode("latin-1"))
# <b>&#9731;</b>
print(tag.encode("ascii"))
# <b>&#9731;</b>
Unicode, Dammit
You can use Unicode, Dammit without using Beautiful Soup. It's useful whenever you have
data in an unknown encoding and you just want it to become Unicode:
from bs4 import UnicodeDammit
dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'
Unicode, Dammit has two special features that Beautiful Soup doesn't use.
Smart quotes
You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
# u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
# u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
You can also convert Microsoft smart quotes to ASCII quotes:
UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
Hopefully you'll find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup
prefers the default behavior, which is to convert Microsoft smart quotes to Unicode
characters along with everything else:
UnicodeDammit(markup, ["windows-1252"]).unicode_markup
# u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
Inconsistent encodings
Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as
(again) Microsoft smart quotes. This can happen when a website includes data from multiple
sources. You can use UnicodeDammit.detwingle() to turn such a document into pure UTF-8.
Here's a simple example:
snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")
This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252.
You can display the snowmen or the quotes, but not both:
print(doc)
# ☃☃☃“I like snowmen!”
print(doc.decode("windows-1252"))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”
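Running the mixed document through UnicodeDammit.detwingle() converts the Windows-1252 stretches so the whole document becomes valid UTF-8 (a sketch continuing the example above):

```python
from bs4 import UnicodeDammit

snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() sniffs out the Windows-1252 byte sequences and
# re-encodes them, leaving a document that is pure UTF-8.
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”
```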
UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or
vice versa, I suppose), but this is the most common case.
Note that you must know to call UnicodeDammit.detwingle() on your data before passing it into
BeautifulSoup or the UnicodeDammit constructor. Beautiful Soup assumes that a document has
a single encoding, whatever it might be. If you pass it a document that contains both UTF-8
and Windows-1252, it's likely to think the whole document is Windows-1252, and the
document will come out looking like â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.
UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.
If you want to see whether two variables refer to exactly the same object, use is:
print(first_b is second_b)
#False
The copy is considered equal to the original, since it represents the same markup as the
original, but it's not the same object:
print(soup.p == p_copy)
# True
print(soup.p is p_copy)
# False
The only real difference is that the copy is completely detached from the original Beautiful
Soup object tree, just as if extract() had been called on it:
print(p_copy.parent)
# None
SoupStrainer
The SoupStrainer class takes the same arguments as a typical method from Searching the
tree: name, attrs, string, and **kwargs. Here are three SoupStrainer objects:
from bs4 import SoupStrainer
only_a_tags = SoupStrainer("a")
only_tags_with_id_link2 = SoupStrainer(id="link2")
def is_short_string(string):
    return len(string) < 10
only_short_strings = SoupStrainer(string=is_short_string)
I'm going to bring back the "three sisters" document one more time, and we'll see what the
document looks like when it's parsed with these three SoupStrainer objects:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#
You can also pass a SoupStrainer into any of the methods covered in Searching the tree.
This probably isn't terribly useful, but I thought I'd mention it:
soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u'\n\n', u'...', u'\n']
Troubleshooting
diagnose()
If you're having trouble understanding what Beautiful Soup does to a document, pass the
document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will
print out a report showing you how different parsers handle the document, and tell you if
you're missing a parser that Beautiful Soup could be using:
from bs4.diagnose import diagnose
data = open("bad.html").read()
diagnose(data)
# Diagnostic running on Beautiful Soup 4.2.0
# Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
# I noticed that html5lib is not installed. Installing it may help.
# Found lxml version 2.3.2.0
#
# Trying to parse your data with html.parser
# Here's what html.parser did with the document:
# ...
Just looking at the output of diagnose() may show you how to solve the problem. Even if
not, you can paste the output of diagnose() when asking for help.
Common parse-time errors are HTMLParser.HTMLParseError: malformed start tag and
HTMLParser.HTMLParseError: bad end tag. These are both generated by Python's built-in HTML
parser library, and the solution is to install lxml or html5lib.
The most common type of unexpected behavior is that you can't find a tag that you know is
in the document. You saw it going in, but find_all() returns [] or find() returns None. This is
another common problem with Python's built-in HTML parser, which sometimes skips tags it
doesn't understand. Again, the solution is to install lxml or html5lib.
Parsing XML
By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass
in "xml" as the second argument to the BeautifulSoup constructor:
soup = BeautifulSoup(markup, "xml")
You'll need to have lxml installed.
Because HTML parsers treat tag and attribute names as case-insensitive, markup like
<TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and
attributes, you'll need to parse the document as XML.
Miscellaneous
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or
just about any other UnicodeEncodeError): This is not a problem with Beautiful Soup.
This problem shows up in two main situations. First, when you try to print a Unicode
character that your console doesn't know how to display. (See this page on the Python
wiki for help.) Second, when you're writing to a file and you pass in a Unicode
character that's not supported by your default encoding. In this case, the simplest
solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
KeyError: [attr]: Caused by accessing tag['attr'] when the tag in question doesn't
define the attr attribute. The most common errors are KeyError: 'href' and KeyError:
'class'. Use tag.get('attr') if you're not sure attr is defined, just as you would with a
Python dictionary.
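A minimal sketch of the safe pattern (my own example):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="http://example.com/">A link</a>', "html.parser").a

tag.get("href")          # the attribute exists
# 'http://example.com/'
print(tag.get("title"))  # the attribute is missing: None, not a KeyError
# None
```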
AttributeError: 'ResultSet' object has no attribute 'foo': This usually happens
because you expected find_all() to return a single tag or string. But find_all() returns
a _list_ of tags and strings, a ResultSet object. You need to iterate over the list and
look at the .foo of each one. Or, if you really only want one result, you need to use
find() instead of find_all().
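For instance (a small sketch), iterate over the ResultSet instead of asking the ResultSet itself for an attribute:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/one">one</a>'
    '<a href="http://example.com/two">two</a>', "html.parser")

# soup.find_all("a").get("href") would raise AttributeError;
# look at each tag in the list instead.
hrefs = [a.get("href") for a in soup.find_all("a")]
hrefs
# ['http://example.com/one', 'http://example.com/two']
```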
AttributeError: 'NoneType' object has no attribute 'foo': This usually happens because
you called find() and then tried to access the .foo attribute of the result. But in your
case, find() didn't find anything, so it returned None instead of returning a tag or a
string. You need to figure out why your find() call isn't returning anything.
Improving Performance
Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you're paying for computer time by the hour, or if there's any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.
That said, there are things you can do to speed up Beautiful Soup. If you're not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.
You can speed up encoding detection significantly by installing the cchardet library.
Parsing only part of a document won't save you much time parsing the document, but it can save a lot of memory, and it'll make searching the document much faster.
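Parsing only part of a document is done with a SoupStrainer passed as parse_only; a minimal sketch:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Only <a> tags are parsed into the tree; everything else is discarded.
only_a_tags = SoupStrainer('a')
soup = BeautifulSoup(
    '<p>ignored</p><a href="link.html">kept</a>',
    'html.parser', parse_only=only_a_tags)
```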
Beautiful Soup 3
Beautiful Soup 3 is the previous release series, and is no longer being actively developed. It's currently packaged with all major Linux distributions:
$ apt-get install python-beautifulsoup
It's also published through PyPi as BeautifulSoup:
$ easy_install BeautifulSoup
$ pip install BeautifulSoup
You can also download a tarball of Beautiful Soup 3.2.0.
If you ran easy_install beautifulsoup or easy_install BeautifulSoup, but your code doesn't work, you installed Beautiful Soup 3 by mistake. You need to run easy_install beautifulsoup4.
The documentation for Beautiful Soup 3 is archived online.
The first step in porting your code is to change the package import. This:

from BeautifulSoup import BeautifulSoup

becomes this:

from bs4 import BeautifulSoup
If you get the ImportError "No module named BeautifulSoup", your problem is that you're trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed.
If you get the ImportError "No module named bs4", your problem is that you're trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.
Although BS4 is mostly backwards-compatible with BS3, most of its methods have been deprecated and given new names for PEP 8 compliance. There are numerous other renames and changes, and a few of them break backwards compatibility.
Here's what you'll need to know to convert your BS3 code and habits to BS4:
You need a parser. Beautiful Soup 3 used Python's SGMLParser, a module that was deprecated and removed in Python 3.0. Beautiful Soup 4 uses html.parser by default, but you can plug in lxml or html5lib and use that instead. See Installing a parser for a comparison.
Since html.parser is not the same parser as SGMLParser, you may find that Beautiful Soup 4 gives you a different parse tree than Beautiful Soup 3 for the same markup. If you swap out html.parser for lxml or html5lib, you may find that the parse tree changes yet again. If this happens, you'll need to update your scraping code to deal with the new tree.
Method names
renderContents -> encode_contents
replaceWith -> replace_with
replaceWithChildren -> unwrap
findAll -> find_all
findAllNext -> find_all_next
findAllPrevious -> find_all_previous
findNext -> find_next
findNextSibling -> find_next_sibling
findNextSiblings -> find_next_siblings
findParent -> find_parent
findParents -> find_parents
findPrevious -> find_previous
findPreviousSibling -> find_previous_sibling
findPreviousSiblings -> find_previous_siblings
nextSibling -> next_sibling
previousSibling -> previous_sibling
Some arguments to the BeautifulSoup constructor were renamed for the same reasons:
BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
I renamed one method for compatibility with Python 3:
Tag.has_key() -> Tag.has_attr()
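For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/">link</a>', 'html.parser')
tag = soup.a
# Tag.has_key() is gone; use has_attr() instead.
has_href = tag.has_attr('href')
has_id = tag.has_attr('id')
```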
I renamed one attribute to use more accurate terminology:
Tag.isSelfClosing -> Tag.is_empty_element
I renamed three attributes to avoid using words that have special meaning to Python. Unlike the others, these changes are not backwards compatible. If you used these attributes in BS3, your code will break on BS4 until you change them.
UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
Tag.next -> Tag.next_element
Tag.previous -> Tag.previous_element
Generators
I gave the generators PEP 8-compliant names, and transformed them into properties:
childGenerator() -> children
nextGenerator() -> next_elements
nextSiblingGenerator() -> next_siblings
previousGenerator() -> previous_elements
previousSiblingGenerator() -> previous_siblings
recursiveChildGenerator() -> descendants
parentGenerator() -> parents
So instead of this:

for parent in tag.parentGenerator():
    ...
You can write this:

for parent in tag.parents:
    ...
(But the old code will still work.)
Some of the generators used to yield None after they were done, and then stop. That was a bug. Now the generators just stop.
There are two new generators, .strings and .stripped_strings. .strings yields NavigableString objects, and .stripped_strings yields Python strings that have had whitespace stripped.
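A quick illustration of the difference between the two:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p> One </p><p> Two </p>', 'html.parser')
raw = list(soup.strings)             # NavigableStrings, whitespace intact
clean = list(soup.stripped_strings)  # plain strings, whitespace stripped
```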
XML
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in "xml" as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
Beautiful Soup's handling of empty-element XML tags has been improved. Previously when you parsed XML you had to explicitly say which tags were considered empty-element tags. The selfClosingTags argument to the constructor is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.
Entities
An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed. The BeautifulSoup constructor no longer recognizes the smartQuotesTo or convertEntities arguments. (Unicode, Dammit still has smart_quotes_to, but its default is now to turn smart quotes into Unicode.) The constants HTML_ENTITIES, XML_ENTITIES, and XHTML_ENTITIES have been removed, since they configure a feature (transforming some but not all entities into Unicode characters) that no longer exists.
If you want to turn Unicode characters back into HTML entities on output, rather than turning them into UTF-8 characters, you need to use an output formatter.
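For instance, the "html" formatter substitutes named HTML entities on output:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(u'<p>caf\u00e9</p>', 'html.parser')
default_out = soup.decode()                  # Unicode character preserved
entity_out = soup.decode(formatter="html")   # character becomes &eacute;
```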
Miscellaneous
Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. (Previously, it was None.)
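For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>hello</b></p>', 'html.parser')
# <p> contains only <b>, so p.string recurses into it.
result = soup.p.string
```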
Multi-valued attributes like class have lists of strings as their values, not strings. This may affect the way you search by CSS class.
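For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
classes = soup.p['class']   # a list, not the string "body strikeout"
```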
If you pass one of the find* methods both string and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for string. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.
The BeautifulSoup constructor no longer recognizes the markupMassage argument. It's now the parser's responsibility to handle markup correctly.
The rarely-used alternate parser classes like ICantBelieveItsBeautifulSoup and BeautifulSOAP have been removed. It's now the parser's decision how to handle ambiguous markup.
The prettify() method now returns a Unicode string, not a bytestring.