= 4.8.2 (20191224)
* The html.parser tree builder now correctly handles DOCTYPEs that are
not uppercase. [bug=1848401]
= 4.8.1 (20191006)
* The role of Formatter objects has been greatly expanded. The Formatter
class now controls the following:
* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
recognized as a named entity and converted to a single quote. [bug=1818721]
= 4.7.1 (20190106)
= 4.7.0 (20181231)
= 4.6.3 (20180812)
= 4.6.2 (20180812)
= 4.6.1 (20180728)
= 4.6.0 (20170507) =
* Improved the handling of empty-element tags like <br> when using the
html.parser parser. [bug=1676935]
* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
element tags) correctly. [bug=1656909]
= 4.5.3 (20170102) =
* Fixed yet another problem that caused the html5lib tree builder to
create a disconnected parse tree. [bug=1629825]
= 4.5.2 (20170102) =
= 4.5.1 (20160802) =
= 4.5.0 (20160719) =
This happened in previous versions, but only when the value being
searched for was a string. Now it also works when that value is
a regular expression, a list of strings, etc. [bug=1476868]
* Fixed a bug that deranged the tree when a whitespace element was
reparented into a tag that contained an identical whitespace
element. [bug=1505351]
* Added support for CSS selector values that contain quoted spaces,
such as tag[style="display: foo"]. [bug=1540588]
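A minimal sketch of the quoted-value selector described above, assuming Beautiful Soup 4 is installed and using the built-in html.parser. The markup and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <div> has a style value containing a space.
html = '<div style="display: foo">styled</div><div style="display:bar">other</div>'
soup = BeautifulSoup(html, "html.parser")

# The attribute value contains a space, so it is quoted inside the selector.
matches = soup.select('div[style="display: foo"]')
texts = [tag.get_text() for tag in matches]
print(texts)
```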
= 4.4.1 (20150928) =
= 4.4.0 (20150703) =
New features:
Bug fixes:
* Fixed yet another problem that caused the html5lib tree builder to
create a disconnected parse tree. [bug=1237763]
* Fixed yet another bug that caused a disconnected tree when html5lib
copied an element from one part of the tree to another. [bug=1270611]
* The select() method can now find tags whose names contain
dashes. Patch by Francisco Canas. [bug=1276211]
* The select() method can now find tags with attributes whose names
contain dashes. Patch by Marek Kapolka. [bug=1304007]
* Restored the helpful syntax error that happens when you try to
import the Python 2 edition of Beautiful Soup under Python
3. [bug=1213387]
* The warning when you pass in a filename or URL as markup will now be
displayed correctly even if the filename or URL is a Unicode
string. [bug=1268888]
= 4.3.2 (20131002) =
* Combined two tests to stop a spurious test failure when tests are
run by nosetests. [bug=1212445]
= 4.3.1 (20130815) =
* Fixed yet another problem with the html5lib tree builder, caused by
html5lib's tendency to rearrange the tree during
parsing. [bug=1189267]
= 4.3.0 (20130812) =
= 4.2.1 (20130531) =
* The default XML formatter will now replace ampersands even if they
appear to be part of entities. That is, "&lt;" will become
"&amp;lt;". The old code was left over from Beautiful Soup 3, which
didn't always turn entities into Unicode characters.
If you really want the old behavior (maybe because you add new
strings to the tree, those strings include entities, and you want
the formatter to leave them alone on output), it can be found in
EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]
* Fixed another bug by which the html5lib tree builder could create a
disconnected tree. [bug=1182089]
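The difference between the default XML formatter and the older behavior can be seen by calling the two EntitySubstitution methods directly. This is a small sketch assuming Beautiful Soup 4 is installed; the sample string is made up.

```python
from bs4.dammit import EntitySubstitution

# Hypothetical input: one bare ampersand plus two things that look like entities.
text = "AT&T &lt;rocks&gt;"

# Default behavior: every ampersand is escaped, even one that starts an entity.
escaped_all = EntitySubstitution.substitute_xml(text)

# Old behavior: ampersands that already begin an entity are left alone.
escaped_bare = EntitySubstitution.substitute_xml_containing_entities(text)

print(escaped_all)
print(escaped_bare)
```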
= 4.2.0 (20130514) =
- Added support for the adjacent sibling combinator (+) and the
general sibling combinator (~). Tests by "liquider". [bug=1082144]
- The combinators (>, +, and ~) can now combine with any supported
selector, not just one that selects based on tag name.
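The sibling combinators listed above might be exercised like this. The sketch assumes a current Beautiful Soup 4 install and the built-in html.parser; the markup is invented for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<div><h2 class="t">Title</h2><p id="a">one</p><p id="b">two</p></div>'
    '<p id="c">outside</p>'
)
soup = BeautifulSoup(html, "html.parser")

# "+" selects only the sibling immediately following the <h2>.
adjacent = [p["id"] for p in soup.select("h2 + p")]

# "~" selects every later sibling of the <h2> that is a <p>.
general = [p["id"] for p in soup.select("h2 ~ p")]

# The left-hand side may be a class selector rather than a tag name.
by_class = [p["id"] for p in soup.select(".t ~ p")]

print(adjacent, general, by_class)
```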
The alias may change in the future, so don't use this in code you're
going to run more than once.
* Methods like get_text() and properties like .strings now only give
you strings that are visible in the document--no comments or
processing commands. [bug=1050164]
* Fix a bug in the lxml treebuilder which crashed when a tag included
an attribute from the predefined "xml:" namespace. [bug=1065617]
* Now that lxml's segfault on invalid doctype has been fixed, fixed a
corresponding problem on the Beautiful Soup end that was previously
invisible. [bug=984936]
= 4.1.3 (20120820) =
* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
test failure caused by the lousy HTMLParser in those
versions. [bug=1038503]
= 4.1.2 (20120817) =
= 4.1.1 (20120703) =
= 4.1.0 (20120529) =
* Fixed a bug with the lxml treebuilder that prevented the user from
adding attributes to a tag that didn't originally have
attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.
This caused a major refactoring of the search code. All the tests
pass, but it's possible that some searches will behave differently.
= 4.0.5 (20120427) =
* The test suite now passes when lxml is not installed, whether or not
html5lib is installed. [bug=987004]
= 4.0.4 (20120416) =
* Fixed a bug with the string setter that moved a string around the
tree instead of copying it. [bug=983050]
* Attribute values are now run through the provided output formatter.
Previously they were always run through the 'minimal' formatter. In
the future I may make it possible to specify different formatters
for attribute values and strings, but for now, consistent behavior
is better than inconsistent behavior. [bug=980237]
* Give a more useful error when the user tries to run the Python 2
version of BS under Python 3.
= 4.0.3 (20120403) =
= 4.0.2 (20120326) =
* Fixed a bug where specifying `text` while also searching for a tag
only worked if `text` wanted an exact string match. [bug=955942]
= 4.0.1 (20120314) =
* This is the first official release of Beautiful Soup 4. There is no
4.0.0 release, to eliminate any possibility that packaging software
might treat "4.0.0" as being an earlier version than "4.0.0b10".
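The ordering concern can be illustrated with plain string comparison. This is only a naive stand-in; real packaging tools apply their own version-ordering rules, which this sketch does not model.

```python
final = "4.0.0"
beta = "4.0.0b10"

# "4.0.0" is a prefix of "4.0.0b10", so a naive comparison sorts it first.
naive_final_is_older = final < beta

# "4.0.1" differs in the third component before any suffix is reached.
first_release_is_newer = "4.0.1" > beta

print(naive_final_is_older, first_release_is_newer)
```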
= 4.0.0b10 (20120302) =
* Added support for simple CSS selectors, taken from the soupselect project.
= 4.0.0b9 (20120228) =
* Fixed a test failure that occurred on Python 3.x when chardet was
installed.
= 4.0.0b8 (20120224) =
= 4.0.0b7 (20120223) =
* Upon decoding to string, any characters that can't be represented in
your chosen encoding will be converted into numeric XML entity
references.
* About 100 unit tests that "test" the behavior of various parsers on
invalid markup have been removed. Legitimate changes to those
parsers caused these tests to fail, indicating that perhaps
Beautiful Soup should not test the behavior of foreign
libraries.
This makes Beautiful Soup compatible with html5lib version 0.95 and
future versions of HTMLParser.
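The numeric entity conversion mentioned in the first item above has a close analogue in Python's standard library. The sketch below uses the xmlcharrefreplace error handler to show the effect; it is not a claim about how Beautiful Soup implements the conversion internally.

```python
# Sample text containing two characters that ASCII cannot represent.
text = "Sacr\u00e9 bleu \u2603"

# Unencodable characters become numeric XML character references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(ascii_bytes)
```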
= 4.0.0b6 (20120216) =
= 4.0.0b5 (20120209) =
This actually affects all attributes that the HTML standard defines
as taking multiple values (class, rel, rev, archive, accept-charset,
and headers), but 'class' is by far the most common. [bug=41034]
* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
like <meta charset="utf-8" />. [bug=837268]
* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
on certain kinds of markup. [bug=838800]
* Fixed a bug that wrecked the tree if you replaced an element with an
empty string. [bug=728697]
= 4.0.0b4 (20120208) =
= 4.0.0b3 (20120203) =
Beautiful Soup 4.0 comes with glue code for four parsers:
Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.
Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.
=== CDATA sections are normal text, if they're understood at all. ===
Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:
A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:
The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:
The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.
= 3.2.0 =
The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.
= 3.1.0 =
1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.
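A short Python 3 illustration of the point about str(), using only built-in types. The sample text is made up.

```python
text = "sacr\u00e9 bleu"

# In Python 3, str() yields a Unicode string.
as_text = str(text)

# A byte string has to be requested explicitly by encoding.
as_bytes = text.encode("utf-8")

print(type(as_text).__name__, len(as_text))
print(type(as_bytes).__name__, len(as_bytes))
```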
<a href="http://crummy.com?sacré&bleu">
= 3.0.7a =
= 3.0.7 =
Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.
Beautiful Soup can now parse a doctype that's scoped to an XML namespace.
= 3.0.6 =
Got rid of a very old debug line that prevented chardet from working.
Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.
Fixed a bug that garbled some <meta> tags when rewriting them.
= 3.0.5 =
The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)
= 3.0.4 =
Fixed some unit test failures when running against Python 2.5.
= 3.0.3 (20060606) =
= 3.0.2 (20060602) =
I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!
Fixed a bug where the findAll method wasn't passing along any keyword
arguments.
When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.
= 3.0.1 (20060530) =
Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.
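The keyword problem and the shortcut can both be demonstrated in a few lines. This sketch assumes Beautiful Soup 4, where the same positional shortcut is still available; the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="foo" href="/x">x</a><a href="/y">y</a>', "html.parser")

# Writing class='foo' as a keyword argument is rejected by the Python compiler.
try:
    compile("soup('a', class='foo')", "<example>", "eval")
    keyword_is_legal = True
except SyntaxError:
    keyword_is_legal = False

# The shortcut: a bare string as the second argument is matched against "class".
by_shortcut = [a["href"] for a in soup("a", "foo")]

print(keyword_is_legal, by_shortcut)
```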
= 3.0.0 "Who would not give all else for two p" (20060528) =
The documentation has been rewritten and greatly expanded with many
more examples.
Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]
It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]
Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.
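A sketch of the True case only, written against Beautiful Soup 4's find_all() rather than the 3.0 API this entry describes. The markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x">x</a><a name="top">anchor</a>', "html.parser")

# href=True matches any <a> that has an href attribute, whatever its value.
with_href = [a.get_text() for a in soup.find_all("a", href=True)]

print(with_href)
```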
Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.
You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]
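Partial parsing with a SoupStrainer might look like the following. The sketch assumes Beautiful Soup 4, where the strainer is passed through the parse_only argument, and uses made-up markup.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    "<html><body><p>intro</p><a href='/1'>one</a>"
    "<div><a href='/2'>two</a></div></body></html>"
)

# Keep only <a> tags; everything else is discarded during parsing.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

names = [tag.name for tag in soup.find_all(True)]
hrefs = [a["href"] for a in soup.find_all("a")]

print(names, hrefs)
```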
Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.
findText and fetchText are gone. Just pass a text argument into find
or findAll.
Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.
Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.
When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]
= 2.1.1 (20050918) =
Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.
THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.
The fetch method and its derivatives now accept a limit argument.
You can now pass keyword arguments when calling a Tag object as though
it were a method.
Fixed a bug that caused all hand-created tags to share a single set of
attributes.
= 2.0.3 (20050501) =
Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.
Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.
= 2.0.2 (20050416) =
Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.
= 2.0.1 (20050412) =
Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, e.g. tag.table.td.
Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.
== Parsing ==
The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.
You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.
== Tree traversal ==
You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.
If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.
1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
traverse a tree in very little code:
 for header in soup.bodyTag.pTag.tableTag('th'):
There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:
<p><ul><li>Foo<br /><li>Bar</ul>
The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.
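The sibling relations can be checked with a small example. This sketch assumes Beautiful Soup 4, which spells the attributes previous_sibling and next_sibling and returns None where 2.x returned Null; the markup is a well-formed variant of the list above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo<br/></li><li>Bar</li></ul>", "html.parser")
first, second = soup.find_all("li")

# The string "Foo" is the first child of the first <li>.
foo = first.contents[0]

print(first.previous_sibling)         # nothing precedes the first <li>
print(first.next_sibling is second)   # the two <li> tags are siblings
print(foo.next_sibling.name)          # the <br> follows the string "Foo"
```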
I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.
== Tree manipulation ==
You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.
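Adding, changing, and deleting attributes might look like this. The sketch uses the dictionary-style access of Beautiful Soup 4, not the 2.0 API itself, and the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x" title="old">x</a>', "html.parser")
link = soup.a

link["rel"] = "nofollow"   # add a new attribute
link["href"] = "/y"        # change an existing attribute
del link["title"]          # delete an attribute

attrs = dict(link.attrs)
print(attrs)
```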
== Porting Considerations ==
In the post-1.2 release you could pass in a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.
It's no longer possible to pass in SQL-style wildcards to fetch(). Use a
regular expression instead.
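Rough regular-expression equivalents for common wildcard patterns, using only the standard re module. The sample names are made up, and the patterns are applied with search() directly rather than through fetch().

```python
import re

ends_with_foo = re.compile(r"foo$")    # in place of "%foo"
starts_with_foo = re.compile(r"^foo")  # in place of "foo%"
contains_foo = re.compile(r"foo")      # in place of "%foo%"

names = ["foobar", "barfoo", "xfoox", "bar"]

ending = [n for n in names if ends_with_foo.search(n)]
starting = [n for n in names if starts_with_foo.search(n)]
containing = [n for n in names if contains_foo.search(n)]

print(ending, starting, containing)
```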
The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.
* A string
* A string with SQL-style wildcards
* A compiled RE object
* A callable that returns None/false/empty string if the given value
doesn't match, and any other value otherwise.
Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.
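The tag.string shorthand survives in Beautiful Soup 4, so it can be sketched with the current API. The markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b></p><div>a<i>b</i></div>", "html.parser")

# <p> has a single child that owns a single string, so .string reaches it.
single = soup.p.string

# <div> has two children, so there is no unambiguous string to return.
ambiguous = soup.div.string

print(single, ambiguous)
```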
Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.
Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".
Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.