
HTML parser in Delphi XE

Scope
THTMLdom is a (Delphi) class with functions to read an HTML source file and dissect it into
a tree of THTMLelement objects. The attributes of the HTML tags are stored in the elements.
Functions are provided to select elements on the basis of attribute values or tag
names. The structure of the tree can be shown and it can be rendered as plain text.
The source is plain Delphi Pascal, requiring a version that supports Tdictionary. There is no
dependency on 3rd party units.
The file to be parsed must use valid HTML4/5 tags, but the HTML need not be well formed:
end tags may be wrongly placed or absent altogether. Processing (reading + parsing) is
fast: 15-40 ms per MB of source, or around 1 ms per 1000 HTML tags.

License
GNU General Public License version 3.0 (GPLv3)
Copyright by Tom de Neef, 2019 – tomdeneef@gmail.com

Use
Typical use of the class is as follows:
DOM:=THTMLDOM.Create(sourceFile); // DOM = THTMLDOM
DOM.assignReportList(memo.Lines); // for debugging
DOM.ParseHTMLstring;
memo.Lines.Add(DOM.showStructure);
memo.Lines.Add(DOM.text( {exclude} ['HEAD','FOOTER','STYLE','SCRIPT']));

elements:=THTMLelementList.Create; // elements = THTMLelementList
// select a group of elements
DOM.selectElements('tagname','TEXT',true,elements);
// select one element
element:=DOM.getElementById('OneStatTag'); // element = THTMLelement

Unit structure
The source is divided over four units. For demo purposes a form unit is added, and the
demo is bound together as the program HTMLparser. The units are organized as follows:

UHTMLreference contains the valid HTML tags and whether they have end tags and if so
how a missing end tag can be detected. It has a function to (speedily) translate a tag string
into a reference (like ‘BODY’ → trBODY). Tag references are used throughout in
preference to tag strings.

UHTMLchars is a unit to support textual output. It contains the function to render most
HTML special character codes (like ‘&eacute;’ → ‘é’). This unit requires Tdictionary. If
you have an older version of Delphi without Tdictionary, find another way to
translate the HTML codes.

UsourceParser contains the ‘tokenizer’ classes, used to split the source file into tokens (the
tags and text).

UHTMLparse is the unit with the classes to build the DOM tree.

The functionality of these units is described below.

Unit UHTMLreference
HTML (up to version 5) defines a limited number of tags for the markup. The naming of
them is case-insensitive: <b> and <B> describe the same instruction.
The unit declares an enumerated type (an ordered set of values), TtagReference, with
constants to reference the tags. The naming is intuitive: the tagReference for <B> is trB;
for <FIGURE> it is trFIGURE, etc.
Three tagReferences have no corresponding HTML tag: trUNPARSED (as a kind of NIL –
the tag is not yet determined), trTEXT to mark pieces of text and trUNKNOWN to indicate a
tag whose name is not defined in HTML.
Tag references are relevant for processing speed. Instead of comparing two tag names
with ‘if SameText(tag1,tag2)’ we can use ‘if tagReference1=tagReference2’.
The unit exposes three supporting functions:
1. function getTagReference(var tag : string) : TtagReference;
It will find the proper tagReference for a given tag name. There are two
implementations of this function, one of which is commented out. The active one
processes the tag name string character by character, through case statements.
The suppressed implementation uses a Tdictionary where all pairs
(tagname,tagReference) have been pre-stored. In practice there is very little
difference in speed between these two implementations.
2. function BodylessTag(tag : TtagReference) : boolean;
It returns true if the given tagReference is for an HTML tag that cannot have a
closing tag (such as <meta>), as opposed to tags that may have one (such
as <TD>.. </TD>, even though the closing </TD> can be omitted).
Note that a tag without body cannot have any children, which is relevant when
parsing.
3. function ImplicitCloseTag(currentTag,nextTag : TtagReference) : boolean;
It returns true if the sequence <current> <next> implies that an implicit </current>
should be assumed just before the <next> tag.
In HTML there are several situations where a closing tag can be omitted, notably
with <TD> and <LI>. In <TD>nr<TD>code, the second <TD> is a sibling of the first,
since the complete HTML would read <TD>nr</TD><TD>code. But in <TD>nr
<B>code the <B> will be a child of the <TD>. This function can be used to resolve
these situations.
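The three functions above can be illustrated with a minimal, self-contained sketch. It is not the unit's actual code: the enumeration is a small hypothetical subset, getTagReference takes a const parameter and uses plain string comparison rather than the character-by-character case statements, and the implicit-close rules shown are only a few illustrative ones.

```pascal
program TagReferenceSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical subset; the real TtagReference covers all HTML tags.
  TtagReference = (trUNPARSED, trTEXT, trUNKNOWN,
                   trB, trBR, trLI, trMETA, trTD, trTR);

// Map a tag name to its reference, ignoring case (simplified lookup).
function getTagReference(const tag: string): TtagReference;
var
  t: string;
begin
  t := UpperCase(tag);
  if t = 'B' then Result := trB
  else if t = 'BR' then Result := trBR
  else if t = 'LI' then Result := trLI
  else if t = 'META' then Result := trMETA
  else if t = 'TD' then Result := trTD
  else if t = 'TR' then Result := trTR
  else Result := trUNKNOWN;
end;

// True for tags that never have a closing tag and hence no children.
function BodylessTag(tag: TtagReference): Boolean;
begin
  Result := tag in [trBR, trMETA];
end;

// True if <nextTag> following an open <currentTag> implies </currentTag>.
// Assumed rule set, for illustration only.
function ImplicitCloseTag(currentTag, nextTag: TtagReference): Boolean;
begin
  case currentTag of
    trLI: Result := nextTag = trLI;
    trTD: Result := nextTag in [trTD, trTR];
    trTR: Result := nextTag = trTR;
  else
    Result := False;
  end;
end;

begin
  // Comparing references replaces comparing tag strings.
  WriteLn(getTagReference('td') = getTagReference('TD'));
  WriteLn(BodylessTag(getTagReference('meta')));
  WriteLn(ImplicitCloseTag(trTD, trTD));  // <TD>nr<TD>code: siblings
  WriteLn(ImplicitCloseTag(trTD, trB));   // <TD>nr<B>code: <B> is a child
end.
```

Once the tag name has been translated, every later comparison is a comparison of enumeration values, which is the speed argument made above.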

Unit UHTMLchars
The unit exposes three supporting functions:
1. function ASCIIchar(HTMLcode : string) : string;
It will translate the HTML codes for extended ASCII characters into ASCII.
Examples: ‘&eacute;’ → ‘é’, ‘&#144;’→ ‘É’.
When the code is unknown, the result will be the input string.
The function uses a local Tdictionary of (HTMLcode,ASCII index) pairs which is
filled during initialization.
2. function sanitizeHTMLtext(var txt : string) : string;
This function will translate HTMLcodes in the input string and reduce blank parts to
single space. Leading and trailing blanks will be removed. Non-breaking-space
(&nbsp;) will be preserved as a space, so that ‘ &nbsp; ‘ will result in three spaces.
3. function posOfSubstring(const subStr,str : string; fromPos : integer; checkCase :
boolean = False) : integer;
It is like StrUtils.PosEx. But when checkCase=FALSE it will compare the strings
independent of case (which is a lot slower than a case-sensitive comparison).
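As a rough illustration of the behaviour described for posOfSubstring, the sketch below wraps StrUtils.PosEx. The name posOfSubstringSketch is hypothetical and the upper-casing approach is an assumption; the unit's own implementation may work differently.

```pascal
program SubstringSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// PosEx variant with optional case-insensitive matching (illustrative only).
function posOfSubstringSketch(const subStr, str: string; fromPos: Integer;
  checkCase: Boolean = False): Integer;
begin
  if checkCase then
    Result := PosEx(subStr, str, fromPos)
  else
    // Upper-casing both strings creates copies, which is one reason
    // the case-insensitive path costs more.
    Result := PosEx(UpperCase(subStr), UpperCase(str), fromPos);
end;

begin
  WriteLn(posOfSubstringSketch('</body>', '<BODY>text</BODY>', 1));        // 11
  WriteLn(posOfSubstringSketch('</body>', '<BODY>text</BODY>', 1, True));  // 0
end.
```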

Unit UsourceParser
This unit was intended to be independent of the HTML context, but on second thought it
seemed easier to integrate the HTMLreference data.
Its purpose is to dissect the HTML source into four kinds of token:
• ttOpenTag – indicating start and end position of an HTML open tag (such as
<BODY>). The tagReference is determined as well.
• ttCloseTag – same, for closing tags.
• ttText - indicating start and end position of any (non-blank) text between tags.
• ttCommentTag – indicating an area in the source that starts with ‘<!--’ and ends with
‘-->’.
This is conveniently done by the Ttokenizer class. Its principal function is ‘advance’.
Calling advance will instruct the tokenizer to find the next token – which must be one of the
above – and return its data in the token record.

Details of Ttokenizer class.

Fields (public):

token : Rtoken
    A record storing the details of the current token.
source : string
    The HTML source. All access to the source is through the tokenizer.
currentPos : integer
    The location in source where the next character will be read.
sourceLength : integer
    length(source).
doNOTadvance : boolean
    Used as a global variable, indicating that the current token must be re-used.

Functions:

initialize
    Set currentPos and sourceLength.
locate(ch : char) : integer
    Find the first occurrence of ch, starting at currentPos.
tagString : string
    The full tag of the current token, i.e. all text between the opening < and the
    closing > when the token indicates an HTML tag. When the token indicates a piece
    of text, the function returns that text.
advance
    Fill the token record with details of the next token. That includes determining
    the tagReference.
assign(const aSource : string)
    Copy an HTML source for processing.
loadFromFile(filename : string)
    Read an HTML file for processing.
setEOF
    Used to indicate that further advance calls will not succeed.
ThereIsText(pStart, pEnd : integer) : boolean
    Test if source contains printable characters (ch>’ ‘) between the pStart and pEnd
    positions. This is to determine if a call to advance must return a ttText type
    token.

Unit UHTMLparse
This is the key unit for which the other units are supportive. Its purpose is to build a tree of
THTMLelement which mimics the DOM (Document Object Model). The root of the tree is a
dummy element. Normally it will have two children: the first child represents the
<!DOCTYPE> element and the second child holds the <HTML> element, from which all
other elements in the document spring (through <HEAD> and <BODY> elements).
The basis for the tree is thus the THTMLelement class. As specialized descendant, the
THTMLdom class will be the link to an application. It has functions to show and search the
tree.
Parsing an HTML document is a recursive process:
• obtain the next token from the tokenizer
• if it is an open tag, then create a new element as child of the current element and
continue with that child as basis; if it is a closing tag then leave the current element
and continue with its parent.
The difficulty is in handling improper HTML (where an unexpected close tag is
encountered) and recognizing when a close tag is implicit.
Handling improper HTML
There are many ways in which an HTML document can be invalid and THTMLdom has no
knowledge of them. But one situation has to be considered since it would corrupt the
parsing: an incorrectly placed close tag. We recognize two situations:
1. Mixing close tags up, like <B><I> text </B></I>. In such a case, the parser will
recognize that the child (<I>) of <B> has not been closed when it encounters </B>
and will introduce an implicit close for it. Next it will encounter a close (</I>) for
which there is no open element. It will be skipped.
2. Unmatched close tag. This is the situation of the above after </B> has been
processed.
Both cases are treated the same way: the chain of parents is checked to see if there is an
element with matching open tag. If there is, all children of that element will be closed. If
there is not, the close tag will be skipped.
As an example consider <TR><TD>text</TR>. The closing tag </TR> does not match the
current element <TD>. But it matches <TD>’s parent <TR>. Therefore the current
element (<TD>) will be closed and processing continues with the – as yet unhandled –
closing tag </TR> for the now current element <TR>.
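The parent-chain check described above can be sketched as follows. This is a simplified, iterative illustration and not the parser's code: TNode and handleCloseTag are hypothetical names, and the real parser reaches the same result by returning from its recursion with doNOTadvance set.

```pascal
program CloseTagSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical minimal node; the real THTMLelement carries much more.
  TNode = class
    parent: TNode;
    tag: string;
    constructor Create(aParent: TNode; const aTag: string);
  end;

constructor TNode.Create(aParent: TNode; const aTag: string);
begin
  inherited Create;
  parent := aParent;
  tag := aTag;
end;

// Return the element that is current after the close tag has been handled.
function handleCloseTag(current: TNode; const closeTag: string): TNode;
var
  node: TNode;
begin
  // Walk up the chain of parents looking for a matching open tag.
  node := current;
  while (node <> nil) and (node.tag <> closeTag) do
    node := node.parent;
  if node = nil then
    Result := current        // no match anywhere: skip the close tag
  else
    Result := node.parent;   // closes node and all its still-open children
end;

var
  root, tr, td, after: TNode;
begin
  root := TNode.Create(nil, 'ROOT');
  tr := TNode.Create(root, 'TR');
  td := TNode.Create(tr, 'TD');

  after := handleCloseTag(td, 'TR');  // <TR><TD>text</TR>
  WriteLn(after.tag);                 // ROOT: both TD and TR are closed

  after := handleCloseTag(td, 'I');   // stray </I> with no open <I>
  WriteLn(after.tag);                 // TD: the close tag is ignored

  td.Free;
  tr.Free;
  root.Free;
end.
```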
Recognizing implicit closing
Every time an open tag is offered by the tokenizer, the parser needs to check if this means
that a child has to be added to the current element or that the element needs to be closed
and a child shall be added to its parent. This latter situation happens often in lists and
tables. Consider <B>some <B> text </B> (although meaningless). The second <B> will be
a child of the first one. But in <LI>some <LI> text </LI>, the second <LI> will be a sibling of
the first one. And that will also be the case with <TD>some <TD> text.
All valid situations where an open tag implies the closing of child tags must be recognized.
This is done by means of the ImplicitCloseTag function in the UHTMLreference unit.
Text
HTML has no tag for text. Text is everything between tags. But the DOM does have
elements for text. These get trTEXT as tagReference and they are child of the element that
embeds them. Thus <P> some text <B> highlighted </B></P> will result in a sub tree with
structure <P><TEXT> some text </TEXT><B><TEXT> highlighted </TEXT></B></P>.
Note that the text is not copied to the elements. Elements know where the text starts and
ends in the source and when needed will retrieve it from the source. (See THTMLdom.)
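Storing positions instead of copies can be pictured with the small sketch below. The record RTextSpan and the function spanText are hypothetical stand-ins; only the field names bodyFirstPos and bodyLastPos are borrowed from THTMLelement.

```pascal
program TextSpanSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical stand-in for the position fields of a trTEXT element.
  RTextSpan = record
    bodyFirstPos: Integer;  // position of the first character of the text
    bodyLastPos: Integer;   // position of the last character of the text
  end;

// Copy the text out of the source only when it is actually needed.
function spanText(const source: string; const span: RTextSpan): string;
begin
  Result := Copy(source, span.bodyFirstPos,
                 span.bodyLastPos - span.bodyFirstPos + 1);
end;

var
  span: RTextSpan;
begin
  span.bodyFirstPos := 4;
  span.bodyLastPos := 12;
  WriteLn(spanText('<P>some text</P>', span));  // prints: some text
end.
```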

Description of THTMLelement:
Fields (public):

parent : THTMLelement
    The element’s parent. When parent=NIL, the element is the root of the tree.
tag : string
    HTML tag name in uppercase.
tagReference : TtagReference
    The tag reference; see the UHTMLreference unit.
children : THTMLelementList
    The children of this element, ordered.
attributes : Tstringlist
    The attributes of the tag. Attributes are distinguished by blanks between them.
    Enclosing quotes will be removed. <body background="marmer.jpg"
    style="font-family:'MS Sans Serif'" class="normal,GFS"> will have attributes
        background=marmer.jpg
        style=font-family:'MS Sans Serif'
        class=normal,GFS
TagFirstPos : integer
    The position in source of the character following ‘<’.
bodyFirstPos : integer
    The position in source of the character following ‘>’.
bodyLastPos : integer
    Meaningful only for trTEXT elements: the position in source of the last character
    of the text.
WhereInSource : Pchar
    Pointer to tokenizer.source[bodyFirstPos] from where text can be copied when
    required.

Functions:

create(aParent : THTMLelement)
    Create a new element and add it to aParent’s children (if aParent<>NIL).
free
    Free all children and the object itself.
delete
    Remove the element from its parent’s children list. Then free.
analyseTagAttributes(var source : string)
    Source must be a reference to the HTML source. From the information about the
    tag’s position (tagFirstPos and bodyFirstPos), the attributes list is derived.
processOpenTag(tokenizer : Ttokenizer) : boolean; (private)
    Handle the next token (of type ttOpenTag). If it detects the implicit closing of
    the current element, then do nothing, set the doNOTadvance parameter in the
    tokenizer and return FALSE. Otherwise create a new element and continue parsing
    from its base. (This is the recursion.)
processCloseTag(tokenizer : Ttokenizer) : boolean; (private)
    Handle the next token (of type ttCloseTag). Check that it belongs to the element.
    If not, check that it belongs to a parent. If so, set the doNOTadvance parameter
    in the tokenizer. If not, return FALSE.
getOwnText : string; (private)
    Return the sanitized text of this element. That is an empty string unless the
    tagReference is trTEXT (or trHR, or trBR). Sanitizing involves translating HTML
    codes into ASCII and removing double spacing.
parseHTMLstring(tokenizer : Ttokenizer)
    Repeatedly call for a new token. On the basis of its type, process an open or
    close tag, a text or a comment part. The cycle is broken when a close tag is
    encountered or if the element cannot have children (bodyless tag).
    An open tag may also force the (implicit) close of the element. This will be
    signalled by processOpenTag. In that case the procedure is left so that control
    passes back to the parent element. Since tokenizer.doNOTadvance will be TRUE,
    that parent element will progress with the same token.
    A close tag belonging to a parent element will also end the procedure. This
    situation will be signalled by processCloseTag.
    Text and comment elements are handled on the fly since their type will already
    have been established by the tokenizer and they cannot have any children anyway.
selectElements(attributeName,attributeValue : string; searchInValue : boolean;
var elements : THTMLelementList; unique : boolean = false)
    Return a list with the children that have the specified attribute.
    When searchInValue=FALSE the named attribute must have the attributeValue as
    specified, i.e. the attributes list contains an (attributeName=attributeValue)
    pair. When searchInValue=TRUE, it is sufficient that the value part contains
    attributeValue as a substring. But that substring must be a ‘complete’ attribute
    value (i.e. surrounded by delimiters). As an example: an element that has a
    string ‘class=Normal,NoBreaks’ in its attributes list will be selected on
    ‘NoBreaks’ when searchInValue=TRUE. It will not be selected on ‘Breaks’. Note
    that attribute names and values are case sensitive.
    To find all elements that have a value for the named attribute (whatever that
    value is), specify ‘*’ or avAnyValue for attributeValue.
    Specify unique=TRUE to stop the search as soon as an element is found.
selectElementsByTagReference(aTagReference : TtagReference;
var elements : THTMLelementList)
    Return a list with children that have the specified tagReference.
selectElementsByText(textpart : string; var elements : THTMLelementList;
checkCase : boolean)
    Return a list with children that have textpart as substring in their ownText.
    (Note that in ownText the HTML codes have been translated to ASCII, so that you
    can search for e.g. ‘oké’.) If checkCase=TRUE, the match will be case-sensitive.
    The elements list will not be cleared.
firstElementWithText(textpart : string; checkCase : boolean = false) : THTMLelement
    Return the first element that has textpart as substring in its ownText. (Note
    that in ownText the HTML codes have been translated to ASCII, so that you can
    search for e.g. ‘oké’.) If checkCase=TRUE, the match will be case-sensitive.
asText(exclTagReferences : TtagReferenceSet; sList : Tstringlist);
    Return the text of this element and its children, each text part as a string in
    the output list. All formatting apart from <BR> and <HR> will be skipped. HTML
    codes are translated to ASCII and double spaces are removed. trLI, trDD, trDT
    and trP elements receive a closing line break. trTEXT elements will copy the
    text from the HTML source.
    The output can be limited by excluding all elements with a tagReference listed
    in the exclTagReferences set.
    (This function is meant as support for the text function of THTMLdom, where the
    exclusion is by tag name rather than tagReference.)
showStructure(offset : string) : string
    Output all tags (with their attributes) in a structured way, i.e. a child tag is
    output with extra leading blanks (offset). All text is suppressed.

Properties:

ownText : string (read-only)
    See getOwnText.
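The ‘complete value’ rule that selectElements applies when searchInValue=TRUE can be sketched as below. This is an assumption-laden illustration: containsWholeValue is a hypothetical helper, and the delimiter set (blank, comma, semicolon) is a guess based on the ‘class=Normal,NoBreaks’ example, not taken from the unit.

```pascal
program WholeValueSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

const
  // Assumed delimiters between the parts of an attribute value.
  Delimiters: TSysCharSet = [' ', ',', ';'];

// True if sought occurs in attrValue bounded by delimiters or string ends.
// The comparison is case sensitive, as described for selectElements.
function containsWholeValue(const attrValue, sought: string): Boolean;
var
  p, last: Integer;
begin
  Result := False;
  if sought = '' then
    Exit;
  p := PosEx(sought, attrValue, 1);
  while p > 0 do
  begin
    last := p + Length(sought) - 1;
    if ((p = 1) or CharInSet(attrValue[p - 1], Delimiters)) and
       ((last = Length(attrValue)) or
        CharInSet(attrValue[last + 1], Delimiters)) then
    begin
      Result := True;
      Exit;
    end;
    p := PosEx(sought, attrValue, p + 1);  // try the next occurrence
  end;
end;

begin
  WriteLn(containsWholeValue('Normal,NoBreaks', 'NoBreaks'));  // selected
  WriteLn(containsWholeValue('Normal,NoBreaks', 'Breaks'));    // not selected
end.
```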

Description of THTMLdom:
THTMLdom is a descendant of THTMLelement. It has access to the source (via the
tokenizer object) and it has functions like the functions of THTMLelement but with focus on
the whole tree.
Fields (private):

Tokenizer : Ttokenizer
    It will be created during create. The tokenizer offers access to the HTML source.
attrCollected : boolean
    Set when collectTagAttributes has finished. This is required when selecting
    elements by attribute name/value.

Functions:

create(filename : string)
    Create a new (root of the) tree. Instantiate the tokenizer and let it read the
    HTML source.
free
    Free all children and the object itself, including the tokenizer.
assignReportList(aStrings : Tstrings)
    The unit has a global const debug = false; When set to TRUE, some debug
    information of the parsing process will be written to a global reportList of
    type Tstrings. This function establishes a reference for that list in an
    external unit. In this way we avoid the need to add that external unit to the
    uses clause.
parseHTMLstring
    Same as the inherited function but without the need to pass the tokenizer as
    argument.
collectTagAttributes
    Same as the inherited function analyseTagAttributes but without the need to pass
    the tokenizer as argument. The attrCollected indication is set when complete.
selectElements(attributeName,attributeValue : string; searchInValue : boolean;
var elements : THTMLelementList; unique : boolean = false)
    Same as the inherited function, with two additions:
    1) selectElements(‘tagname’,…) will be recognized as the corresponding
    selectElementsByTagReference(...), unique having no effect.
    2) the attributes are collected if this hasn’t been done yet.
text(excludeTags : array of string) : string
    Similar to the inherited function asText. Exclusion of elements is now by
    specifying tag names, such as s:=text(['footer','head']); The tag names are
    case insensitive.
showStructure : string
    Same as the inherited function but without the need to specify an indentation
    offset.
getElementById(anID : string) : THTMLelement
    JavaScript equivalent. The sought ID is case sensitive.
getEelementsByClassName(aClassName : string; var elements : THTMLelementList)
    JavaScript equivalent. The class name is case sensitive. The output list must
    have been created. It will be cleared as a first step.
getEelementsByTagName(aTagName : string; var elements : THTMLelementList)
    JavaScript equivalent. Tag names are not case sensitive. The output list must
    have been created. It will be cleared as a first step.
getElementsByText(textpart : string; var elements : THTMLelementList;
checkCase : boolean = false)
    JavaScript lookalike. Selection is as in the inherited function
    selectElementsByText. The elements list will be cleared as a first step.

Properties:

elementCount : integer; (read-only)
    Return the number of elements in the tree. No meaning prior to calling
    parseHTMLstring.
filesize : integer; (read-only)
    Return the size of the HTML source.
