Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Standard view
Full view
of .
Save to My Library
Look up keyword
Like this
0 of .
Results for:
No results containing your search query
P. 1


Ratings: (0)|Views: 8 |Likes:
Published by api-25957478

More info:

Published by: api-25957478 on Sep 16, 2009
Copyright:Attribution Non-commercial


Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less





XML DocumentParsing: Operationaland PerformanceCharacteristics
Tak Cheung Lamand Jianxun Jason Ding 
Cisco Systems
 Jyh-Charn Liu
Texas A&M University
Parsing is an expensive operation that can degrade XMLprocessing performance. A survey of four representative XMLparsing models—DOM, SAX, StAX, and VTD—reveals theirsuitability for different types of applications.
roadly used in database and networking applications, theExtensible Markup Language is the de facto standard for theinteroperable document format. As XML becomes widespread,it is critical for application developers to understand the opera-tional and performance characteristics of XML processing.As Figure 1 shows, XML processing occurs in four stages:
 parsing, access,modification,
. Although parsing is the most expensiveoperation,
there are no detailed studies that comparethe processing steps and associated overhead costs of different parsingmodels,tradeoffs in accessing and modifying parsed data, andXML-based applications’ access and modification requirements.Figure 1 also illustrates the three-step parsing process. The first two steps,
character conversion
lexical analysis,
are usually invariant among dif-ferent parsing models, while the third step,
syntactic analysis,
creates datarepresentations based on the parsing model used.To help developers make sensible choices for their target applications, wecompared the data representations of four representative parsing models:document object model (DOM; www.w3.org/DOM), simple API for XML(SAX; www.saxproject.org), streaming API for XML (StAX; http://jcp.org/ en/jsr/detail?id=173), and virtual token descriptor (VTD; http://vtd-xml.sourceforge.net). These data representations result in different operationaland performance characteristics.XML-based database and networking applications have unique require-ments with respect to access and modification of parsed data. Database
Published by the IEEE Computer Society
Authorized licensed use limited to: to IEEExplore provided by Virginia Tech Libraries. Downloaded on November 7, 2008 at 12:19 from IEEE Xplore. Restrictions apply.
September 2008
applications must be able to access and modify the doc-ument structure back and forth; the parsed documentresides in the database server to receive multiple incom-ing queries and update instructions. Networking appli-cations rely on one-pass access and modification duringparsing; they pass the unparsed document through thenode to match the parsed queries and update instruc-tions reside in the node.
An XML parser first groups a bit sequence into char-acters, then groups the characters into
, and finallyverifies the tokens and organizes them into certain datarepresentations for analysis at the access stage.
Character conversion
The first parsing step involves converting a bit sequencefrom an XML document to the character sets the hostprogramming language understands. For example,documents written in Western, Latin-style alphabets areusually created in UTF-8, while Java usually reads char-acters in UTF-16. In most cases, a UTF-8 character canbe converted to UTF-16 by simply padding 8-bit lead-ing zeros. For example, the parser converts <” “a” “>”from “3C 61 3E” to “003C 0061 003E” in hexadecimalrepresentation. It is possible to avoid such a characterconversion by composing the documents in UTF-16, butUTF-16 takes twice as much space as UTF-8, which hastradeoffs in storage and character scanning speed.
Lexical analysis
The second parsing step involves partitioning thecharacter stream into subsequences called tokens.Major tokens include a
start element 
and an
end element,
as Table 1 shows. A token can itself consistof multiple tokens. Each token is defined by a regularexpression in the World Wide Web Consortium (W3C)XML specifications, as shown in Table 2. For exam-ple, a start element consists of a “<”, followed by anelement name, zero or more attributes preceded by aspace-like character, and a “>”. Each attribute consistsof an attribute name, followed by an “=” enclosed by a
Table 1. XML token examples.
Start element
John</Record>End element <Record>John
Text <Record>
</Record>Start element name <
private = “yes”>Attribute name <Record
= “yes”>Attribute value <Record private =
b2b1b1b1b1b1b1b1ca a a a a a a ac c cccb2b2b1a$b1a$a$b2
Final stateStart stateSpaceElementnameStartelementfoundSpaceSpaceSpaceSpaceSpace or charCharCharCharCharElementnameEndelementfoundTextfoundTextEOF<<>> / 
Character sequence
(for example, 003C 0061 003E =‘<’ ‘a’ ‘>’)
Token sequence
(for example,‘<a>’ ‘x’ ‘</a>’)
Data representation (parsing model dependent)
(for example, tree, events, integer arrays)
Bit sequence
(for example,3C 61 3E)Characterconversion(for example, pad zeros)Invariant amongdifferent parsing modelsVariant amongdifferent parsing modelsSemanticanalysisInput XMLdocumentParsingAccessModificationSerialization(Performance bottleneck)(Performance affected by parsing models)Output XMLdocumentSyntacticanalysis(PDA)Lexicalanalysis(FSM)Managed by application(access, modification, and so on)API
Figure 1. XML processing stages and parsing steps. The three-step parsing process is the most expensive operation in XML processing.
Authorized licensed use limited to: to IEEExplore provided by Virginia Tech Libraries. Downloaded on November 7, 2008 at 12:19 from IEEE Xplore. Restrictions apply.
zero or one space-like character on each side, and thenan attribute value.A finite-state machine (FSM) processes the characterstream to match the regular expressions. The simplifiedFSM in Figure 1 processes the start element, text, andthe end element only, without processing attributes. Toachieve full tokenization, an FSM must evaluate manyconditions that occur at every character. Depending onthe nature of these conditions and the frequency withwhich they occur, this can result in a less predictable flowof instructions and thus potentially low performanceon a general-purpose processor. Proposed tokenizationimprovements include assigning priority to transitionrules,
changing instruction sets for “<” and “>”,
andduplicating the FSM for parallel processing.
Syntactic analysis
The third parsing step involves verifying the tokens’well-formedness, mainly by ensuring that they haveproperly nested tags. The
 pushdown automaton
(PDA)in Figure 1 verifies the nested structure using the follow-ing transition rules:The PDA initially pushes a “$” symbol to the stack.If it finds a start element, the PDA pushes it to thestack.If it finds an end element, the PDA checks whetherit is equal to the top of the stack.If yes, the PDA pops the element from the stack.If the top element is “$”, then the document is“well-formed.” Done!Otherwise, the PDA continues to read the nextelement.If no, the document is not “well-formed.” Done!In the complete well-formedness check, the PDA mustverify more constraints—for example, attribute names1.2.3.of the same element cannot repeat. If schema validationis required, a more sophisticated PDA checks extra con-straints such as specific element names, the number of child elements, and the data type of attribute values.In accordance with the parsing model, the PDA orga-nizes tokens into data representations for subsequentprocessing. For example, it can produce a tree objectusing the following variation of transition rule 2:If it finds a start element, the PDA checks the top elementbefore pushing it to the stack.If the top element is “$”, then this start element isthe root.Otherwise, this start element becomes the top ele-ment’s child.After syntactic analysis, the data representations areavailable for access or modification by the applicationvia various APIs provided by different parsing models,including DOM, SAX, StAX, and VTD.
XML parsers use different models to create data rep-resentations. DOM creates a tree object, VTD createsinteger arrays, and SAX and StAX create a sequence of events. Both DOM and VTD maintain long-lived struc-tural data for sophisticated operations in the access andmodification stages, while SAX and StAX do not. DOMas well as SAX and StAX create objects for their datarepresentations, while VTD eliminates the object-cre-ation overhead via integer arrays.DOM and VTD maintain different types of long-livedstructural data. DOM produces many node objects tobuild the tree object. Each node object stores the elementname, attributes, namespaces, and pointers to indicatethe parent-child-sibling relationship. For example, in Fig-ure 2 the node object stores the element name of Phoneas well as the pointers to its parent (Home), child (1234),and next sibling (Address). In contrast, VTD creates noobject but stores the original document and producesarrays of 64-bit integers called
VTD records
(VRs) and
location caches
(LCs). VRs store token positions in theoriginal document, while LCs store the parent-child-sib-ling relationship among tokens.While DOM produces many node objects that includepointers to indicate the parent-child-sibling relationship,SAX and StAX associate different objects with differ-ent events and do not maintain the structures amongobjects. For example, the start element event is associ-ated with three String objects and an Attributes objectfor the namespace uniform resource identifier (URI),local name, qualified name, and attribute list. The endelement event is similar to the start element event with-out an attribute list. The character event is associatedwith an array of characters and two integers to denote
Table 2. Regular expressions of XML tokens.
Start element ‘<’ Name (S Attribute)* S? ‘<’End element ‘</’ Name (S Attribute)* S? ‘<’Attribute Name Eq AttValueS (0
Space-like characters 
Eq S? ‘=’ S?
Equal-like character
Name Some other regular expressionsAttValue Some other regular expressions
* = 0 or more; ? = 0 or 1; + = 1 or more.
Authorized licensed use limited to: to IEEExplore provided by Virginia Tech Libraries. Downloaded on November 7, 2008 at 12:19 from IEEE Xplore. Restrictions apply.

You're Reading a Free Preview

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->