Honours Year Project Report
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
By
Lee Chee How
Department of Computer Science
School of Computing
National University of Singapore
2003 / 2004
Project: H79010
Advisor: Assistant Professor Kan Min-Yen
Deliverables:
Report: 1 Volume
Program: 1 CD
Abstract
Web documents that look similar often use different HTML tags to achieve their layout
effect. These tags make it difficult for a machine to find text or images of interest.
PARCELS is a Java backend system designed to classify different components of a web
page by obtaining the logical structure from its layout and stylistic characteristics. Each
component is given a PARCELS label which identifies its purpose, like Main Content,
Links, etc. The system achieves a recall and precision of 71.9%. This backend system is currently
released under the GPL.
Subject Descriptors:
I.5.2 Style guides
I.5.4 Text Processing
I.7.2 Format and notation
I.7.5 Document Analysis
Keywords:
Text Processing, Logical structure of web pages, Segmentation of web pages,
Classification of web page components / segments
Implementation Software and Hardware:
Intel Pentium Centrino processor, 512 MB DDR RAM, Windows / UNIX / Linux,
Java SDK 1.4.2
Acknowledgement
Many people helped me accomplish the work for this thesis. I would like
to thank my partner, Sandra, who worked alongside me in formulating the
PARCELS toolkits. She is in charge of the textual engine.
I would also like to thank the people who release their tools and code under the GNU
General Public License (GPL) or as Open Source. They have made our work on PARCELS
easier by sparing us from re-inventing the wheel. In return, PARCELS is released under
the GPL as well.
Over the past year, there is one person to whom I owe the greatest thanks.
Assistant Professor Kan Min-Yen truly took me under his wing and guided me in
countless ways. I would like to thank him for all the support and patience he has shown
me over the past year.
List of Figures
Figure 1: VIPS Representation
Figure 2: Cues for segmentation of web pages
Figure 3: Patterns for Visual Separator Detection
Figure 4: Observations: Related Work vs. News Articles
Figure 5: 9 Regions on Screen
Figure 6: National Geographic News Site
Figure 7: Partial DOM Tree Representation
Figure 8: Resultant DOM Tree Representation
Figure 9: Stylistic Observations on Figure 8
Figure 10: PARCELS labels and purposes
Figure 11: PARCELS labels’ Stylistic Features
Figure 12: Breakdown of features in Feature Vector
Figure 13: Co-Training Flowchart
Figure 14: Example of Co-Training Classification
Figure 15: Preliminary Investigation of Results
Figure 16: Distribution of HTML Structure Tags
Table of Contents
Abstract
Acknowledgement
List of Figures
Table of Contents
1. Motivations
    1.1 Applications of PARCELS
2. Related Works
3. Background Analysis
    3.1 Analysis of News sites
        3.1.1 Structural Properties
        3.1.2 Stylistic Properties
4. Analysis of HTML Structures
    4.1 Body
    4.2 Paragraph
    4.3 Table Structure
        4.3.1 Flow of Cells
        4.3.2 Grouping of Cells by Value
        4.3.3 Ordering of Tables by Depth
        4.3.4 Position of Cells
    4.4 Division / Span Structure
        4.4.1 Position of Division / Span
    4.5 Layers Structure
        4.5.1 Priority of Layers
    4.6 Other Structures
5. Formulation of the PARCELS toolkits
    5.1 DOM Tree Parser
        5.1.1 DOM Tree
        5.1.2 Parser
    5.2 PARCELS Tags for News Domain
6. Machine Learning and Co-Training
    6.1 Support Vector Machine
        6.1.1 PARCELS Labels vs. Stylistic Features
        6.1.2 Formulation of Feature Vector
    6.2 Boostexter
    6.3 Co-Training
        6.3.1 Stylistic and Textual Engine
7. Analysis of Results
    7.1 Preliminary Investigation of Results
    7.2 Detailed Investigation of Results
8. Conclusion
    8.1 Future Work
References
Appendix A – Distribution of HTML Structural Tags
Appendix B – Listing of Features in Feature Vector
1. Motivations
Although the World Wide Web (WWW) is defined as “the universe of network-accessible
information, the embodiment of human knowledge” (DiPasquo, 1998) by the World
Wide Web Consortium (W3C), it currently does not live up to this definition.
The number of web pages is increasing at an exponential rate, and countless projects,
e.g. the Semantic Web, have targeted ways of using machines to harness this knowledge.
However, due to the semi-structured nature of raw HTML pages, accessing relevant
information is by no means an easy task. Pages that look similar often use different
HTML tags to achieve their layout effects and this makes it difficult for a machine to find
relevant fields of interest.
There are existing projects which tackle part of the problem, e.g. (DiPasquo, 1998), (Cai,
Yu, Wen and Ma, 2003), (Yu, Cai, Wen and Ma, 2003) and (Gupta, Kaiser, Neistadt and
Grimm, 2003). However, none of these solutions addresses the problem fully.
PARCELS is a backend system designed to address this problem. Coded in Java,
PARCELS is executable on major platforms like Windows, UNIX and Linux, and it can be
applied to web pages of any nature.
Essentially, PARCELS consists of 2 engines: one for the textual detection of web pages
(Lai, 2004), the other for the stylistic detection of web pages. In this report, we
discuss the stylistic detection engine.
1.1 Applications of PARCELS

PARCELS can be applied to many real life applications today:
• Efficient web page reading
With PARCELS, we can extract the title and/or main body of any web page. This
greatly improves the efficiency of retrieving the relevant information on the WWW.
• Reducing size for Data Archiving
Given the number of web pages today, archiving these pages will require enormous
amounts of disk space. With PARCELS, we can remove items of lower priority
like site links, advertisements and images of no relevance, thus reducing the size
needed.
• Improving results for Web Search / Mining
With information on the layout of the pages, we can target the search and mining on
the relevant components. This greatly improves the accuracy and efficiency of the
search.
• Classification of web page clusters through web design
We can identify web pages of the same nature using the logical structure of the web
pages. With this classification, we can proceed with other tasks like archiving the
documents in the correct domain.
As we can see, PARCELS improves the effectiveness and efficiency of
everyday tasks, like reading a News article without all the advertisements. This allows
better access to “the universe of network-accessible information, the embodiment of
human knowledge” mentioned earlier.
In the next few chapters, we analyze prominent web pages and their HTML
structures. Following that, we describe how we formulate the PARCELS toolkits,
which form our engine. Lastly, we describe our Machine Learning techniques and
analyze the results.
2. Related Works
Our research is closely related to several areas which have been heavily worked on. In
this section, we will look at some of the more prominent research on text processing and
information retrieval on the Web.
Firstly, we must reaffirm our basis that there is information in the layout and stylistic
properties of web pages:
• In (DiPasquo, 1998), the thesis argues that there is information in the layout of a
web page, and that by looking at the HTML formatting in addition to the text on a
page, one can improve performance in tasks such as learning to classify segments of
documents.
The focus of the thesis was to extract the names of companies from the web pages
and the locations where they have operations. For companies’ names, the thesis
reported a precision of 66.0% and recall of 27.1%. For companies’ locations, the
thesis reported a precision of 71.6% and recall of 64.4%.
The thesis reports that these results are better than those obtained without the HTML
Struct Tree. Thus, layout improves the classification. As PARCELS is divided into textual
and stylistic engines, we reaffirm the belief that the stylistic properties are useful in
classifying segments of web pages.
Next, we noted from related works that most papers modified the representation of the
DOM Tree:
• In (DiPasquo, 1998), they modified the DOM Tree and called it the HTML Struct
Tree. The HTML Struct Tree is intended to represent how a human would parse
word relationships on a web page due to its layout. As web pages are meant to be
read by humans, if parsed correctly, the relationship between objects on the page
can be captured.
Using the HTML Struct Tree, a Machine Learner was applied. As mentioned
earlier, the result is better than those not using the HTML Struct Tree.
Instead of changing the structure, we use the standard DOM Tree, which makes our code
easier to port to other uses. The necessary modifications are made in our parser
instead.
• In (Yu et al, 2003), the paper describes a new method to detect the semantic content
of a web page. Instead of a DOM Tree method, their new method uses the VIsion-based
Page Segmentation (VIPS) algorithm.
They proposed that by using local feedback to add keywords from top ranking
documents, they are able to improve the precision and recall of a second-round
search. However, they faced one difficulty: accurate feedback is impossible
due to noise such as navigation, decoration and interaction elements.
Thus, VIPS is used to eliminate such noise by using a tree to represent the flow of
text. From the tree, it is observed that topics of similar content are grouped together.
VIPS uses a series of heuristics and detection methods to achieve this.
The main drawback of this paper is that it handles only web pages with a linear style (no
tables or layers). The use of feedback described in this paper is promising, but it is not
included in this version of PARCELS. In this release, we focus on the main framework.
• In (Cai et al, 2003), the paper proposes a new web content structure based on visual
representation of web pages. The new structure presents an automatic top-down,
tag-tree independent approach to detect web content structure. It simulates how a
user understands web layout structure based on his visual perception.
Figure 1: VIPS Representation
Compared with other existing techniques, the new approach is independent of the
underlying document representation such as HTML and works well even when
the HTML structure is far different from the layout structure.
This paper also uses the VIPS algorithm described previously. The main difference
is that it handles other Control Structures like tables (Figure 1). We still
believe that using a DOM-based tree with an improved DOM Tree parser to handle tables,
layers and frames will be more effective, while preserving the generality of the web
page. We compare the two approaches in Chapter 7 (Analysis of Results).
Lastly, we would need to observe their heuristics on stylistic detection:
• In (Yu et al, 2003) and (Cai et al, 2003), the heuristics used to split contents into
blocks are as follows:
Cue         Purpose
Tag Cue     HTML tags like <HR> are used to split the web page into blocks of text.
Color Cue   A difference in the background colors of 2 blocks of text means they
            belong to 2 different blocks.
Text Cue    If one block contains HTML tags and the other does not, it means they
            belong to 2 different blocks.
Size Cue    A difference in the font sizes of 2 blocks of text means they belong to
            2 different blocks.
Figure 2: Cues for segmentation of web pages
Visual Separator Detection used:
Pattern     Purpose
Distance    Distance between blocks on both sides of separator
Tag         Position of tags like <HR>
Font        Difference in formatting on both sides of separator
Color       Difference in background color on both sides of separator
Figure 3: Patterns for Visual Separator Detection
The heuristics and detection methods used in VIPS are very useful and are included in
PARCELS. We further extended the set of heuristics to suit our purposes. The detailed
set can be found in Chapter 5 (Formulation of PARCELS toolkits).
As we can see, most of the related works focus on how humans interpret the web pages
based on the layout and stylistic properties. This is our main focus for the stylistic engine
as well. Thus, we shall proceed with our analysis on the layout and stylistic properties of
real web pages.
3. Background Analysis
Based on the argument that there is information inherent in the layout of each page, and
that parsing the HTML formatting appropriately can improve traditional text processing
(DiPasquo, 1998), we decided to carry out an analysis on News web pages.
News articles are usually reader friendly but not machine friendly. Furthermore, articles
in the News domain are feature rich with complex layout structures. Articles are also
updated daily and read by millions of people. With the ability to parse news articles, we
would be able to parse pages of other domains with relative ease. Thus, it makes sense to
use this domain as a starting point for our analysis.
3.1 Analysis of News sites

To facilitate our analysis, a total of 12 popular news sites were selected. The News sites
were analyzed for their structural and stylistic properties.
3.1.1 Structural Properties
A study was made to find out the dominant layout structure of these sites. Figure 16 in
Appendix A gives the breakdown of the statistics of <table>, <div>, <span> and <layer>
tags used.
A related study was also made of News articles from the year 2000. We contrast
these 2 analyses to pinpoint the trend in the usage of layout structural tags.
• Structural tag <table> is most widely used
In 2004, we found that 10 out of the 12 News sites used tables as their
dominant layout structure. In 2000, all the News sites used tables for their
main layout. <table> tags are thus widely used and will likely continue to be
so (for compatibility with existing systems).
• Structural tags <div> and <span> are gaining popularity
2 of the News sites have converted to using <div> tags for their main layout. The
conversion is likely motivated by speed (readers do not need to wait for pages to load)
and easier maintenance of web pages.
For the rendering position of <div> and <span> tags on the screen, we would have to
look into the Cascading Style Sheet (CSS) files to obtain the layout properties.
• Structural tag <layer> is not used
This tag is obsolete due to the increased usage of CSS with the z-index property.
Thus, we will be focusing on <table>, <div> and <span> structural tags for the
development of PARCELS.
3.1.2 Stylistic Properties
Having determined the dominant structural tags, we proceed with the analysis on the
stylistic properties of the pages.
In (Ivory and Hearst, 2002), (Ivory, 2003), (Yu et al, 2003) and (Cai et al, 2003), we
noted some of the stylistic aspects which will be useful to us. Next, we applied our
knowledge of these stylistic characteristics to News articles today.
Related Work                             Our Observations
Number of links for different segments   Sitemaps, navigation links and related articles’ links are
                                         normally grouped together. Thus, the number of links for that
                                         segment will be significantly higher.
Style of links for different segments    The style of links in the navigation bar, related articles and
                                         advertisements is different from that of the links in the main
                                         content and supporting content.
Color of links for different segments    The color of links in the navigation bar is sometimes different
                                         from that of the links in the main content and supporting
                                         content.
Number of images for different segments  Sitemaps and navigation links are more likely to have
                                         more images than other segments. Images located in between
                                         segments with high density text will likely be images
                                         supporting the content.
Size of images                           This is sometimes useful in determining whether the image is
                                         an advertisement, as advertisements usually come in fixed sizes.
Font styles for different segments       Headers, captions and newsletters are in different font styles
                                         from the main content. As such, we will be using these styles
                                         to aid in our classification.
Font sizes for different segments        Headers, main content, supporting content and captions all have
                                         different font sizes. Font size will be a good indicator for some
                                         of our component classification.
Figure 4: Observations: Related Work vs. News Articles
These characteristics of News sites will act as the foundation for our learning process in
Chapter 6 (Machine Learning and Co-training).
4. Analysis of HTML Structures

In this chapter, we analyze the HTML Control Structures that are used in
PARCELS. By Control Structure, we refer to HTML tags that
affect the layout position of the contents of a page.
4.1 Body
The <body> tag is the trivial case in HTML Structure. All HTML documents must contain
a body tag. This is where the layout of a web page begins.
4.2 Paragraph
The next fundamental tag is the Paragraph <p> tag. Its usage is
straightforward: PARCELS uses these tags to specify the basic unit for a block of text.
4.3 Table Structure

From the analysis we did in Chapter 3, tables are the most prominent structure on the
Web now. Thus, we will place more emphasis on the analysis of table structure. There are
a few papers dedicated to the study of tables. These papers aided us in our analysis of
tables, e.g. (Hurst and Douglas, 1997), (Hurst, 1999) and (Cohen, Hurst and Jenson,
2002).
4.3.1 Flow of Cells
The flow of cells is known as the functional view of a table. Functional view is a
description of how the information presentation aspects of tables embody a decision
structure or reading path. This determines the order of the tables (Hurst et al, 1997).
The flow of cells is important because webmasters tend to follow a usual layout when
designing their websites. If we follow that flow of layout, it is significantly easier to
classify the segments correctly.
4.3.2 Grouping of Cells by Value
Grouping of cells by value will aid in understanding how the webmasters intend the
tables to be read (Hurst, 1999). The value is calculated using the type of data located in
the cell and the word density of that particular cell.
Cells of the same value are often closely related to each other. Furthermore, grouping the
cells by value will enable us to understand the contents better. This will ensure a better
classification of that particular segment.
4.3.3 Ordering of Tables by Depth
Nested tables are very common in web design today. Thus, it is important to find the
relation between nested tables and how each cell affects another cell. First and foremost,
we need to order the cells by their depth.
In the News sites we analyzed, the number of nested tables differs from cell to cell.
Based on this, we observed that it is possible to classify some of the segments using this
characteristic.
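The ordering by depth can be illustrated with a minimal Java sketch. It is not taken from the PARCELS source: the class name TableDepth and the method cellDepths are hypothetical, the input is assumed to be well-formed markup (for instance, after cleaning with Tidy), and depth is assumed to mean the number of enclosing <table> elements. It is written for a current JDK rather than SDK 1.4.2.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class TableDepth {
    /** Returns the table-nesting depth of every <td>, in document order. */
    static List<Integer> cellDepths(String xhtml) {
        List<Integer> depths = new ArrayList<>();
        walk(parseRoot(xhtml), 0, depths);
        return depths;
    }

    private static void walk(Node node, int tableDepth, List<Integer> depths) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // text and comments carry no table structure
        }
        String name = node.getNodeName().toLowerCase(Locale.ROOT);
        if (name.equals("table")) {
            tableDepth++; // one more level of nesting for everything below
        } else if (name.equals("td")) {
            depths.add(tableDepth);
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), tableDepth, depths);
        }
    }

    // Assumption: the markup is already well-formed, so the JDK XML parser can read it.
    private static Node parseRoot(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<body><table><tr><td>A<table><tr><td>B</td></tr></table></td>"
                + "<td>C</td></tr></table></body>";
        System.out.println(cellDepths(page)); // prints [1, 2, 1]
    }
}
```

In this example, the cell containing "B" sits inside a nested table and therefore gets depth 2, while its neighbours stay at depth 1.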
4.3.4 Position of Cells
The position of cells is important for information extraction. Table-based extraction is
done by tagging each cell with a visual position tag (Cohen et al, 2002).
With the visual position tag, we will be able to know the position of this cell with respect
to any nested tables. From this, we can gather the following information.
• Cut-in header: Cells related to a particular header cell in a table.
• Column header: The overall column header of the table (if any).
• Row header: The overall row header of the table (if any).
Furthermore, webmasters tend to follow certain layout formats when designing a
web site. If they are using a Content Management System (CMS), these layouts will be
even more similar. A CMS is a set of scripts used by webmasters to generate the layout
and content of the site.
Thus, using the position of cells will be useful in our classification.
4.4 Division / Span Structure

Division <div> and Span <span> are two tags that have been gaining popularity in recent
times. The Division tag is a Control Structure tag because it creates a Break <br> effect.
The Span tag may or may not be a Control Structure tag, depending on whether it has a
position allocated to it. We can assign positional values to both Division and Span in CSS.
4.4.1 Position of Division / Span
The position of a block of text is a major factor contributing to the structure of a web
site, so it is useful in our classification. If a positional value is
allocated, we need to estimate the block's position on the screen. For this, we split the
screen into 9 regions.
Region 1 Region 2 Region 3
Figure 5: 9 Regions on Screen
We first locate the maximum position on the X and Y axes. The block of text at that
position falls in region 9. The remaining regions are obtained by subdividing the screen
according to this maximum position, and the remaining blocks of text are placed into
these regions according to their X-Y coordinates.
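The region assignment can be sketched as follows. This is only an illustration under stated assumptions, not the PARCELS implementation: the class RegionMapper and its method regionOf are hypothetical names, the 9 regions are assumed to form a 3 x 3 grid numbered row by row from the top left, and each axis is assumed to be cut into equal thirds of the maximum position.

```java
// Hypothetical helper, for illustration only.
final class RegionMapper {
    /**
     * Returns the region (1..9) of a block positioned at (x, y).
     * Assumptions: 3 x 3 grid, numbered row by row from the top left;
     * maxX and maxY are the largest coordinates found on the page.
     */
    static int regionOf(int x, int y, int maxX, int maxY) {
        if (maxX <= 0 || maxY <= 0) {
            throw new IllegalArgumentException("maximum positions must be positive");
        }
        int column = band(x, maxX);
        int row = band(y, maxY);
        return row * 3 + column + 1;
    }

    // Maps a coordinate to band 0, 1 or 2; the maximum itself falls in band 2.
    private static int band(int value, int max) {
        int clamped = Math.max(0, Math.min(value, max));
        return Math.min(2, clamped * 3 / max);
    }

    public static void main(String[] args) {
        System.out.println(regionOf(0, 0, 900, 600));     // prints 1
        System.out.println(regionOf(900, 600, 900, 600)); // prints 9
    }
}
```

With these assumptions, the block at the maximum X and Y position always lands in region 9, as described above.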
4.5 Layers Structure

Due to advancement in HTML and CSS specifications, the functionalities of Layers can
be emulated in Division and Span (z-index in CSS). Thus, Layers will be handled in the
same way as Division and Span.
4.5.1 Priority of Layers
The priority of a particular layer depends on the value of its z-index. A layer with a
higher z-index appears on top of layers with lower values. For example, if layer A has a
z-index of 0, layer B has a z-index of 10 and layers A and B overlap each other, then
layer B will appear on top of layer A.
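The example with layers A and B can be expressed as a short Java sketch. The class LayerOrder and its nested Layer type are hypothetical names used only for illustration, and layers are assumed to carry an explicit integer z-index. The code targets a current JDK rather than SDK 1.4.2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only.
final class LayerOrder {
    static final class Layer {
        final String name;
        final int zIndex;

        Layer(String name, int zIndex) {
            this.name = name;
            this.zIndex = zIndex;
        }
    }

    /** Returns the layer names from bottom to top; the last one is drawn on top. */
    static List<String> bottomToTop(List<Layer> layers) {
        List<Layer> sorted = new ArrayList<>(layers);
        // The sort is stable, so layers with equal z-index keep their original order.
        sorted.sort(Comparator.comparingInt((Layer layer) -> layer.zIndex));
        List<String> names = new ArrayList<>();
        for (Layer layer : sorted) {
            names.add(layer.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Layer> layers = Arrays.asList(new Layer("B", 10), new Layer("A", 0));
        System.out.println(bottomToTop(layers)); // prints [A, B]
    }
}
```

Layer B is last in the result, which corresponds to it appearing on top of layer A.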
In terms of classification, the position of each layer is important, as mentioned in
Section 4.4.1. Furthermore, z-index will also be useful as segments like main contents
will likely be on top of other segments like background images.
4.6 Other Structures

We recognize that there are other structures on the Web today, like Frames and Cross
Documents structure. However, PARCELS focuses on single documents in this release,
so these structures are left for future releases.
5. Formulation of the PARCELS toolkits

After our analysis of web pages and HTML structures, we now formulate the
basic toolkits for PARCELS. We focus on the stylistic engine in this chapter.
5.1 DOM Tree Parser

A DOM Tree parser simulates how humans parse words on a web page based on its layout.
Since a web page is designed to be read by humans, if we are able to parse it properly, we
will be able to capture the right relationship between segments.
This relationship is what we define as the “Logical Structure”.
5.1.1 DOM Tree
To work with the styles of web pages, we need their DOM Trees. A DOM Tree is
an n-ary tree in which each node corresponds to an HTML tag on the page. This DOM
Tree is created using the default Node class in Java.
Before we create the DOM Tree, we need to ensure the HTML code is sound. Due to
the complexity of web pages, there will likely be errors. Thus, a program called Tidy is
used to correct such errors.
Figure 6 shows a web page as viewed in Internet Explorer 6. After parsing it through
Tidy, a DOM Tree is created. Figure 7 shows the partial DOM Tree that is created.
• The parent-child relationship in the tree implies that the child is in the scope of the
parent. Each parent-child relationship loosely means they are directly related.
• The ordering of nodes in the DOM Tree is preserved in that children are ordered
from left to right in the order they appear on that page.
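The two properties above can be observed with a minimal Java sketch that builds a DOM Tree and prints it with one level of indentation per parent-child step. It is an illustration under assumptions, not the PARCELS parser: the class DomTreeDump is a hypothetical name, and the standard JDK XML parser stands in for Tidy, so the input must already be well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class DomTreeDump {
    /** Renders the DOM Tree of well-formed markup, one node per line. */
    static String render(String xhtml) {
        StringBuilder out = new StringBuilder();
        dump(parseRoot(xhtml), 0, out);
        return out.toString();
    }

    private static void dump(Node node, int depth, StringBuilder out) {
        String label;
        if (node.getNodeType() == Node.TEXT_NODE) {
            if (node.getNodeValue().trim().isEmpty()) {
                return; // skip whitespace-only text between tags
            }
            label = "Text";
        } else {
            label = node.getNodeName();
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(label).append('\n');
        // Children are visited in document order, i.e. left to right.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            dump(children.item(i), depth + 1, out);
        }
    }

    // Assumption: the markup is already well-formed (no Tidy step is performed here).
    private static Node parseRoot(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render("<html><head><title>News</title></head><body><table><tbody>"
                + "<tr><td>A</td><td>B</td></tr></tbody></table></body></html>"));
    }
}
```

The printed tree lists head before body and the first td before the second, mirroring the order in which they appear on the page.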
Figure 6: National Geographic News Site
Document
head body
title meta … table table …
Text tbody …
tr
td td
Text Text
Figure 7: Partial DOM Tree Representation
The next section describes how our parser parses the relevant information out of this
DOM Tree.
5.1.2 Parser
This section focuses on the two major functions of the DOM Tree parser.
• Splitting of Text from DOM Tree
The parser splits the web page into blocks of text (the basic unit). Each block of text
is split in a human-readable way, rather than according to how it is represented in the
DOM Tree. For example, given the following block of HTML code:

<body><b>Tom Mitchell’s <i>Machine Learning</i> Website</b></body>
The resultant DOM Tree is shown in Figure 8.
body
  b
    Tom Mitchell’s
    i
      Machine Learning
    Website
Figure 8: Resultant DOM Tree Representation
If we were to split it the way the DOM Tree represents it, we would get 3 blocks of
text: “Tom Mitchell’s”, “Machine Learning” and “Website”.
However, that is not how humans interpret it. So our algorithm links up all this
text under the umbrella of a Control Structure. A Control Structure can be a
paragraph <p>, a table cell <td>, a division <div>, a span <span>, a layer <layer> or a
body <body>.
Thus, the block of text will be “Tom Mitchell’s Machine Learning Website”. This
block of text will then be used in the Text Detection Engine of PARCELS (Lai,
2004).
• Parsing of Styles from DOM Tree
For each block of text, the parser extracts its stylistic properties (as mentioned in
Section 3.1.2). For example, given the block of text and DOM Tree in Figure 8, we
obtain the following characteristics:
Stylistic Properties (in each segment)               Our Observations
All the Control Structures containing this segment   • Default Body Structure found
                                                     • No other Control Structure found
Position of each Control Structure                   • Position of Body Structure is trivial
Number and type of Link elements                     • No links were found
Colors of Link formatting                            • No links were found
Number of Graphics elements                          • No graphical elements found
Size of Images (if available)                        • No graphical elements found
Total Word Count                                     • Total word count is 5
Font styles, sizes and colors                        • 100% of the text is bold
                                                     • 40% of the text is in italics
                                                     • 100% of the text is of the same size
                                                     • 100% of the text is of the same color
Figure 9: Stylistic Observations on Figure 8
This is a simple demonstration of what the parser obtains together with the
block of text. This data is stored in a Vector for the Machine Learning described
in the next chapter.
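The two parser functions can be sketched together as follows. This is only an illustration under stated assumptions: PARCELS is implemented in Java on top of a Tidy-cleaned DOM Tree, whereas the sketch uses Python's standard html.parser, treats nested Control Structures as separate flat blocks, and tracks only <b> and <i>. The names BlockParser and parse_blocks are hypothetical.

```python
from html.parser import HTMLParser

# Control Structures as listed in the text; style tags limited to bold/italics.
CONTROL_TAGS = {"p", "td", "div", "span", "layer", "body"}
STYLE_TAGS = {"b", "i"}


class BlockParser(HTMLParser):
    """Groups words under their enclosing Control Structure and records styles."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks of text with their statistics
        self.words = []         # (word, is_bold, is_italic) for the open block
        self.open_styles = []   # currently open <b> / <i> tags

    def _flush(self):
        """Close the current block, if it contains any words."""
        if self.words:
            total = len(self.words)
            self.blocks.append({
                "text": " ".join(w for w, _, _ in self.words),
                "word_count": total,
                "bold_pct": 100.0 * sum(b for _, b, _ in self.words) / total,
                "italic_pct": 100.0 * sum(i for _, _, i in self.words) / total,
            })
            self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTROL_TAGS:
            self._flush()                    # a new Control Structure starts a block
        elif tag in STYLE_TAGS:
            self.open_styles.append(tag)

    def handle_endtag(self, tag):
        if tag in CONTROL_TAGS:
            self._flush()
        elif tag in self.open_styles:
            self.open_styles.remove(tag)

    def handle_data(self, data):
        bold = "b" in self.open_styles
        italic = "i" in self.open_styles
        for word in data.split():
            self.words.append((word, bold, italic))


def parse_blocks(html):
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    parser._flush()                          # flush text left after the last tag
    return parser.blocks


blocks = parse_blocks(
    "<body><b>Tom Mitchell's <i>Machine Learning</i> Website</b></body>")
print(blocks)
```

For the example of Figure 8, the sketch yields a single block of 5 words, of which 100% are bold and 40% are in italics, matching the observations in Figure 9.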
5.2 PARCELS Tags for News Domain

As mentioned earlier, this release focuses on the News Domain. We therefore need to
identify the labels for the segments that are of interest to us in the News
Domain. These labels are easily extended.
After looking through articles from 12 News sites, we came up with 15 PARCELS labels.
These labels cover commonly found components, such as the title, date and reporter name.
When we did the actual annotation for our Machine Learning training data, we found
segments that were useful but did not fall into any of these classes. Thus, 2 more labels
were added, namely sub header and supporting content.
Presently, these are the 17 labels:
PARCELS Labels
    Purpose of Labels
Title of article
    Refers to the headline of the article or phrase that summarizes the article
Date / Time of article
    Refers to the date and time the article was published
Reporter name
    Refers to the author of the article
Source station
    Refers to the source / provider broadcast station of the news article
Country where news occurred
    Refers to the country, city or state where the event took place
Image supporting contents of article
    Refers to an image (a photograph, drawing or sketch) related to the contents of
    the article
Links supporting contents of article
    Refers to a hyperlink placed within the main content of the article
Main content of article
    Refers to the main text of the news article
Supporting content of article
    Refers to the other text not belonging to the main text of the news article but
    still related to the article, such as captions of images, captions of audio /
    video files, any text of sidebars containing additional information about the
    article
Sub header
    Refers to the sub headers which are found within the main content of the article
Links to related article
    Refers to links to other news articles which have similar content or articles
    previously reported on the same topic, and the date / time of these articles
Newsletter
    Refers to text / links prompting the user to sign up for or subscribe to a
    newsletter
Site image
    Refers to images which are part of the web page's design and not related to the
    news article
Site content
    Refers to the text which is part of the web page's content and not related to the
    news article
Site links / navigation
    Refers to links to other parts of the website, including navigation bars but not
    related to the news article
Advertisements
    Refers to advertisements on the website, text ads and banner ads
Search
    Refers to the text / links related to searching or search options
Figure 10: PARCELS labels and purposes
More information on the textual engine of PARCELS can be found in (Lai, 2004).
Having formulated the PARCELS toolkits, we are now able to retrieve useful
information from web pages. With this information, we proceed to Machine
Learning.
6. Machine Learning and Co-Training

For the stylistic engine, we use a Support Vector Machine (SVM) for Machine
Learning. For the textual engine, Boostexter is used (Lai, 2004). For comparison
purposes, Boostexter is also included as an added feature in the stylistic engine.
6.1 Support Vector Machine

SVM is selected for some of its core properties, in particular that its ability to learn
can be independent of the dimensionality of the feature space (Joachims, 1998). For
stylistic detection, the size of the feature vector will grow as additional domains are
supported.
Furthermore, SVMs measure the complexity of hypotheses based on the margin with
which they separate the data, not on the number of features. This means we can
generalize even in the presence of many features (Joachims, 1998).
The same margin argument also suggests a heuristic for selecting good parameter
settings for the learner. We can input a bias (cost factor) via a parameter. This allows
fully automatic parameter tuning without expensive cross-validation.
6.1.1 PARCELS Labels vs. Stylistic Features
Before we proceed with the feature vector, we must first tabulate the stylistic features
which are likely to affect certain PARCELS labels. This is to cut down on the number of
redundant features.
Labels that are more likely to be classified by the stylistic engine are colored in light grey
while labels that are less likely to be classified are in white.
PARCELS Label
• Stylistic Feature

Title of article
• Likely detected by stylistic engine
• Larger font size
• May be in Bold font
• Relatively Top Position

Date / Time of article
• Not likely detected by stylistic engine
• Likely grouped with Reporter name
• Smaller font size
• May be in Italics font

Reporter name
• Not likely detected by stylistic engine
• Likely grouped with Date / Time
• May be in Italics font

Source station
• Not likely detected by stylistic engine
• Likely grouped within blocks of text

Country where news occurred
• Not likely detected by stylistic engine
• Likely grouped within blocks of text

Image supporting contents of article
• Likely detected by stylistic engine
• Image tags grouped with high Word Count
• Position within Main Content

Links supporting contents of article
• Likely detected by stylistic engine
• Links tag grouped with high Word Count
• Position within Main Content

Main content of article
• Likely detected by stylistic engine
• High Word Count
• Few or no Bold / Italics tags
• Position across News sites is similar

Supporting content of article
• Likely detected by stylistic engine
• High Word Count
• Few or no Bold / Italics tags
• Likely to be using different font color

Sub header
• Likely detected by stylistic engine
• Low Word Count
• High percentage in Bold

Links to related article
• Likely detected by stylistic engine
• High percentage of links tags
• High percentage of text within <a> tags

Newsletter
• Not likely detected by stylistic engine
• Low Word Count
• May be in smaller font size

Site image
• Likely detected by stylistic engine
• Low Word Count

Site content
• Not likely detected by stylistic engine
• Low Word Count

Site links / navigation
• Likely detected by stylistic engine
• High percentage of text within <a> tags
• May have a high percentage of images

Advertisements (Graphical)
• Likely detected by stylistic engine
• Low percentage of link tags
• Low percentage of image tags
• Low or no Word Count
• May be in fixed sizes

Search
• Not likely detected by stylistic engine
• Low Word Count
Figure 11: PARCELS labels’ Stylistic Features
Therefore, from Figure 11, we can see that the stylistic engine will likely detect 10 out of
the 17 labels with high confidence. We will verify this in Chapter 7 (Analysis of Results).
The textual engine will handle the rest of the labels with high confidence (Lai, 2004).
With the observed stylistic features, we now formulate our feature vector.
6.1.2 Formulation of Feature Vector
Having identified the prominent stylistic features, we came up with the feature vector for
each segment of text.
Characteristics Detection (with Number of features)
• Detection of tables and the flow of cells as mentioned in Section 4.3. We
observed the number of nested tables does not go beyond 3 levels on the
Web. Thus, we set the limit to 4 nested tables.
Number of features: 11
• Detection of position of div, span and layer as mentioned in Section 4.4. This
refers to which of the 9 regions on screen they are in.
Number of features: 3
• Detection of stylistic properties of the segment, like the percentage of bold
text, the percentage of italics text, etc.
Number of features: 8
• Detection of weight factor of the segment. These features capture the number
of <br> tags and word count. Good for detecting labels like Main Content.
Number of features: 4
• Detection of links ratio in that particular segment of the web page. Useful in
detecting labels like Related Links and Navigation Links.
Number of features: 3
• Detection of images and their sizes. Good for advertisements and site links
detection.
Number of features: 2
Total Features: 31
Figure 12: Breakdown of features in Feature Vector
One feature vector is output for every text block segment, and all vectors are stored in
a single file. Features that do not apply to a segment are left out of its vector. This
file is then used for training or classification.
For the full listing on the contents of the feature vector, please refer to Appendix B.
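As an illustration of leaving out features that do not apply, the sketch below writes one segment in a sparse "index:value" layout of the kind used by SVM-light-style tools. The source does not state the exact file format used by PARCELS, so the layout, the function name format_feature_vector and the sample values are assumptions. The feature indices follow Appendix B.

```python
def format_feature_vector(label, features):
    """Format one segment as '<label> <index>:<value> ...'.

    `features` maps a 1-based feature index (1 to 31, as in Appendix B) to a
    numeric value, or to None when the feature does not apply to the segment.
    """
    parts = [str(label)]
    for index in sorted(features):          # indices in increasing order
        value = features[index]
        if value is None:                   # feature does not apply: leave it out
            continue
        parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical values for the segment of Figure 8 (plain body text, no div).
segment = {
    12: None,    # position of div: not applicable
    15: 40.0,    # percentage of words in <i>
    17: 100.0,   # percentage of words in <b>
    23: 0,       # number of <br> tags
    25: 5,       # number of words
    26: 100.0,   # percentage of words over all words in article
}

line = format_feature_vector(1, segment)
print(line)
```

A sparse layout keeps the file compact when most of the 31 features are absent for a given segment, for example the nested-table features of a segment that is not inside any table.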
6.2 Boostexter
Boostexter is used mainly in the textual engine (Lai, 2004). For comparison, we have
included a module to convert our SVM feature vector to Boostexter. The PARCELS
labels and formulation of feature vector for Boostexter will be similar to Section 6.1.
6.3 Co-Training
In Machine Learning, unlabelled examples are significantly easier to come by than labeled
ones. The idea of co-training is to use this large unlabelled sample to boost the
performance of our learning algorithm, since only a small set of labeled examples is
available to us due to manpower constraints (Blum and Mitchell, 1998).
6.3.1 Stylistic and Textual Engine
Since we have 2 engines, one based on styles and one based on text, which are almost
mutually exclusive, it is logical to apply co-training to achieve a better result.
SVM Training Data -> SVM Trainer -> SVM Classifier -> Classify New Data
BT Training Data  -> BT Trainer  -> BT Classifier  -> Classify New Data
Unlabelled Data -> Co-training Module
Co-training Module: Remove top k positive data (from Unlabelled Data)
Co-training Module: Add top k positive data (to SVM Training Data and BT Training Data)
Figure 13: Co-Training Flowchart
Firstly, there are two sets of training data, one each for SVM and Boostexter. After
learning is performed on this training data, the SVM classifier classifies the
unlabelled examples. The top k examples with the highest positive confidence are
collected after classification. The same applies to the Boostexter classifier.
PARCELS then compares the top k examples from the two classifiers. The
overlapping examples are selected first. For example,
SVM Classifier (Top k)        Boostexter Classifier (Top k)
Segment  Classification       Segment  Classification
5        Main Content         6        Main Content
6        Related Links        5        Main Content
Figure 14: Example of Co-Training Classification
From Figure 14, Segments 5 and 11 will be selected. Segments that receive two different
classifications are ignored. Thus, Segment 6 will be ignored. Because we are using two
separate Machine Learners, their confidence levels are not directly comparable, so such
conflicts cannot be resolved by comparing confidence.
Since we are looking for the top 5 (k = 5 in this example), we take the top examples
from each classifier until 5 is reached. Therefore, Segments 10, 42 and 1 will be included.
These 5 examples are then removed from the unlabelled examples and added to the
training sets for both SVM and Boostexter.
The whole process repeats for the number of times specified.
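The selection step described above can be sketched as follows. This is a minimal illustration, not the PARCELS implementation (which is in Java): the function name select_examples is hypothetical, the rankings are made-up data chosen to mirror the worked example, and alternating between the two classifiers when filling the remaining slots is an assumption, since the text only says that top examples are taken from each classifier until k is reached.

```python
def select_examples(svm_ranked, bt_ranked, k):
    """Pick up to k (segment, label) pairs to move into both training sets.

    Each argument is a list of (segment_id, label) pairs sorted by that
    classifier's own confidence, highest first.
    """
    bt_labels = dict(bt_ranked)

    # Segments proposed by both classifiers with different labels are ignored.
    conflicts = {seg for seg, label in svm_ranked
                 if seg in bt_labels and bt_labels[seg] != label}

    # 1. Overlapping segments with the same label are selected first.
    selected = [(seg, label) for seg, label in svm_ranked
                if seg in bt_labels and bt_labels[seg] == label][:k]
    chosen = {seg for seg, _ in selected}

    # 2. Fill the remaining slots by taking turns between the two ranked lists,
    #    so confidence scores are never compared across the two learners.
    queues = [list(svm_ranked), list(bt_ranked)]
    turn = 0
    while len(selected) < k and any(queues):
        queue = queues[turn % 2]
        turn += 1
        while queue:
            seg, label = queue.pop(0)
            if seg not in chosen and seg not in conflicts:
                selected.append((seg, label))
                chosen.add(seg)
                break
    return selected


# Hypothetical rankings (not the actual data behind Figure 14).
svm_ranked = [(5, "Main Content"), (6, "Related Links"), (11, "Site Content"),
              (10, "Main Content"), (42, "Navigation Links")]
bt_ranked = [(6, "Main Content"), (5, "Main Content"), (11, "Site Content"),
             (1, "Related Links"), (7, "Main Content")]

picked = select_examples(svm_ranked, bt_ranked, k=5)
print([seg for seg, _ in picked])
```

With these rankings, Segments 5 and 11 are selected as agreed overlaps, Segment 6 is dropped as a conflict, and Segments 10, 1 and 42 fill the remaining slots.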
7. Analysis of Results
With the PARCELS engines working, we put them to real use by
applying them to existing web pages in the News domain.
7.1 Preliminary Investigation of Results

It is not easy to obtain large amounts of labeled data in a short time. Thus,
we obtained some preliminary results with a training set of 5 news sites.
Labels             Initial        Round 1        Round 2        Round 3        Round 4
Precision / Recall P      R       P      R       P      R       P      R       P      R
Overall            70.1   70.1    70.1   70.1    71.9   71.9    64.9   64.9    64.9   64.9
Main Content       87.8   100.0   84.8   96.5    84.3   93.1    72.9   93.1    72.9   93.1
Related Link       100.0  50.0    75.0   50.0    100.0  66.6    100.0  33.3    100.0  33.3
Site Content       33.3   100.0   38.4   100.0   50.0   80.0    57.1   80.0    57.1   80.0
Navigation Links   75.0   60.0    50.0   60.0    36.3   80.0    22.2   40.0    22.2   40.0
Figure 15: Preliminary Investigation of Results
With the available training data, the result peaked in the second round of co-training,
at an overall precision and recall of 71.9%, which is a good result. In fact, compared
with the papers in Chapter 2 that use their own representation tree, such as the HTML
Struct Tree, our results are on par and even higher in some instances.
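For reference, per-label precision and recall can be computed as in the sketch below. The overall precision and recall in Figure 15 are identical in every round; this is what one would expect if each segment receives exactly one label and the overall figures are micro-averaged, although the source does not state how the overall figures were computed, so this is an assumption. The function names and the sample labels are hypothetical.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall of one label, given parallel lists of labels."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_pos = sum(1 for p in predicted if p == label)
    actual_pos = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


def micro_average(gold, predicted):
    """Micro-averaged precision and recall over all labels.

    Every segment has exactly one gold label and one predicted label, so the
    pooled number of predictions equals the pooled number of gold labels and
    the two measures coincide.
    """
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(predicted), correct / len(gold)


# Hypothetical annotations for five segments (not data from Figure 15).
gold = ["Main Content", "Main Content", "Related Link",
        "Site Content", "Related Link"]
predicted = ["Main Content", "Main Content", "Related Link",
             "Main Content", "Site Content"]

main_p, main_r = precision_recall(gold, predicted, "Main Content")
overall_p, overall_r = micro_average(gold, predicted)
print(main_p, main_r, overall_p, overall_r)
```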
Not surprisingly, Main Content is the easiest to classify, with its recall constantly over
90% and precision over 70%. However, main content extraction is already
heavily researched. Thus, we are more interested in the other labels.
Related Link comes a close second, with a precision of 100% in four of the five rounds
but a rather low recall (33.3% to 66.6%). The low recall is due to the lack of training
data. The same can be said for Navigation Links, as both belong to the Links class.
On the other hand, Site Content has an increasing precision and a high recall. The
increasing precision suggests that co-training works well for our classification given
a sufficient amount of training data.
We are certain that the results can be a lot better. First of all, not all the labels are
available in the training set. Those labels not mentioned in Figure 15 are the ones we do
not have positive examples of.
Nevertheless, an annotation engine is already in place to handle this problem (Lai, 2004).
7.2 Detailed Investigation of Results
More labeled examples from the annotation engine will be available in the near future.
With more labeled examples, we are confident that the precision and recall will
improve further.
We will provide the detailed investigation of results on our PARCELS website on
Sourceforge.
8. Conclusion
We started this paper with a mission to design a system that classifies segments of web
pages into useful information. PARCELS does just that. Inputting a News article into
PARCELS gives us the Logical Structure of how the web page is designed. With this
Logical Structure, we are able to obtain the relevant segments of the web page.
We carried out numerous background analyses of how humans perceive web pages
before designing the system. Thus, we are confident that PARCELS will be able
to classify with accuracy and precision.
With the working system, we can apply it to many tasks, some of which are mentioned
in this thesis. PARCELS is easily extendable. One of its design considerations
was the ability to handle pages from different domains. Since the system is
designed modularly, extending it to other domains should pose little difficulty.
PARCELS is currently released under GPL on Sourceforge: http://parcels.sourceforge.net
8.1 Future Work

Nevertheless, there is always room for further work. Below are some of the areas which
can be improved.
• Ability to handle Frame Structure
This will ensure we have a complete system that can parse virtually any web page
on the Web today.
• Ability to handle Cross Documents Structure
This will give PARCELS the ability to view web pages as clusters instead of
individual pages. This is important since most web pages today are part of a major
web site.
• Improved CSS support
Although there is CSS support in this release, our primary focus is still HTML tags.
CSS is gaining popularity, so fuller CSS support would be useful.
• JavaScript support
We did not handle JavaScript in this release. Support for it would be useful, as
JavaScript does affect the layout of pages in some instances (though rarely).
References
Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with
co-training. In COLT: Proceedings of the Workshop on Computational Learning
Theory, Morgan Kaufmann Publishers, 1998.
Cai, D., Yu, S., Wen, J.R. and Ma, W.Y. (2003). Extracting Content Structure for Web
Pages Based on Visual Representation. In the 5th Asia Pacific Web Conference,
Xi’an, 2003, pp. 406 – 417.
Cohen, W.W., Hurst, M. and Jensen, L.S. (2002). A flexible learning system for
wrapping tables and lists in HTML documents. In the 11th WWW Conference,
Hawaii, 2002, pp. 232 – 241.
DiPasquo, D. (1998). Using HTML formatting to aid in natural language processing on
the World Wide Web. Senior thesis, Computer Science Department, Carnegie
Mellon University, 1998.
Gupta, S., Kaiser, G.E., Neistadt, D. and Grimm, P. (2003). DOM-based content
extraction of HTML documents. In the 12th WWW Conference, Budapest, 2003, pp.
207 – 214.
Hurst, M. (1999). Layout and Language: Beyond Simple Text for Information Interaction
– Modeling the Table. In the 2nd International Conference on Multimodal
Interfaces, 1999.
Hurst, M. and Douglas, S. (1997). Layout & Language: Preliminary experiments in
assigning logical structure to table cells. In the Applied Natural Language
Processing Conference, Washington, 1997, pp. 217 – 220.
Ivory, M.Y. and Hearst, M.A. (2002). Statistical Profiles of Highly-Rated Web Sites. In
ACM Conference on Human Factors in Computing Systems, CHI Letters, 2002, pp.
367 – 374.
Ivory, M.Y. (2003). Characteristics of Web Site Designs: Reality vs. Recommendation.
In HCI International Conference, 2003.
Joachims, T. (1998). Text categorization with Support Vector Machines: Learning with
many relevant features. In Machine Learning: ECML-98, Tenth European
Conference on Machine Learning, pp. 137 – 142.
Lai, S. (2004). PARCELS: PARser for Content Extraction and Logical Structure (Text
Deduction). Honours Thesis, School of Computing, National University of
Singapore, 2004.
Yu, S., Cai, D., Wen, J.R. and Ma, W.Y. (2003). Improving pseudo-relevance feedback
in web information retrieval using web page segmentation. In the 12th WWW
Conference, Budapest, 2003, pp. 11 – 18.
Appendix A – Distribution of HTML Structural Tags

The table below shows the distribution of HTML Control tags for 12 popular News sites
on the Web today.
                                                        Percentage of tags embedded within
News Sites                        Page      Total Tags  <table>  <div>   <span>  <layer>
National Geographic (Jan 2004)    Index     464         97.1%    32.1%   4.0%    0%
                                  Articles  447         97.2%    50.2%   2.0%    0%
National Geographic (Jan 2002)    Index     454         96.0%    29.9%   5.2%    0%
                                  Articles  337         96.3%    39.3%   2.9%    0%
Straits Times (Jan 2004)          Index     608         98.1%    3.9%    2.7%    0%
                                  Articles  425         96.7%    12.4%   1.1%    0%
Straits Times (May 2000)          Index     619         96.4%    10.1%   0%      0%
                                  Articles  411         95.3%    12.0%   0%      0%
CNN (Jan 2004)                    Index     943         97.1%    31.2%   6.6%    0%
                                  Articles  649         95.4%    49.0%   2.4%    0%
CNN (Jun 2000)                    Index     1358        99.1%    22.1%   8.0%    0%
                                  Articles  934         98.2%    36.4%   12.1%   0%
Yahoo News (Jan 2004)             Index     1012        98.6%    6.9%    3.3%    0%
                                  Articles  769         98.0%    6.8%    4.4%    0%
Yahoo News (Oct 2000)             Index     929         98.7%    0%      0%      0%
                                  Articles  506         96.3%    0%      0%      0%
BBC (Jan 2004)                    Index     622         88.5%    60.1%   0.8%    0%
                                  Articles  479         85.0%    48.2%   1.4%    0%
BBC (Feb 2000)                    Index     589         95.7%    19.6%   18.6%   0%
                                  Articles  465         87.2%    31.4%   9.0%    0%
Slashdot (Jan 2004)               Index     1116        98.8%    0%      0%      0%
                                  Articles  2894        99.3%    0.3%    0%      0%
Slashdot (May 2000)               Index     914         96.2%    0%      0%      0%
                                  Articles  2459        98.9%    0%      0%      0%
CNet (Jan 2004)                   Index     661         0%       97.1%   8.4%    0%
                                  Articles  431         4.6%     95.9%   2.0%    0%
CNet (Jan 2002)                   Index     744         98.3%    98.4%   17.4%   0%
                                  Articles  403         97.0%    97.2%   12.3%   0%
Wired News (Jan 2004)             Index     402         5.9%     94.7%   9.7%    0%
                                  Articles  217         9.4%     81.7%   13.3%   0%
Wired News (Feb 2000)             Index     857         98.2%    0%      0%      0%
                                  Articles  410         93.4%    0%      0%      0%
The Register (Jan 2004)           Index     786         98.8%    54.8%   0.3%    0%
                                  Articles  412         97.8%    15.7%   0%      0%
The Register (Feb 2000)           Index     679         98.2%    0%      0%      0.1%
                                  Articles  247         95.1%    0%      0%      0.4%
Space.com (Jan 2004)              Index     426         97.6%    5.1%    0%      0%
                                  Articles  586         93.6%    1.1%    0%      0%
Space.com (Mar 2000)              Index     616         98.5%    12.0%   0%      0%
                                  Articles  598         98.2%    13.0%   0%      0%
MSNBC (Jan 2004)                  Index     1036        88.0%    97.4%   29.1%   0%
                                  Articles  538         77.9%    97.3%   15.4%   0%
MSNBC (Feb 2001)                  Index     315         95.8%    31.7%   9.2%    0%
                                  Articles  458         94.8%    0.1%    0%      0%
Arabic News (Jan 2004)            Index     417         97.3%    98.0%   0%      0%
                                  Articles  213         95.3%    3.7%    0%      0%
Arabic News (Feb 2000)            Index     426         95.3%    0%      0%      0%
                                  Articles  133         86.4%    0%      0%      0%
Figure 16: Distribution of HTML Structure Tags
Appendix B – Listing of Features in Feature Vector

1: Depth of current cell in the nested tables
2: Row number of current cell in outermost table
3: Row number of current cell in the 1st nested table
4: Row number of current cell in the 2nd nested table
5: Row number of current cell in the 3rd nested table
6: Row number of current cell in the 4th nested table
7: Column number of current cell in outermost table
8: Column number of current cell in the 1st nested table
9: Column number of current cell in the 2nd nested table
10: Column number of current cell in the 3rd nested table
11: Column number of current cell in the 4th nested table
12: Position (1 – 9) of div
13: Position (1 – 9) of span
14: Position (1 – 9) of layer
15: Percentage of words in <i> in this segment
16: Percentage of words in <i> in this segment over all words under <i> in article
17: Percentage of words in <b> in this segment
18: Percentage of words in <b> in this segment over all words under <b> in article
19: Percentage of words in <u> in this segment
20: Percentage of words in <u> in this segment over all words under <u> in article
21: Percentage of words in <font> in this segment
22: Percentage of words in <font> in this segment over all words under <font> in article
23: Number of <br> tags in this segment
24: Percentage of <br> in this segment over all <br> in article
25: Number of words in this segment
26: Percentage of words in this segment over all words in article
27: Percentage of words in <a> in this segment
28: Number of <a> tags in this segment
29: Percentage of <a> in this segment over all <a> in article
30: Number of <img> tags in this segment
31: Percentage of <img> in this segment over all <img> in article