You are on page 1of 43

Honours Year Project Report

PARCELS: PARser for Content Extraction and Logical


Structure (Stylistic Detection)

By

Lee Chee How

Department of Computer Science

School of Computing

National University of Singapore

2003 / 2004
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Honours Year Project Report

PARCELS: PARser for Content Extraction and Logical


Structure (Stylistic Detection)

By

Lee Chee How

Department of Computer Science

School of Computing

National University of Singapore

2003 / 2004

Project: H79010
Advisor: Assistant Professor Kan Min-Yen

Deliverables:
Report: 1 Volume
Program: 1 CD

________________________________________________________________________
i
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Abstract
Web documents that look similar often use different HTML tags to achieve their layout
effect. These tags make it difficult for a machine to find text or images of interest.
PARCELS is a Java backend system designed to classify different components of a web
page by obtaining the logical structure from its layout and stylistic characteristics. Each
component is given a PARCELS label which identifies its purpose, like Main Content,
Links, etc. A recall and precision of 71.9% is obtained. This backend system is currently
released under GPL.

Subject Descriptors:

I.5.2 Style guides


I.5.4 Text Processing
I.7.2 Format and notation
I.7.5 Document Analysis

Keywords:

Text Processing, Logical structure of web pages, Segmentation of web pages,


Classification of web page components / segments

Implementation Software and Hardware:

Intel Pentium Centrino processor, 512 MB DDR RAM, Windows / UNIX / Linux,
Java SDK 1.4.2

________________________________________________________________________
ii
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Acknowledgement
There are many people whom helped me accomplish the work for this thesis. I would like
to thank my partner, Sandra, who has worked along side with me in formulating the
PARCELS toolkits. She is in charge of the textual engine.

I would also like to thank the people who release their tools and codes under GNU Public
License (GPL) or Open Source. They have made our work on PARCELS easier by not
re-inventing the wheel. In return, PARCELS is released under GPL as well.

Over the past year, there’s one person whom I owe the greatest thanks. During the year,
Assistant Professor Kan Min-Yen truly took me under his wing and guided me in
countless ways. I would like to thank him for all his support and patience I have received
for the past year.

________________________________________________________________________
iii
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

List of Figures
Figure 1: VIPS Representation ......................................................................................5
Figure 2: Cues for segmentation of web pages ..............................................................6
Figure 3: Patterns for Visual Separator Detection .........................................................6
Figure 4: Observations: Related Work vs. News Articles .............................................9
Figure 5: 9 Regions on Screen.....................................................................................12
Figure 6: National Geographic News Site ...................................................................15
Figure 7: Partial DOM Tree Representation................................................................15
Figure 8: Resultant DOM Tree Representation ...........................................................16
Figure 9: Stylistic Observations on Figure 8 ...............................................................17
Figure 10: PARCELS labels and purposes ................................................................19
Figure 11: PARCELS labels’ Stylistic Features ........................................................23
Figure 12: Breakdown of features in Feature Vector.................................................24
Figure 13: Co-Training Flowchart.............................................................................25
Figure 14: Example of Co-Training Classification....................................................25
Figure 15: Preliminary Investigation of Results ........................................................27
Figure 16: Distribution of HTML Structure Tags..................................................... A3

________________________________________________________________________
iv
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Table of Contents
Abstract .............................................................................................................................ii
Acknowledgement ........................................................................................................... iii
List of Figures...................................................................................................................iv
Table of Contents...............................................................................................................v
1. Motivations ................................................................................................................1
1.1 Applications of PARCELS ............................................................................1
2. Related Works ...........................................................................................................3
3. Background Analysis .................................................................................................7
3.1 Analysis of News sites ...................................................................................7
3.1.1 Structural Properties...............................................................................7
3.1.2 Stylistic Properties .................................................................................8
4. Analysis of HTML Structures..................................................................................10
4.1 Body.............................................................................................................10
4.2 Paragraph .....................................................................................................10
4.3 Table Structure.............................................................................................10
4.3.1 Flow of Cells........................................................................................10
4.3.2 Grouping of Cells by Value..................................................................11
4.3.3 Ordering of Tables by Depth................................................................11
4.3.4 Position of Cells ...................................................................................11
4.4 Division / Span Structure .............................................................................12
4.4.1 Position of Division / Span...................................................................12
4.5 Layers Structure ...........................................................................................13
4.5.1 Priority of Layers .................................................................................13
4.6 Other Structures ...........................................................................................13
5. Formulation of the PARCELS toolkits ....................................................................14
5.1 DOM Tree Parser.........................................................................................14
5.1.1 DOM Tree............................................................................................14
5.1.2 Parser ...................................................................................................16
5.2 PARCELS Tags for News Domain ..............................................................18

________________________________________________________________________
v
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

6. Machine Learning and Co-Training .........................................................................20


6.1 Support Vector Machine ..............................................................................20
6.1.1 PARCELS Labels vs. Stylistic Features...............................................20
6.1.2 Formulation of Feature Vector .............................................................23
6.2 Boostexter ....................................................................................................24
6.3 Co-Training..................................................................................................24
6.3.1 Stylistic and Textual Engine.................................................................24
7. Analysis of Results ..................................................................................................27
7.1 Preliminary Investigation of Results ............................................................27
7.2 Detailed Investigation of Results .................................................................28
8. Conclusion ...............................................................................................................29
8.1 Future Work.................................................................................................29
References .......................................................................................................................31
Appendix A – Distribution of HTML Structural Tags .................................................... A1
Appendix B – Listing of Features in Feature Vector ...................................................... B1

________________________________________________________________________
vi
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

1. Motivations
Although the Web Wide Web (WWW) is defined as “the universe of network-accessible
information, the embodiment of human knowledge” (DiPasquo, 1998) by the World
Wide Web Consortium (W3C), it currently does not live up to this definition.

It is true that web pages are increasing at an exponential rate. Thus, countless projects
were targeted at ways of using machines to harness this knowledge, e.g. Semantic Web.
However, due to the semi-structured nature of raw HTML pages, accessing relevant
information is by no means an easy task. Pages that look similar often use different
HTML tags to achieve their layout effects and this makes it difficult for a machine to find
relevant fields of interest.

There are existing projects which tackle part of the problem, e.g. (DiPasquo, 1998), (Cai,
Yu, Wen and Ma, 2003), (Yu, Cai, Wen and Ma, 2003) and (Gupta, Kaiser, Neistadt and
Grimm, 2003). However, none of the solutions address the problem exactly.

PARCELS is a backend system designed to address this problem. Coded in Java,


PARCELS is executable on major platforms like Windows, UNIX and Linux. Web pages
of any nature can be applied to PARCELS.

Essentially, PARCELS consists of 2 engines – One on the textual detection of web pages
(Lai, 2004), the other on the stylistic detection of web pages. In this documentation, we
will be discussing on the stylistic detection engine.

1.1 Applications of PARCELS


PARCELS can be applied to many real life applications today:

• Efficient web page reading

________________________________________________________________________
1
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

With PARCELS, we can extract the title and/or main body of any web pages. This
greatly improves the efficiency of retrieving the relevant information on the WWW.

• Reducing size for Data Archiving


Given the number of web pages today, archiving these pages will require enormous
amounts of disk space. With PARCELS, we can remove items of lower priorities
like site links, advertisements and images of no relevance. Thus, reducing the size
needed.

• Improving results for Web Search / Mining


With information on the layout of the pages, we can target the search and mining on
the relevant components. This greatly improves the accuracy and efficiency of the
search.

• Classification of web page cluster through web design


We can identify web pages of the same nature using the logical structure of the web
pages. With this classification, we can proceed with other tasks like archiving the
documents in the correct domain.

As we can see, PARCELS greatly improves the effectiveness and efficiency of our
everyday tasks, like reading a News articles without all the advertisements. This allows
better access of “the universe of network-accessible information, the embodiment of
human knowledge” mentioned earlier.

In the next few chapters, we will do an analysis on prominent web pages and their HTML
structures. Following that, we will touch on how we formulate the PARCELS toolkits
which is our engine. Lastly, we will describe our Machine Learning techniques and
perform an analysis on the results.

________________________________________________________________________
2
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

2. Related Works
Our research is closely related to several areas which have been heavily worked on. In
this section, we will look at some of the more prominent research on text processing and
information retrieval on the Web.

Firstly, we must reaffirm our basis that there is information in the layout and stylistic
properties of web pages:

• In (DiPasquo, 1998), the thesis argues that there is information in the layout of a
web page, and that by looking at the HTML formatting in addition to the text on a
page, one can improve performance in tasks such as learning to classify segments of
documents.

The focus of the thesis was to extract the names of companies from the web pages
and the locations where they have operations in. For companies’ names, the thesis
reported a precision of 66.0% and recall of 27.1%. For companies’ locations, the
thesis reported a precision of 71.6% and recall of 64.4%.

As we can see, the results are better than those not using the HTML Struct Tree. Thus,
layout improves the classification. As PARCELS is divided into textual and stylistic
engines. We reaffirm the belief that the stylistic properties are useful in classifying
segments of web pages.

Next, we noted from related works that most papers modified the representation of the
DOM Tree:

• In (DiPasquo, 1998), they modified the DOM Tree and called it the HTML Struct
Tree. The HTML Struct Tree is intended to represent how a human would parse
word relationships on a web page due to its layout. As web pages are meant to be

________________________________________________________________________
3
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

read by humans, if parsed correctly, the relationship between objects on the page
can be captured.

Using the HTML Struct Tree, a Machine Learner was applied. As mentioned
earlier, the result is better than those not using the HTML Struct Tree.

Instead of changing the structure, we will be using the normal DOM Tree. This ensures
easier portability of our codes for other uses. As a result, we made modifications to our
parsers instead.

• In (Yu et al, 2003), the paper describes a new method to detect the semantic content
of a web page. Instead of a DOM Tree method, their new method uses the VIsions-
based Page Segmentation (VIPS) algorithm.

They proposed that by using local feedback to add keywords from top ranking
documents, they are able to improve the precision and recall for second round
search. However, there is one difficulty they faced: Accurate feedback is impossible
due to noises like navigation, decoration and interaction.

Thus, VIPS is used to eliminate such noises by using a tree to represent the flow of
text. From the tree, it is observed that topics of similar content are grouped together.
VIPS uses a series of heuristics and detection to achieve this.

The main drawback of this paper is that they are handling web pages with linear style (no
tables or layers). The usage of feedback described in this paper is good but it will not be
included this version of PARCELS. In this release, we will focus on the main framework.

• In (Cai et al, 2003), the paper proposes a new web content structure based on visual
representation of web pages. The new structure presents an automatic top-down,
tag-tree independent approach to detect web content structure. It simulates how a
user understands web layout structure based on his visual perception.
________________________________________________________________________
4
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Figure 1: VIPS Representation

Comparing to other existing techniques, the new approach is independent to


underlying documentation representation such as HTML and works well even when
the HTML structure is far different from layout structure.

This paper is also using the VIPS algorithm as described previously. The main difference
is they are handling other Control Structures like tables (Figure 1). Personally, we still
feel that using DOM-based Tree with an improved DOM Tree parser to handle tables,
layers and frames will be more effective, while preserving the generic-ness of the web
page. We would do a comparison on this in the Chapter 7 (Analysis of Results).

Lastly, we would need to observe their heuristics on stylistic detection:

• In (Yu et al, 2003) and (Cai et al, 2003), the heuristics used to split contents into
blocks are as follows:

________________________________________________________________________
5
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Cue Purpose

Tag Cue HTML tags like <HR> are used to split the web page into blocks of text.

Color Cue Difference in the background colors between 2 blocks of text means they
belongs to 2 different blocks.

Text Cue If one block contains HTML tags and the other does, it means they belong to
2 different blocks.

Size Cue Difference in the font sizes between 2 blocks of text means they belongs to
2 different blocks.

Figure 2: Cues for segmentation of web pages

Visual Separator Detection used:

Pattern Purpose

Distance Distance between blocks on both sides of separator

Tag Position of tags like <HR>

Font Difference in formatting on both sides of separator

Color Difference in background color on both sides of separator

Figure 3: Patterns for Visual Separator Detection

The heuristics and detection methods used in VIPS are very useful and are included in
PARCELS. We further extended the set of heuristics to suit our purposes. The detailed
set can be found in Chapter 5 (Formulation of PARCELS toolkits).

As we can see, most of the related works focus on how humans interpret the web pages
based on the layout and stylistic properties. This is our main focus for the stylistic engine
as well. Thus, we shall proceed with our analysis on the layout and stylistic properties of
real web pages.

________________________________________________________________________
6
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

3. Background Analysis
Based on the argument that there is information inherent in the layout of each page, and
that parsing the HTML formatting appropriately can improve traditional text processing
(DiPasquo, 1998), we decided to carry out an analysis on News web pages.

News articles are usually reader friendly but not machine friendly. Furthermore, articles
in the News domain are feature rich with complex layout structures. Articles are also
updated daily and read by millions of people. With the ability to parse news articles, we
would be able to parse pages of other domains with relative ease. Thus, it makes sense to
use this domain as a starting point for our analysis.

3.1 Analysis of News sites


To facilitate our analysis, a total of 12 popular news sites were selected. The News sites
were analyzed for their structural and stylistic properties.

3.1.1 Structural Properties

A study was made to find out the dominant layout structure of these sites. Figure 16 in
Appendix A gives the breakdown of the statistics of <table>, <div>, <span> and <layer>
tags used.

A related study was also made to News articles in Year 2000. We will be contrasting
these 2 analyses to pin point the trend of layout structural tags usage.

• Structural tag <table> is most widely used


In Year 2004, we found out that 10 out of the 12 News sites are using tables as their
dominant layout structure. In Year 2000, all the News sites are using tables for their
main layout. As we observed, <table> tags are widely used and will continue to be
so (due to compatibility reasons with existing systems).

________________________________________________________________________
7
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

• Structural tag <div> and <span> is gaining popularity


2 of the News sites have converted to using <div> tags for their main layout. The
conversion is likely for speed issues (do not need to wait for pages to load) and
easier maintenance of web pages.

For the rendering position of <div> and <span> tags on the screen, we would have to
look into the Cascading Style Sheet (CSS) files to obtain the layout properties.

• Structural tag <layer> is not used


This tag is obsolete due to increased usage of CSS with the z-index properties.

Thus, we will be focusing on <table>, <div> and <span> structural tags for the
development of PARCELS.

3.1.2 Stylistic Properties

Having determined the dominant structural tags, we proceed with the analysis on the
stylistic properties of the pages.

In (Ivory and Hearst, 2002), (Ivory, 2003), (Yu et al, 2003) and (Cai et al, 2003), we
noted some of the stylistic aspect which will be useful to us. Next, we applied our
knowledge of the stylistic characteristics to News articles today.

Related Work Our Observations

Number of links for different segments Sitemaps, navigation links and related articles’ links are
normally grouped together. Thus, the number of links for that
segment will be significantly higher.

Style of links for different segments The style of links in the navigation bar, related articles and
advertisements are different from the links in the main content
and supporting content.

Color of links for different segments The color of links in the navigation bar is sometimes different
from the links in the main content and supporting content.

________________________________________________________________________
8
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Number of images for different Sitemaps and navigations links are more probable to have
segments more images than other segments. Images located in between
segments with high density text will likely be images
supporting the content.

Size of Images This is sometimes useful in determining whether the image is


an advertisement as they usually come in fixed sizes.

Font styles for different segments Headers, captions and newsletters are in different font styles
from the main content. As such, we will be using these styles
to aid in our classification.

Font sizes for different segments Headers, main content, support contents and captions all have
different font sizes. Font size will be a good indicator for some
of our component classification.

Figure 4: Observations: Related Work vs. News Articles

These characteristics of News sites will act as the foundation for our learning process in
Chapter 6 (Machine Learning and Co-training).

________________________________________________________________________
9
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

4. Analysis of HTML Structures


In this chapter, we will do an analysis on the HTML Control Structure that will be used in
used in PARCELS. As in Control Structure, we are referring to HTML tags that will
affect the layout position of the contents of a page.

4.1 Body
The <body> tag is the trivial case in HTML Structure. All HTML documents must contain
a body tag. This is where the layout of a web page begins.

4.2 Paragraph
The next fundamental tag is the Paragraph <p> tags. The usage of paragraph tag is pretty
straightforward. PARCELS uses these tags to specify the basic unit for a block of text.

4.3 Table Structure


From the analysis we did in Chapter 3, tables are the most prominent structure on the
Web now. Thus, we will place more emphasis on the analysis of table structure. There are
a few papers dedicated to the study of tables. These papers aided us in our analysis of
tables, e.g. (Hurst and Douglas, 1997), (Hurst, 1999) and (Cohen, Hurst and Jenson,
2002).

4.3.1 Flow of Cells

The flow of cells is known as the functional view of a table. Functional view is a
description of how the information presentation aspects of tables embody a decision
structure or reading path. This determines the order of the tables (Hurst et al, 1997).

________________________________________________________________________
10
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

The flow of cells is important as we know there’s a usual layout on how webmasters
design their website. If we follow that flow of layout, it would be significantly easier to
classify the segments correctly.

4.3.2 Grouping of Cells by Value

Grouping of cells by value will aid in understanding how the webmasters intend the
tables to be read (Hurst, 1999). The value is calculated using the type of data located in
the cell and the word density of that particular cell.

Cells of the same value are often closely related to each other. Furthermore, grouping the
cells by value will enable us to understand the contents better. This will ensure a better
classification of that particular segment.

4.3.3 Ordering of Tables by Depth

Nested tables are very common in web designing today. Thus, it is important to find the
relation between nested tables and how each cell affects another cell. First and foremost,
we would need to order each cell by their depth.

From the News site we analyzed, the number of nested tables for each cell is different.
Based on this, we observed it is possible to classify some of the segments using such
characteristics.

4.3.4 Position of Cells

The position of cells is important for information extraction. Table-based extraction is


done by tagging each cell with a visual position tag (Cohen et al, 2002).

With the visual position tag, we will be able to know the position of this cell with respect
to any nested tables. From this, we can gather the following information.

________________________________________________________________________
11
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

• Cut-in header: Cells related to a particular header cell in a table.


• Column header: The overall column header of the table (if any).
• Row header: The overall row header of the table (if any).

Furthermore, there are some layout formats webmasters will follow when designing a
web site. If they are using Content Management System (CMS), these layouts will be
even similar. CMS are scripts that are used by webmasters to generate the layout and
content of the site.

Thus, using the position of cells will be useful in our classification.

4.4 Division / Span Structure


Division <div> and Span <span> are two tags that are gaining popularity in recent times.
Division tag is a Control Structure tag as it will create a Break <br> effect. Span tag may
or may not be a Control Structure tag depending on whether it has a position allocated to
it. We can assign positional value to both Division and Span in CSS.

4.4.1 Position of Division / Span

As we know, the position of a certain block of text is a major factor which contributes to
the structure of a web site; it would be useful in our classification. If a positional value is
allocated, we would need to estimate its position on the screen. For this, we will split the
screen into 9 regions.

Region 1 Region 2 Region 3

Region 4 Region 5 Region 6

Region 7 Region 8 Region 9

Figure 5: 9 Regions on Screen

________________________________________________________________________
12
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

We will locate the maximum position on the X and Y axis. That block of text will be in
region 9. The rest of the regions will be sub divided according to this maximum position.
The remaining blocks of text will be placed into each of these regions according to their
X-Y coordinates.

4.5 Layers Structure


Due to advancement in HTML and CSS specifications, the functionalities of Layers can
be emulated in Division and Span (z-index in CSS). Thus, Layers will be handled in the
same way as Division and Span.

4.5.1 Priority of Layers

Priority of a particular layer depends on the value of z-index. The higher the value of z-
index, the higher it will appear on top of other layers. For example, if layer A has a z-
index of 0, layer B has a z-index of 10 and layer A and B overlap each other, then layer B
will appear on top of layer A.

In terms of classifications, position of each layer is important as mentioned in Section


4.4.1. Furthermore, z-index will also be useful as segments like main contents will likely
be on top of other segments like background images.

4.6 Other Structures


We recognize that there are other structures on the Web today, like Frames and Cross
Documents structure. However, PARCELS focuses on single documents in this release.
Thus, they will be included in future releases.

________________________________________________________________________
13
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

5. Formulation of the PARCELS toolkits


After our analysis on web pages and HTML structures, we would now formulate our
basic toolkits for PARCELS. We would focus on the stylistic engine in this chapter.

5.1 DOM Tree Parser


A DOM Tree parser stimulates how humans parse words on a web page due to its layout.
Since a web page is designed to be read by humans, if we are able to parse it properly, we
will be able to capture the right relationship between segments correctly.

This relationship is what we defined as the “Logical Structure”.

5.1.1 DOM Tree

To manipulate with the styles of web pages, we need their DOM Trees. A DOM Tree is
an n-ary tree in which each node corresponds to an HTML tag on the page. This DOM
Tree is created using the default Node class in Java.

Before we create the DOM Tree, we need to ensure the HTML coding is sound. Due to
the complexity of web pages, there will likely be errors. Thus, a program called Tidy is
used to correct such errors.

Figure 6 shows a web page as viewed in Internet Explorer 6. After parsing it through
Tidy, a DOM Tree is created. Figure 7 shows the partial DOM Tree that is created.

• The parent-child relationship in the tree implies that the child is in the scope of the
parents. Each parent-child relationship loosely means they are directly related.

• The ordering of nodes in the DOM Tree is preserved in that children are ordered
from left to right in the order they appear on that page.

________________________________________________________________________
14
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Figure 6: National Geographic News Site

Document

head body

title meta … table table …

Text tbody …

tr

td td

Text Text

Figure 7: Partial DOM Tree Representation

________________________________________________________________________
15
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

The next section describes how our parser parses the relevant information out of this
DOM Tree.

5.1.2 Parser

This section will focus on the 2 major functions of the DOM Tree parser.

• Splitting of Text from DOM Tree

The parser will split the web page into blocks of text (basic unit). Each block of text
is split in a human readable way, rather than how it is represented in a DOM Tree.

For example, given the following block of HTML codes:


<body><b>Tom Mitchell’s <i>Machine Learning</i> Website</b></body>

The resultant DOM Tree is shown in Figure 8.

body

Tom Mitchell’s i Website

Machine Learning

Figure 8: Resultant DOM Tree Representation

If we were to split it as how the DOM Tree represents it, we will get 3 blocks of
text: “Tom Mitchell’s”, “Machine Learning” and “Website”.

________________________________________________________________________
16
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

However, that’s not how human interpret it. So our algorithm will link up all these
text under the umbrella of a Control Structure. A Control Structure can be a
paragraph <p>, a table cell <td>, a division <div>, a span <span>, a layer <layer> or a
body <body>.

Thus, the block of text will be “Tom Mitchell’s Machine Learning Website”. This
block of text will then be used in the Text Detection Engine of PARCELS (Lai,
2004).

• Parsing of Styles from DOM Tree

For each block of text, the parser will parse its stylistic properties (as mentioned in
Section 3.1.2). For example, given the block of text and DOM Tree in Figure 8, we
would obtain the following characteristics:

Stylistic Properties (in each segment) Our Observations

All the Control Structures containing this • Default Body Structure found
segment
• No other Control Structure found

Position of each Control Structure • Position of Body Structure is trivial

Number and type Link elements • No links were found

Colors of Link formatting • No links were found

Number of Graphics elements • No graphical elements found

Size of Images (if available) • No graphical elements found

Total Word Count • Total word count is 5

Font styles, sizes and colors • 100% of the text are bold

• 40% of the text are in italics

• 100% of the text are of the same size

• 100% of the text are of the same color

Figure 9: Stylistic Observations on Figure 8

________________________________________________________________________
17
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

This is a simple demonstration on what the parser will obtain together with the
block of text. These data will be stored in a Vector for Machine Learning purposes
in the next chapter.

5.2 PARCELS Tags for News Domain


As mentioned earlier, we would focus on the News Domain in this release. Thus, we
would need to identify the labeling of segments that are of interest to us in the News
Domain. These labels are easily extended.

After looking through articles from 12 News sites, we came up with 15 PARCELS labels.
These labels are common components found, like title, date, report name and etc.

When we did the actual annotation for our Machine Learning training data, there are
segments which are useful but do not fall into any of the classes. Thus, 2 more labels
were added, namely sub header and supporting contents.

Presently, these are the 17 labels:

PARCELS Labels Purpose of Labels

Title of article Refers to the headline of the article or phrase that summarizes the
article

Date / Time of article Refers to the date and time the article was published

Reporter name Refers to the author of the article

Source station Refers to the source / provider broadcast station of the news article

Country where news occurred Refers to the country, city or state where the event took place

Image supporting contents of Refers to an image (a photograph, drawing or sketch) related to the
article contents of the article

Links supporting contents of Refers to a hyperlink placed within the main content of the article
article

Main content of article Refers to the main text of the news article

________________________________________________________________________
18
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Supporting content of article Refers to the other text not belonging to the main text of the news
article but is still related to the article, such as captions of images,
captions of audio / video files, any text of sidebars containing
additional information about the article

Sub header Refers to the sub headers which are found within the main content
of the article

Links to related article Refers to links to other news articles which have similar content or
articles previously reported on the same topic, and the date / time of
these articles

Newsletter Refers to text / links prompting the user to sign up for or subscribe to
a newsletter

Site image Refers to images which are part of the web page's design and not
related to the news article

Site content Refers to the text which is part of the web page's content and not
related to the news article

Site links / navigation Refers to links to other parts of the website, including navigation
bars but not related to the news article

Advertisements Refers to advertisements on the website, text ads and banner ads

Search Refers to the text / links related to searching or search options

Figure 10: PARCELS labels and purposes

More information on the textual engine of PARCELS can be found in (Lai, 2004).

After formulating the PARCELS toolkits, we are now able to the retrieve useful
information from web pages. With this information, we shall proceed with our Machine
Learning.

________________________________________________________________________
19
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

6. Machine Learning and Co-Training


For the stylistic engine, we would be using Support Vector Machine (SVM) for our
Machine Learning purposes. For the textual engine, Boostexter would be used (Lai,
2004). For comparison purposes, Boostexter would also be included as an added feature
in the stylistic engine.

6.1 Support Vector Machine


SVM is selected due to some of its core features like its ability to learn is independent of
the dimensionality of the feature vector (Joachims, 1998). For stylistic detection, the size
of the feature vector will grow increasingly with additional domains.

Furthermore, SVM measure the complexity of hypotheses based on the margin with
which they separate the data and not the number of features. This means we can
generalize even in the presence of many features (Joachims, 1998).

Thus, this same margin argument suggests a heuristic for selecting good parameter
settings for the learner. We can input a bias (cost factor) via a parameter. This allows
fully automatic parameter tuning without expensive cross-validation.

6.1.1 PARCELS Labels vs. Stylistic Features

Before we proceed with the feature vector, we must first tabulate the stylistic features
which are likely to affect certain PARCELS labels. This is to cut down on the number of
redundant features.

Labels that are more likely to be classified by the stylistic engine are colored in light grey
while labels that are less likely to be classified are in white.

________________________________________________________________________
20
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

PARCELS Label Stylistic Feature

Title of article • Likely detected by stylistic engine

• Larger font size

• May be in Bold font

• Relatively Top Position

Date / Time of article • Not likely detected by stylistic engine

• Likely grouped with Reporter name

• Smaller font size

• May be in Italics font

Reporter name • Not likely detected by stylistic engine

• Likely grouped with Date / Time

• Smaller font size

• May be in Italics font

Source station • Not likely detected by stylistic engine

• Likely grouped within blocks of text

Country where news occurred • Not likely detected by stylistic engine

• Likely grouped within blocks of text

Image supporting contents of article • Likely detected by stylistic engine

• Image tags grouped with high Word Count

• Position within Main Content

Links supporting contents of article • Likely detected by stylistic engine

• Links tag grouped with high Word Count

• Position within Main Content

Main content of article • Likely detected by stylistic engine

• High Word Count

• Few or no Bold / Italics tags

• Position across News sites is similar

________________________________________________________________________
21
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Supporting content of article • Likely detected by stylistic engine

• High Word Count

• Few or no Bold / Italics tags

• Likely to be using different font color

Sub header • Likely detected by stylistic engine

• Low Word Count

• High percentage in Bold

Links to related article • Likely detected by stylistic engine

• High percentage of links tags

• High percentage of text within <a> tags

• Position across News sites is similar

Newsletter • Not likely detected by stylistic engine

• Low Word Count

• May be in smaller font size

Site image • Likely detected by stylistic engine

• High percentage of links tags

• Low Word Count

• Position across News sites is similar

Site content • Not likely detected by stylistic engine

• Low Word Count

• May be in smaller font size

Site links / navigation • Likely detected by stylistic engine

• High percentage of links tags

• High percentage of text within <a> tags

• May have a high percentage of images

• Smaller font size

• Position across News sites is similar

________________________________________________________________________
22
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Advertisements (Graphical) • Likely detected by stylistic engine

• Low percentage of link tags

• Low percentage of image tags

• Low or no Word Count

• May be in fixed sizes

Search • Not likely detected by stylistic engine

• Low Word Count

• May be in smaller font size

Figure 11: PARCELS labels’ Stylistic Features

Therefore, from Figure 11, we can see that the stylistic engine will likely detect 10 out of
the 17 labels with high confidence. We will verify this in Chapter 7 (Analysis of Results).
The textual engine will handle the rest of the labels with high confidence (Lai, 2004).
With the observed stylistic features, we now formulate our feature vector.

6.1.2 Formulation of Feature Vector

Having identified the prominent stylistic features, we came up with the feature vector for
each segment of text.

Characteristics Detection Number of features

Detection of tables and the flow of cells as mentioned in Chapter 4.3. We 11


observed the number of nested tables does not go beyond 3 levels on the
Web. Thus, we set the limit to 4 nested tables.

Detection of position of div, span and layer as mentioned in Chapter 4.4. This 3
refers to which of the 9 regions on screen are they in.

Detection of stylistic properties of the segment, like the percentage of bold 8


text, the percentage of italics text, etc.

Detection of weight factor of the segment. These features capture the number 4
of <br> tags and word count. Good for detecting labels like Main Content.

________________________________________________________________________
23
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Detection of links ratio in that particular segment of the web page. Useful in 3
detecting labels like Related Links and Navigation Links.

Detection of images and their sizes. Good for advertisements and site links 2
detection.

Total Features 31

Figure 12: Breakdown of features in Feature Vector

One feature vector will be outputted for every text block segment and stored in a single
file. In the event when one of the features does not apply, they will be left out. This file
will then be trained or classified.

For the full listing on the contents of the feature vector, please refer to Appendix B.

6.2 Boostexter
Boostexter is used mainly in the textual engine (Lai, 2004). For comparison, we have
included a module to convert our SVM feature vector to Boostexter. The PARCELS
labels and formulation of feature vector for Boostexter will be similar to Section 6.1.

6.3 Co-Training
In Machine Learning, unlabelled examples are significantly easier to come by than label
ones. The idea of co-training is to use this large unlabelled sample to boost the
performance of our learning algorithm since a small set of labeled examples is available
to us due to manpower issues (Blum and Mitchell, 1998).

6.3.1 Stylistic and Textual Engine

Since we have 2 engines, one based on styles and one based on text, which are almost
mutual exclusive to each other, it is logical to apply co-training to achieve a better result.

________________________________________________________________________
24
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Add top k positive data

SVM Training Data SVM Trainer SVM Classifier


Classify New Data

Remove top k positive data


Unlabelled Data Co-training Module

Classify New Data


BT Training Data BT Trainer BT Classifier

Add top k positive data

Figure 13: Co-Training Flowchart

Firstly, there are two sets of training data for the both SVM and Boostexter. After
learning is performed on these training data, the SVM classifier will classify the
unlabeled examples. Top k examples with the highest positive confidence will be
collected after classification. The same applies for the Boostexter classifier.

PARCELS will then compare the top k examples between the two classifiers. The
overlapping examples will be selected first. For example,

SVM Classifier (Top k) Boostexter Classifier (Top k)

Segment Classification Segment Classification

5 Main Content 6 Main Content

6 Related Links 5 Main Content

10 Main Content 11 Main Content

11 Main Content 42 Main Content

1 Main Content 56 Main Content

Figure 14: Example of Co-Training Classification

________________________________________________________________________
25
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

From Figure 14, Segment 5 and 11 will be selected. Segments with two different
classifications will be ignored. Thus, Segment 6 will be ignored. Since we are using two
separate Machine Learners, it is not accurate to compare the level of confidence between
two learners.

Since we are looking for top 5 (k = 5 in this example), we will take the top examples
from each classifier until 5 is reached. Therefore, Segment 10, 42 and 1 will be included.
These 5 examples will then be removed from the unlabelled examples and added to the
training set for both SVM and Boostexter.

The whole process will repeat for the number of times specified.

________________________________________________________________________
26
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

7. Analysis of Results
With the PARCELS engines working, it would be useful to put it into real use by
applying existing web pages in the News domain.

7.1 Preliminary Investigation of Results


As we know, it is not easy to obtain huge amounts of labeled data in a short time. Thus,
with a training set of 5 news sites, we would try to obtain some preliminary results.

Labels Initial Round 1 Round 2 Round 3 Round 4


Precision / Recall P R P R P R P R P R
Overall 70.1 70.1 70.1 70.1 71.9 71.9 64.9 64.9 64.9 64.9

Main Content 87.8 100.0 84.8 96.5 84.3 93.1 72.9 93.1 72.9 93.1

Related Link 100.0 50.0 75.0 50.0 100.0 66.6 100.0 33.3 100.0 33.3

Site Content 33.3 100.0 38.4 100.0 50.0 80.0 57.1 80.0 57.1 80.0

Navigation Links 75.0 60.0 50.0 60.0 36.3 80.0 22.2 40.0 22.2 40.0

Figure 15: Preliminary Investigation of Results

With the available training data, the result peaked in the second round of co-training. An
overall precision and recall of 71.9% is a good result. In fact, if we do a comparison on
the papers (Chapter 2) that uses their own representation tree, like HTML Struct Tree, our
results are on par and even higher in some instances.

Not surprisingly, Main Content is the easiest to classify with its recall constantly over
90% and precision over 70%. However, as we know main content extraction is already
heavily researched. Thus, we are more interested in the other labels.

Related links comes a close second with precision of 100% but with a rather low recall.
The low recall is due to the lack of training data. The same can be said for Navigation
Links as they both belong to the Links class.

________________________________________________________________________
27
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

On the other hand, Site Content has an increasing precision and a high recall. The
increasing precision reflects that co-training works well for our classification given
sufficient amount of training data.

We are certain that the results can be a lot better. First of all, not all the labels are
available in the training set. Those labels not mentioned in Figure 15 are the ones we do
not have positive examples of.

Nevertheless, an annotation engine is already in place to handle this problem (Lai, 2004).

7.2 Detailed Investigation of Results

More labeled examples from the annotation engine will be available in near future. With
more labeled examples, we are very confident that the precision and recall will be
improved further.

We will be providing the detailed investigation of results on our PARCELS website on


Sourceforge.

________________________________________________________________________
28
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

8. Conclusion
We started this paper with a mission to design something that classifies segments of web
pages into useful information. PARCELS has done just that. Inputting a News article into
PARCELS will give us a Logical Structure of how the web page is designed. With this
Logical Structure, we are able to obtain the relevant segments of the web page.

We have done numerous background analyses on how humans perceive web pages in real
life before designing such a system. Thus, we are pretty confident PARCELS will be able
to classify with accuracy and precision.

With the working system now, we can apply it to many tasks; some of which mentioned
in this thesis. PARCELS is easily extendable. This was one of the design considerations
of PARCELS – to be able to handle pages from different domains. Since the system is
designed modularly, extending it to other domains poses no problem at all.

PARCELS is currently released under GPL on Sourceforge. http://parcels.sourceforge.net

8.1 Future Work


Nevertheless, there is always room for further work. Below are some of the areas which
can be improved.

• Ability to handle Frame Structure


This will ensure we have a complete system that can parse virtually any web pages
on the Web today,

• Ability to handle Cross Documents Structure


This will give PARCELS the ability the view web pages as clusters instead of
individual pages. This is important since most web pages today are part of major
web site.

________________________________________________________________________
29
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

• Improved CSS support


Although there is CSS support in this release, our primary focus is still HTML tags.
CSS is gaining popularity and it is useful to take a look at CSS.

• JavaScript support
We did not handle JavaScript in this release. It would be good to have as JavaScript
does affect the layout of pages in some instances (though rarely).

________________________________________________________________________
30
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

References
Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-
training. In COLT: Proceedings of the Workshop on Computational Learning
Theory, Morgan Kaufmann Publishers, 1998.

Cai, D., Yu, S., Wen, J.R. and Ma, W.Y. (2003). Extracting Content Structure for Web
Pages Based on Visual Representation. In the 5 th Asia Pacific Web Conference,
Xi’an, 2003, pp. 406 – 417

Cohen, W.W., Hurst, M. and Jenson, L.S. (2002). A flexible learning system for
wrapping tables and lists in HTML documents. In the 11 th WWW Conference,
Hawaii, 2002, pp. 232 – 241

DiPasquo, D. (1998). Using HTML formatting to aid in natural language processing on


the World Wide Web. Senior thesis, Computer Science Department, Carnegie
Mellon University, 1998.

Gupta, S., Kaiser, G.E., Neistadt, D. and Grimm, P. (2003). DOM-based content
extraction of HTML documents. In the 12 th WWW Conference, Budapest, 2003, pp.
207 – 214

Hurst, M. (1999). Layout and Language: Beyond Simple Text for Information Interaction
– Modeling the Table. In the 2nd International Conference on Multi-model
Interfaces, 1999.

Hurst, M. and Douglas, S. (1997). Layout & Language: Preliminary experiments in


assigning logical structure to table cells. In the Applied Natural Language
Processing Conference, Washington, 1997, pp 217 – 220

________________________________________________________________________
31
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Ivory, M.Y. and Hearst, M.A. (2002). Statistical Profiles of Highly-Rated Web Sites. In
ACM Conference on Human Factors in Computing Systems, CHI Letters, 2002, pp.
367-374.

Ivory, M.Y. (2003). Characteristics of Web Site Designs: Reality vs. Recommendation.
In HCI International Conference, 2003.

Joachims, T. (1998). Text categorization with Support Vector Machines: Learning with
many relevant features. In Machine Learning: ECML-98, Tenth European
Conference on Machine Learning, pp. 137 – 142.

Lai, S. (2004). PARCELS: PARser for Content Extraction and Logical Structure (Text
Deduction). Honours Thesis, School of Computing, National University of
Singapore, 2004.

Yu, S., Cai, D., Wen, J.R. and Ma, W.Y. (2003). Improving pseudo-relevance feedback
in web information retrieval using web page segmentation. In the 12 th WWW
Conference, Budapest, 2003, pp. 11 – 18

________________________________________________________________________
32
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Appendix A – Distribution of HTML Structural Tags


The table below shows the distribution of HTML Control tags for 12 popular News sites
on the Web today.

Percentage of tags embed with

News Sites Total Tags <table> <div> <span> <layer>

Index 464 97.1% 32.1% 4.0% 0%


National Geographic
(Jan 2004)
Articles 447 97.2% 50.2% 2.0% 0%

Index 454 96.0% 29.9% 5.2% 0%


National Geographic
(Jan 2002)
Articles 337 96.3% 39.3% 2.9% 0%

Index 608 98.1% 3.9% 2.7% 0%


Straits Times
(Jan 2004)
Articles 425 96.7% 12.4% 1.1% 0%

Index 619 96.4% 10.1% 0% 0%


Straits Times
(May 2000)
Articles 411 95.3% 12.0% 0% 0%

Index 943 97.1% 31.2% 6.6% 0%


CNN
(Jan 2004)
Articles 649 95.4% 49.0% 2.4% 0%

Index 1358 99.1% 22.1% 8.0% 0%


CNN
(Jun 2000)
Articles 934 98.2% 36.4% 12.1% 0%

Index 1012 98.6% 6.9% 3.3% 0%


Yahoo News
(Jan 2004)
Articles 769 98.0% 6.8% 4.4% 0%

Index 929 98.7% 0% 0% 0%


Yahoo News
(Oct 2000)
Articles 506 96.3% 0% 0% 0%

Index 622 88.5% 60.1% 0.8% 0%


BBC
(Jan 2004)
Articles 479 85.0% 48.2% 1.4% 0%

Index 589 95.7% 19.6% 18.6% 0%


BBC
(Feb 2000)
Articles 465 87.2% 31.4% 9.0% 0%

________________________________________________________________________
A1
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Index 1116 98.8% 0% 0% 0%


Slashdot
(Jan 2004)
Articles 2894 99.3% 0.3% 0% 0%

Index 914 96.2% 0% 0% 0%


Slashdot
(May 2000)
Articles 2459 98.9% 0% 0% 0%

Index 661 0% 97.1% 8.4% 0%


CNet
(Jan 2004)
Articles 431 4.6% 95.9% 2.0% 0%

Index 744 98.3% 98.4% 17.4% 0%


CNet
(Jan 2002)
Articles 403 97.0% 97.2% 12.3% 0%

Index 402 5.9% 94.7% 9.7% 0%


Wired News
(Jan 2004)
Articles 217 9.4% 81.7% 13.3% 0%

Index 857 98.2% 0% 0% 0%


Wired News
(Feb 2000)
Articles 410 93.4% 0% 0% 0%

Index 786 98.8% 54.8% 0.3% 0%


The Register
(Jan 2004)
Articles 412 97.8% 15.7% 0% 0%

Index 679 98.2% 0% 0% 0.1%


The Register
(Feb 2000)
Articles 247 95.1% 0% 0% 0.4%

Index 426 97.6% 5.1% 0% 0%


Space.com
(Jan 2004)
Articles 586 93.6% 1.1% 0% 0%

Index 616 98.5% 12.0% 0% 0%


Space.com
(Mar 2000)
Articles 598 98.2% 13.0% 0% 0%

Index 1036 88.0% 97.4% 29.1% 0%


MSNBC
(Jan 2004)
Articles 538 77.9% 97.3% 15.4% 0%

Index 315 95.8% 31.7% 9.2% 0%


MSNBC
(Feb 2001)
Articles 458 94.8% 0.1% 0% 0%

Index 417 97.3% 98.0% 0% 0%


Arabic News
(Jan 2004)
Articles 213 95.3% 3.7% 0% 0%

________________________________________________________________________
A2
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Index 426 95.3% 0% 0% 0%


Arabic News
(Feb 2000)
Articles 133 86.4% 0% 0% 0%

Figure 16: Distribution of HTML Structure Tags

________________________________________________________________________
A3
PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)
________________________________________________________________________

Appendix B – Listing of Features in Feature Vector


1: Depth of current cell in the nested tables
2: Row number of current cell in outermost table
3: Row number of current cell in the 1st nested table
4: Row number of current cell in the 2nd nested table
5: Row number of current cell in the 3rd nested table
6: Row number of current cell in the 4th nested table
7: Column number of current cell in outermost table
st
8: Column number of current cell in the 1 nested table
9: Column number of current cell in the 2nd nested table
10: Column number of current cell in the 3rd nested table
th
11: Column number of current cell in the 4 nested table
12: Position (1 – 9) of div
13: Position (1 – 9) of span
14: Position (1 – 9) of layer
15: Percentage of words in <i> in this segment
16: Percentage of words in <i> in this segment over all words under <i> in article
17: Percentage of words in <b> in this segment
18: Percentage of words in <b> in this segment over all words under <b> in article
19: Percentage of words in <u> in this segment
20: Percentage of words in <u> in this segment over all words under <u> in article
21: Percentage of words in <font> in this segment
22: Percentage of words in <font> in this segment over all words under <font> in article
23: Number of <br> tags in this segment
24: Percentage of <br> in this segment over all <br> in article
25: Number of words in this segment
26: Percentage of words in this segment over all words in article
27: Percentage of words in <a> in this segment
28: Number of <a> tags in this segment
29: Percentage of <a> in this segment over all <a> in article
30: Number of <img> tags in this segment
31: Percentage of <img> in this segment over all <img> in article

________________________________________________________________________
B1

You might also like