P. 1
stylistic

stylistic

|Views: 127|Likes:
Published by sarvan21

More info:

Published by: sarvan21 on Feb 27, 2011
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

12/03/2012

pdf

text

original

Sections

  • 1. Motivations
  • 1.1 Applications of PARCELS
  • 2. Related Works
  • Figure 1: VIPS Representation
  • 3. Background Analysis
  • 3.1 Analysis of News sites
  • 3.1.1 Structural Properties
  • 3.1.2 Stylistic Properties
  • 4. Analysis of HTML Structures
  • 4.1 Body
  • 4.2 Paragraph
  • 4.3 Table Structure
  • 4.3.1 Flow of Cells
  • 4.3.2 Grouping of Cells by Value
  • 4.3.3 Ordering of Tables by Depth
  • 4.3.4 Position of Cells
  • 4.4 Division / Span Structure
  • 4.4.1 Position of Division / Span
  • 4.5 Layers Structure
  • 4.5.1 Priority of Layers
  • 4.6 Other Structures
  • 5. Formulation of the PARCELS toolkits
  • 5.1 DOM Tree Parser
  • 5.1.1 DOM Tree
  • 5.1.2 Parser
  • 5.2 PARCELS Tags for News Domain
  • 6. Machine Learning and Co-Training
  • 6.1 Support Vector Machine
  • 6.1.1 PARCELS Labels vs. Stylistic Features
  • 6.1.2 Formulation of Feature Vector
  • 6.2 Boostexter
  • 6.3 Co-Training
  • 6.3.1 Stylistic and Textual Engine
  • 7. Analysis of Results
  • 7.1 Preliminary Investigation of Results
  • 7.2 Detailed Investigation of Results
  • 8. Conclusion
  • 8.1 Future Work
  • References

Honours Year Project Report

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)

By

Lee Chee How

Department of Computer Science School of Computing National University of Singapore 2003 / 2004

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)

________________________________________________________________________

Honours Year Project Report

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)

By

Lee Chee How

Department of Computer Science School of Computing National University of Singapore 2003 / 2004

Project: Advisor:

H79010 Assistant Professor Kan Min-Yen

Deliverables: Report: Program: 1 Volume 1 CD

________________________________________________________________________
i

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)

________________________________________________________________________

Abstract
Web documents that look similar often use different HTML tags to achieve their layout effect. These tags make it difficult for a machine to find text or images of interest. PARCELS is a Java backend system designed to classify different components of a web page by obtaining the logical structure from its layout and stylistic characteristics. Each component is given a PARCELS label which identifies its purpose, like Main Content, Links, etc. A recall and precision of 71.9% is obtained. This backend system is currently released under GPL.

Subject Descriptors:

I.5.2 I.5.4 I.7.2 I.7.5

Style guides Text Processing Format and notation Document Analysis

Keywords:

Text Processing, Logical structure of web pages, Segmentation of web pages, Classification of web page components / segments

Implementation Software and Hardware:

Intel Pentium Centrino processor, 512 MB DDR RAM, Windows / UNIX / Linux, Java SDK 1.4.2

________________________________________________________________________
ii

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)

________________________________________________________________________

Acknowledgement
There are many people whom helped me accomplish the work for this thesis. I would like to thank my partner, Sandra, who has worked along side with me in formulating the PARCELS toolkits. She is in charge of the textual engine.

I would also like to thank the people who release their tools and codes under GNU Public License (GPL) or Open Source. They have made our work on PARCELS easier by not re-inventing the wheel. In return, PARCELS is released under GPL as well.

Over the past year, there’s one person whom I owe the greatest thanks. During the year, Assistant Professor Kan Min-Yen truly took me under his wing and guided me in countless ways. I would like to thank him for all his support and patience I have received for the past year.

________________________________________________________________________
iii

......................................9 9 Regions on Screen.............................................................................15 Partial DOM Tree Representation.........................16 Stylistic Observations on Figure 8 .............. News Articles .....25 Example of Co-Training Classification....................23 Breakdown of features in Feature Vector.......................PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ List of Figures Figure 1: Figure 2: Figure 3: Figure 4: Figure 5: Figure 6: Figure 7: Figure 8: Figure 9: Figure 10: Figure 11: Figure 12: Figure 13: Figure 14: Figure 15: Figure 16: VIPS Representation .................................................17 PARCELS labels and purposes ....................................................................27 Distribution of HTML Structure Tags...........................12 National Geographic News Site .........................25 Preliminary Investigation of Results ......................19 PARCELS labels’ Stylistic Features .................................................................................15 Resultant DOM Tree Representation .................................. A3 ________________________________________________________________________ iv ...................................................................................................................................................................................................24 Co-Training Flowchart..........6 Observations: Related Work vs...................................................................................................................................6 Patterns for Visual Separator Detection ..................................................5 Cues for segmentation of web pages .......................................................................................................

............................2 DOM Tree Parser......................................................................................................1 4.....1 4....iv Table of Contents..................1...... Applications of PARCELS ..........................................................................................................................3..................................................................................3.............................1 4..................................................2 5................................13 5...........................6 Body...............................................................10 Paragraph ............1 4...............................................................8 4......................ii Acknowledgement .......................................... iii List of Figures...... Formulation of the PARCELS toolkits ................. Motivations ...............................12 Position of Division / Span..2 Analysis of News sites ....................................... Analysis of HTML Structures..............18 ________________________________________________________________________ v ................................ 3.10 Grouping of Cells by Value.....................................................1 3..3 Background Analysis .............5.........................7 Structural Properties...........10 Table Structure..............................................................................1 2..............................................................................................12 Layers Structure ......................................................................................................................................................................2 4.................................14 5.........10 Flow of Cells.......................................................................................1 Related Works .................................................................................................................................................................................1...............................................................................1 5.............................................................................................4 4....................4 4.....................14 DOM Tree...11 Position of Cells ............................................1.............................................................................................................................PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Table of Contents Abstract ........................1 3..........................................10 4........2 4................................................v 1.....................................7 3...5 4................................1....................1 5...................................................1 1...................3...........................3 4...........................................11 Ordering of Tables by Depth............4.................................................................................7 Stylistic Properties ....................................................................14 Parser .....16 PARCELS Tags for News Domain .......3....................................13 Priority of Layers ...........11 Division / Span Structure .................13 Other Structures .......3 4....................................................

.............................1................................................................................24 Co-Training......................... B1 ________________________________________________________________________ vi ..........................1 6.........1.....28 8..................24 Analysis of Results ..........20 6...................29 References ........................................20 PARCELS Labels vs................................ Conclusion ....................1 6...............23 Boostexter ..............................................................................................27 7.............................................1 7............1 7...............................29 8......................1 Future Work......................2 6.......2 Preliminary Investigation of Results .................20 Formulation of Feature Vector ........................................................................27 Detailed Investigation of Results ...................3.........................................2 6.........3 6...........................................................31 Appendix A – Distribution of HTML Structural Tags ................................................................. Stylistic Features............24 Stylistic and Textual Engine.......... Support Vector Machine ..............................................................................................................................................................................PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 6........................................... A1 Appendix B – Listing of Features in Feature Vector .............................................................. Machine Learning and Co-Training ...................................................

countless projects were targeted at ways of using machines to harness this knowledge.g. It is true that web pages are increasing at an exponential rate.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 1. Essentially. Cai. Pages that look similar often use different HTML tags to achieve their layout effects and this makes it difficult for a machine to find relevant fields of interest. Kaiser.g. (Yu. due to the semi-structured nature of raw HTML pages. it currently does not live up to this definition. Thus. 1. accessing relevant information is by no means an easy task. Coded in Java. PARCELS is a backend system designed to address this problem. Web pages of any nature can be applied to PARCELS. Motivations Although the Web Wide Web (WWW) is defined as “the universe of network-accessible information. 1998) by the World Wide Web Consortium (W3C). 2003) and (Gupta. Yu.1 Applications of PARCELS PARCELS can be applied to many real life applications today: • Efficient web page reading ________________________________________________________________________ 1 . 2003). Neistadt and Grimm. none of the solutions address the problem exactly. 1998). In this documentation. we will be discussing on the stylistic detection engine. PARCELS consists of 2 engines – One on the textual detection of web pages (Lai. 2003). PARCELS is executable on major platforms like Windows. Wen and Ma. UNIX and Linux. e. the embodiment of human knowledge” (DiPasquo. Wen and Ma. 2004). (DiPasquo. e. (Cai. Semantic Web. the other on the stylistic detection of web pages. However. However. There are existing projects which tackle part of the problem.

In the next few chapters. ________________________________________________________________________ 2 . we can target the search and mining on the relevant components. This allows better access of “the universe of network-accessible information. Thus. we can remove items of lower priorities like site links. like reading a News articles without all the advertisements. With PARCELS. PARCELS greatly improves the effectiveness and efficiency of our everyday tasks. This greatly improves the efficiency of retrieving the relevant information on the WWW. • Improving results for Web Search / Mining With information on the layout of the pages. • Reducing size for Data Archiving Given the number of web pages today. With this classification. • Classification of web page cluster through web design We can identify web pages of the same nature using the logical structure of the web pages. Following that. the embodiment of human knowledge” mentioned earlier.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ With PARCELS. we will do an analysis on prominent web pages and their HTML structures. we can extract the title and/or main body of any web pages. we will touch on how we formulate the PARCELS toolkits which is our engine. advertisements and images of no relevance. archiving these pages will require enormous amounts of disk space. This greatly improves the accuracy and efficiency of the search. we can proceed with other tasks like archiving the documents in the correct domain. As we can see. reducing the size needed. Lastly. we will describe our Machine Learning techniques and perform an analysis on the results.

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 2. Related Works Our research is closely related to several areas which have been heavily worked on. Next. In this section. the results are better than those not using the HTML Struct Tree. we noted from related works that most papers modified the representation of the DOM Tree: • In (DiPasquo.4%. they modified the DOM Tree and called it the HTML Struct Tree. Thus.6% and recall of 64. As PARCELS is divided into textual and stylistic engines.1%. we must reaffirm our basis that there is information in the layout and stylistic properties of web pages: • In (DiPasquo. For companies’ names. and that by looking at the HTML formatting in addition to the text on a page. The HTML Struct Tree is intended to represent how a human would parse word relationships on a web page due to its layout.0% and recall of 27. the thesis reported a precision of 71. we will look at some of the more prominent research on text processing and information retrieval on the Web. As web pages are meant to be ________________________________________________________________________ 3 . For companies’ locations. Firstly. 1998). the thesis reported a precision of 66. As we can see. We reaffirm the belief that the stylistic properties are useful in classifying segments of web pages. layout improves the classification. 1998). The focus of the thesis was to extract the names of companies from the web pages and the locations where they have operations in. the thesis argues that there is information in the layout of a web page. one can improve performance in tasks such as learning to classify segments of documents.

They proposed that by using local feedback to add keywords from top ranking documents. Using the HTML Struct Tree. we will be using the normal DOM Tree. a Machine Learner was applied. they are able to improve the precision and recall for second round search. It simulates how a user understands web layout structure based on his visual perception. there is one difficulty they faced: Accurate feedback is impossible due to noises like navigation. VIPS uses a series of heuristics and detection to achieve this. if parsed correctly. As a result. the relationship between objects on the page can be captured. the paper proposes a new web content structure based on visual representation of web pages. 2003). decoration and interaction. the paper describes a new method to detect the semantic content of a web page. The new structure presents an automatic top-down. we made modifications to our parsers instead. their new method uses the VIsionsbased Page Segmentation (VIPS) algorithm. This ensures easier portability of our codes for other uses. the result is better than those not using the HTML Struct Tree. we will focus on the main framework. Thus. ________________________________________________________________________ 4 . • In (Yu et al. In this release. The main drawback of this paper is that they are handling web pages with linear style (no tables or layers). From the tree. • In (Cai et al. tag-tree independent approach to detect web content structure. VIPS is used to eliminate such noises by using a tree to represent the flow of text. However.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ read by humans. As mentioned earlier. it is observed that topics of similar content are grouped together. Instead of a DOM Tree method. The usage of feedback described in this paper is good but it will not be included this version of PARCELS. 2003). Instead of changing the structure.

The main difference is they are handling other Control Structures like tables (Figure 1). the heuristics used to split contents into blocks are as follows: ________________________________________________________________________ 5 . layers and frames will be more effective.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Figure 1: VIPS Representation Comparing to other existing techniques. 2003). we still feel that using DOM-based Tree with an improved DOM Tree parser to handle tables. This paper is also using the VIPS algorithm as described previously. we would need to observe their heuristics on stylistic detection: • In (Yu et al. Personally. 2003) and (Cai et al. Lastly. while preserving the generic-ness of the web page. We would do a comparison on this in the Chapter 7 (Analysis of Results). the new approach is independent to underlying documentation representation such as HTML and works well even when the HTML structure is far different from layout structure.

Thus.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Cue Tag Cue Color Cue Purpose HTML tags like <HR> are used to split the web page into blocks of text. Difference in the background colors between 2 blocks of text means they belongs to 2 different blocks. Size Cue Difference in the font sizes between 2 blocks of text means they belongs to 2 different blocks. We further extended the set of heuristics to suit our purposes. The detailed set can be found in Chapter 5 (Formulation of PARCELS toolkits). ________________________________________________________________________ 6 . This is our main focus for the stylistic engine as well. Figure 2: Cues for segmentation of web pages Visual Separator Detection used: Pattern Distance Tag Font Color Purpose Distance between blocks on both sides of separator Position of tags like <HR> Difference in formatting on both sides of separator Difference in background color on both sides of separator Figure 3: Patterns for Visual Separator Detection The heuristics and detection methods used in VIPS are very useful and are included in PARCELS. As we can see. Text Cue If one block contains HTML tags and the other does. most of the related works focus on how humans interpret the web pages based on the layout and stylistic properties. it means they belong to 2 different blocks. we shall proceed with our analysis on the layout and stylistic properties of real web pages.

In Year 2000. 3. We will be contrasting these 2 analyses to pin point the trend of layout structural tags usage. • Structural tag <table> is most widely used In Year 2004. and that parsing the HTML formatting appropriately can improve traditional text processing (DiPasquo. ________________________________________________________________________ 7 . News articles are usually reader friendly but not machine friendly. 1998). a total of 12 popular news sites were selected. The News sites were analyzed for their structural and stylistic properties.1 Structural Properties A study was made to find out the dominant layout structure of these sites. 3. we decided to carry out an analysis on News web pages.1 Analysis of News sites To facilitate our analysis.1. we found out that 10 out of the 12 News sites are using tables as their dominant layout structure. it makes sense to use this domain as a starting point for our analysis. Figure 16 in Appendix A gives the breakdown of the statistics of <table>. we would be able to parse pages of other domains with relative ease. A related study was also made to News articles in Year 2000.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 3. Furthermore. As we observed. With the ability to parse news articles. <table> tags are widely used and will continue to be so (due to compatibility reasons with existing systems). Thus. <div>. Background Analysis Based on the argument that there is information inherent in the layout of each page. articles in the News domain are feature rich with complex layout structures. Articles are also updated daily and read by millions of people. <span> and <layer> tags used. all the News sites are using tables for their main layout.

navigation links and related articles’ links are normally grouped together. we proceed with the analysis on the stylistic properties of the pages. (Ivory. we applied our knowledge of the stylistic characteristics to News articles today. Next. we noted some of the stylistic aspect which will be useful to us. • Structural tag <layer> is not used This tag is obsolete due to increased usage of CSS with the z-index properties. the number of links for that segment will be significantly higher. In (Ivory and Hearst. Related Work Number of links for different segments Our Observations Sitemaps. 2003).PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ • Structural tag <div> and <span> is gaining popularity 2 of the News sites have converted to using <div> tags for their main layout. Color of links for different segments The color of links in the navigation bar is sometimes different from the links in the main content and supporting content.2 Stylistic Properties Having determined the dominant structural tags. Style of links for different segments The style of links in the navigation bar. we would have to look into the Cascading Style Sheet (CSS) files to obtain the layout properties. 3. <div> and <span> structural tags for the development of PARCELS. For the rendering position of <div> and <span> tags on the screen. we will be focusing on <table>. (Yu et al. Thus. Thus.1. 2003). 2003) and (Cai et al. The conversion is likely for speed issues (do not need to wait for pages to load) and easier maintenance of web pages. ________________________________________________________________________ 8 . related articles and advertisements are different from the links in the main content and supporting content. 2002).

Font styles for different segments Headers. As such. support contents and captions all have different font sizes.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Number of images for different segments Sitemaps and navigations links are more probable to have more images than other segments. we will be using these styles to aid in our classification. Font size will be a good indicator for some of our component classification. main content. Font sizes for different segments Headers. News Articles These characteristics of News sites will act as the foundation for our learning process in Chapter 6 (Machine Learning and Co-training). ________________________________________________________________________ 9 . Images located in between segments with high density text will likely be images supporting the content. Figure 4: Observations: Related Work vs. captions and newsletters are in different font styles from the main content. Size of Images This is sometimes useful in determining whether the image is an advertisement as they usually come in fixed sizes.

4. 4. As in Control Structure.1 Flow of Cells The flow of cells is known as the functional view of a table. Analysis of HTML Structures In this chapter. 2002). PARCELS uses these tags to specify the basic unit for a block of text. we are referring to HTML tags that will affect the layout position of the contents of a page. we will place more emphasis on the analysis of table structure.3 Table Structure From the analysis we did in Chapter 3. 4. There are a few papers dedicated to the study of tables. 4.1 Body The <body> tag is the trivial case in HTML Structure. (Hurst and Douglas. Thus. 1997). This is where the layout of a web page begins. 1997). (Hurst.3. tables are the most prominent structure on the Web now. All HTML documents must contain a body tag.g. Hurst and Jenson. Functional view is a description of how the information presentation aspects of tables embody a decision structure or reading path.2 Paragraph The next fundamental tag is the Paragraph <p> tags. These papers aided us in our analysis of tables. e.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 4. This determines the order of the tables (Hurst et al. ________________________________________________________________________ 10 . we will do an analysis on the HTML Control Structure that will be used in used in PARCELS. The usage of paragraph tag is pretty straightforward. 1999) and (Cohen.

it would be significantly easier to classify the segments correctly. we can gather the following information. we observed it is possible to classify some of the segments using such characteristics. we will be able to know the position of this cell with respect to any nested tables.3 Ordering of Tables by Depth Nested tables are very common in web designing today.3.2 Grouping of Cells by Value Grouping of cells by value will aid in understanding how the webmasters intend the tables to be read (Hurst. 4.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ The flow of cells is important as we know there’s a usual layout on how webmasters design their website. First and foremost. If we follow that flow of layout. ________________________________________________________________________ 11 . we would need to order each cell by their depth. The value is calculated using the type of data located in the cell and the word density of that particular cell. Based on this. From the News site we analyzed. From this. 1999).3. grouping the cells by value will enable us to understand the contents better.3. This will ensure a better classification of that particular segment. the number of nested tables for each cell is different. Cells of the same value are often closely related to each other. Thus. 4. With the visual position tag. it is important to find the relation between nested tables and how each cell affects another cell. 4. Table-based extraction is done by tagging each cell with a visual position tag (Cohen et al. Furthermore. 2002).4 Position of Cells The position of cells is important for information extraction.

For this. CMS are scripts that are used by webmasters to generate the layout and content of the site.4. Thus. If they are using Content Management System (CMS). we would need to estimate its position on the screen. Region 1 Region 4 Region 7 Region 2 Region 5 Region 8 Region 3 Region 6 Region 9 Figure 5: 9 Regions on Screen ________________________________________________________________________ 12 . Furthermore.1 Position of Division / Span As we know. there are some layout formats webmasters will follow when designing a web site.4 Division / Span Structure Division <div> and Span <span> are two tags that are gaining popularity in recent times. it would be useful in our classification. using the position of cells will be useful in our classification. the position of a certain block of text is a major factor which contributes to the structure of a web site. we will split the screen into 9 regions. 4. 4.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ • • • Cut-in header: Cells related to a particular header cell in a table. Span tag may or may not be a Control Structure tag depending on whether it has a position allocated to it. these layouts will be even similar. We can assign positional value to both Division and Span in CSS. Column header: The overall column header of the table (if any). If a positional value is allocated. Row header: The overall row header of the table (if any). Division tag is a Control Structure tag as it will create a Break <br> effect.

For example. the higher it will appear on top of other layers.1. The higher the value of zindex. However.1 Priority of Layers Priority of a particular layer depends on the value of z-index. they will be included in future releases. In terms of classifications. then layer B will appear on top of layer A.5.6 Other Structures We recognize that there are other structures on the Web today. like Frames and Cross Documents structure. the functionalities of Layers can be emulated in Division and Span (z-index in CSS). The rest of the regions will be sub divided according to this maximum position.4. The remaining blocks of text will be placed into each of these regions according to their X-Y coordinates. Layers will be handled in the same way as Division and Span. position of each layer is important as mentioned in Section 4. Thus.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ We will locate the maximum position on the X and Y axis. z-index will also be useful as segments like main contents will likely be on top of other segments like background images. 4. if layer A has a zindex of 0. That block of text will be in region 9.5 Layers Structure Due to advancement in HTML and CSS specifications. layer B has a z-index of 10 and layer A and B overlap each other. 4. Thus. Furthermore. PARCELS focuses on single documents in this release. ________________________________________________________________________ 13 . 4.

a DOM Tree is created. we will be able to capture the right relationship between segments correctly. 5. Before we create the DOM Tree. A DOM Tree is an n-ary tree in which each node corresponds to an HTML tag on the page. we would now formulate our basic toolkits for PARCELS. This DOM Tree is created using the default Node class in Java. Thus.1. Figure 7 shows the partial DOM Tree that is created. Figure 6 shows a web page as viewed in Internet Explorer 6. Since a web page is designed to be read by humans. This relationship is what we defined as the “Logical Structure”. • The ordering of nodes in the DOM Tree is preserved in that children are ordered from left to right in the order they appear on that page. there will likely be errors. we need their DOM Trees. Due to the complexity of web pages. We would focus on the stylistic engine in this chapter.1 DOM Tree To manipulate with the styles of web pages. • The parent-child relationship in the tree implies that the child is in the scope of the parents. a program called Tidy is used to correct such errors. we need to ensure the HTML coding is sound. Each parent-child relationship loosely means they are directly related. ________________________________________________________________________ 14 .1 DOM Tree Parser A DOM Tree parser stimulates how humans parse words on a web page due to its layout. if we are able to parse it properly.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 5. After parsing it through Tidy. Formulation of the PARCELS toolkits After our analysis on web pages and HTML structures. 5.

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Figure 6: National Geographic News Site Document head body title meta … table table … Text tbody … tr td td Text Text Figure 7: Partial DOM Tree Representation ________________________________________________________________________ 15 .

“Machine Learning” and “Website”. For example. rather than how it is represented in a DOM Tree.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ The next section describes how our parser parses the relevant information out of this DOM Tree.1. 5. ________________________________________________________________________ 16 . we will get 3 blocks of text: “Tom Mitchell’s”. • Splitting of Text from DOM Tree The parser will split the web page into blocks of text (basic unit).2 Parser This section will focus on the 2 major functions of the DOM Tree parser. given the following block of HTML codes: <body><b>Tom Mitchell’s <i>Machine Learning</i> Website</b></body> The resultant DOM Tree is shown in Figure 8. Each block of text is split in a human readable way. body b Tom Mitchell’s i Website Machine Learning Figure 8: Resultant DOM Tree Representation If we were to split it as how the DOM Tree represents it.

given the block of text and DOM Tree in Figure 8. a table cell <td>.2). sizes and colors • • • • • • • • • • • • Our Observations Default Body Structure found No other Control Structure found Position of Body Structure is trivial No links were found No links were found No graphical elements found No graphical elements found Total word count is 5 100% of the text are bold 40% of the text are in italics 100% of the text are of the same size 100% of the text are of the same color Figure 9: Stylistic Observations on Figure 8 ________________________________________________________________________ 17 .1. Thus. So our algorithm will link up all these text under the umbrella of a Control Structure.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ However. the block of text will be “Tom Mitchell’s Machine Learning Website”. • Parsing of Styles from DOM Tree For each block of text. 2004). a span <span>. a layer <layer> or a body <body>. For example. the parser will parse its stylistic properties (as mentioned in Section 3. This block of text will then be used in the Text Detection Engine of PARCELS (Lai. that’s not how human interpret it. a division <div>. we would obtain the following characteristics: Stylistic Properties (in each segment) All the Control Structures containing this segment Position of each Control Structure Number and type Link elements Colors of Link formatting Number of Graphics elements Size of Images (if available) Total Word Count Font styles. A Control Structure can be a paragraph <p>.

date. Thus. there are segments which are useful but do not fall into any of the classes. we would need to identify the labeling of segments that are of interest to us in the News Domain.2 PARCELS Tags for News Domain As mentioned earlier. these are the 17 labels: PARCELS Labels Title of article Purpose of Labels Refers to the headline of the article or phrase that summarizes the article Date / Time of article Reporter name Source station Country where news occurred Image supporting contents of article Links supporting contents of article Main content of article Refers to the date and time the article was published Refers to the author of the article Refers to the source / provider broadcast station of the news article Refers to the country. 2 more labels were added. report name and etc. drawing or sketch) related to the contents of the article Refers to a hyperlink placed within the main content of the article Refers to the main text of the news article ________________________________________________________________________ 18 . namely sub header and supporting contents. These labels are easily extended.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ This is a simple demonstration on what the parser will obtain together with the block of text. When we did the actual annotation for our Machine Learning training data. Presently. like title. These data will be stored in a Vector for Machine Learning purposes in the next chapter. we came up with 15 PARCELS labels. These labels are common components found. 5. we would focus on the News Domain in this release. Thus. city or state where the event took place Refers to an image (a photograph. After looking through articles from 12 News sites.

After formulating the PARCELS toolkits. we are now able to the retrieve useful information from web pages. any text of sidebars containing additional information about the article Sub header Refers to the sub headers which are found within the main content of the article Links to related article Refers to links to other news articles which have similar content or articles previously reported on the same topic. including navigation bars but not related to the news article Advertisements Search Refers to advertisements on the website. and the date / time of these articles Newsletter Refers to text / links prompting the user to sign up for or subscribe to a newsletter Site image Refers to images which are part of the web page's design and not related to the news article Site content Refers to the text which is part of the web page's content and not related to the news article Site links / navigation Refers to links to other parts of the website. With this information. 2004). text ads and banner ads Refers to the text / links related to searching or search options Figure 10: PARCELS labels and purposes More information on the textual engine of PARCELS can be found in (Lai. such as captions of images. ________________________________________________________________________ 19 . captions of audio / video files.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Supporting content of article Refers to the other text not belonging to the main text of the news article but is still related to the article. we shall proceed with our Machine Learning.

This means we can generalize even in the presence of many features (Joachims. Machine Learning and Co-Training For the stylistic engine. this same margin argument suggests a heuristic for selecting good parameter settings for the learner. Boostexter would also be included as an added feature in the stylistic engine. the size of the feature vector will grow increasingly with additional domains. 6. For comparison purposes. For the textual engine. ________________________________________________________________________ 20 . 1998). This allows fully automatic parameter tuning without expensive cross-validation. We can input a bias (cost factor) via a parameter. This is to cut down on the number of redundant features. SVM measure the complexity of hypotheses based on the margin with which they separate the data and not the number of features. we must first tabulate the stylistic features which are likely to affect certain PARCELS labels. 2004). Furthermore. 6. 1998).PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 6. Boostexter would be used (Lai.1.1 Support Vector Machine SVM is selected due to some of its core features like its ability to learn is independent of the dimensionality of the feature vector (Joachims. we would be using Support Vector Machine (SVM) for our Machine Learning purposes. For stylistic detection. Stylistic Features Before we proceed with the feature vector. Thus. Labels that are more likely to be classified by the stylistic engine are colored in light grey while labels that are less likely to be classified are in white.1 PARCELS Labels vs.

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ PARCELS Label Title of article Stylistic Feature • • • • Date / Time of article • • • • Reporter name • • • • Source station • • Country where news occurred • • Image supporting contents of article • • • Links supporting contents of article • • • Main content of article • • • • Likely detected by stylistic engine Larger font size May be in Bold font Relatively Top Position Not likely detected by stylistic engine Likely grouped with Reporter name Smaller font size May be in Italics font Not likely detected by stylistic engine Likely grouped with Date / Time Smaller font size May be in Italics font Not likely detected by stylistic engine Likely grouped within blocks of text Not likely detected by stylistic engine Likely grouped within blocks of text Likely detected by stylistic engine Image tags grouped with high Word Count Position within Main Content Likely detected by stylistic engine Links tag grouped with high Word Count Position within Main Content Likely detected by stylistic engine High Word Count Few or no Bold / Italics tags Position across News sites is similar ________________________________________________________________________ 21 .

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Supporting content of article • • • • Sub header • • • Links to related article • • • • Newsletter • • • Site image • • • • Site content • • • Site links / navigation • • • • • • Likely detected by stylistic engine High Word Count Few or no Bold / Italics tags Likely to be using different font color Likely detected by stylistic engine Low Word Count High percentage in Bold Likely detected by stylistic engine High percentage of links tags High percentage of text within <a> tags Position across News sites is similar Not likely detected by stylistic engine Low Word Count May be in smaller font size Likely detected by stylistic engine High percentage of links tags Low Word Count Position across News sites is similar Not likely detected by stylistic engine Low Word Count May be in smaller font size Likely detected by stylistic engine High percentage of links tags High percentage of text within <a> tags May have a high percentage of images Smaller font size Position across News sites is similar ________________________________________________________________________ 22 .

Characteristics Detection Detection of tables and the flow of cells as mentioned in Chapter 4.4. We observed the number of nested tables does not go beyond 3 levels on the Web.3. 6. Thus. like the percentage of bold text. the percentage of italics text. we set the limit to 4 nested tables. 2004). from Figure 11. we came up with the feature vector for each segment of text. Detection of weight factor of the segment. Detection of position of div. we now formulate our feature vector. span and layer as mentioned in Chapter 4. These features capture the number of <br> tags and word count. we can see that the stylistic engine will likely detect 10 out of the 17 labels with high confidence. The textual engine will handle the rest of the labels with high confidence (Lai. We will verify this in Chapter 7 (Analysis of Results). With the observed stylistic features.1. This refers to which of the 9 regions on screen are they in. Detection of stylistic properties of the segment. Good for detecting labels like Main Content. Number of features 11 3 8 4 ________________________________________________________________________ 23 . etc.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Advertisements (Graphical) • • • • • Search • • • Likely detected by stylistic engine Low percentage of link tags Low percentage of image tags Low or no Word Count May be in fixed sizes Not likely detected by stylistic engine Low Word Count May be in smaller font size Figure 11: PARCELS labels’ Stylistic Features Therefore.2 Formulation of Feature Vector Having identified the prominent stylistic features.

________________________________________________________________________ 24 . Good for advertisements and site links detection. 2004). Useful in detecting labels like Related Links and Navigation Links. 6.1. 6. which are almost mutual exclusive to each other. Detection of images and their sizes. The PARCELS labels and formulation of feature vector for Boostexter will be similar to Section 6. it is logical to apply co-training to achieve a better result. we have included a module to convert our SVM feature vector to Boostexter. The idea of co-training is to use this large unlabelled sample to boost the performance of our learning algorithm since a small set of labeled examples is available to us due to manpower issues (Blum and Mitchell. For the full listing on the contents of the feature vector. please refer to Appendix B. For comparison. 1998). 6.3. This file will then be trained or classified.2 Boostexter Boostexter is used mainly in the textual engine (Lai. In the event when one of the features does not apply.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Detection of links ratio in that particular segment of the web page. Total Features 31 2 3 Figure 12: Breakdown of features in Feature Vector One feature vector will be outputted for every text block segment and stored in a single file.1 Stylistic and Textual Engine Since we have 2 engines.3 Co-Training In Machine Learning. they will be left out. one based on styles and one based on text. unlabelled examples are significantly easier to come by than label ones.

Top k examples with the highest positive confidence will be collected after classification. After learning is performed on these training data. For example. The same applies for the Boostexter classifier. the SVM classifier will classify the unlabeled examples. The overlapping examples will be selected first. PARCELS will then compare the top k examples between the two classifiers.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Add top k positive data SVM Training Data SVM Trainer SVM Classifier Classify New Data Remove top k positive data Unlabelled Data Co-training Module Classify New Data BT Training Data BT Trainer BT Classifier Add top k positive data Figure 13: Co-Training Flowchart Firstly. there are two sets of training data for the both SVM and Boostexter. SVM Classifier (Top k) Segment 5 6 10 11 1 Classification Main Content Related Links Main Content Main Content Main Content Boostexter Classifier (Top k) Segment 6 5 11 42 56 Classification Main Content Main Content Main Content Main Content Main Content Figure 14: Example of Co-Training Classification ________________________________________________________________________ 25 .

42 and 1 will be included. Segment 10.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ From Figure 14. ________________________________________________________________________ 26 . Since we are looking for top 5 (k = 5 in this example). Since we are using two separate Machine Learners. Therefore. Segment 5 and 11 will be selected. Segments with two different classifications will be ignored. These 5 examples will then be removed from the unlabelled examples and added to the training set for both SVM and Boostexter. it is not accurate to compare the level of confidence between two learners. we will take the top examples from each classifier until 5 is reached. Thus. Segment 6 will be ignored. The whole process will repeat for the number of times specified.

0 50.3 80. it is not easy to obtain huge amounts of labeled data in a short time.0 R 64.4 50.0 Round 1 P 70. with a training set of 5 news sites. we are more interested in the other labels. we would try to obtain some preliminary results.3 75.9 93.3 Round 3 P 64. The same can be said for Navigation Links as they both belong to the Links class.8 100.0 36.1 66.1 22. as we know main content extraction is already heavily researched.0 R 71.3 100.1 22. our results are on par and even higher in some instances.9 93.0 50.0 40.0 57.1 33. it would be useful to put it into real use by applying existing web pages in the News domain.9 100.9 72.9 72.0 80.9% is a good result.0 70.6 80.0 57. The low recall is due to the lack of training data. ________________________________________________________________________ 27 . like HTML Struct Tree.0 38.1 100.9 93.0 33.0 Round 2 P 71. Labels Precision / Recall Overall Main Content Related Link Site Content Navigation Links P Initial R 70. Thus. Related links comes a close second with precision of 100% but with a rather low recall.1 Preliminary Investigation of Results As we know.1 87. Not surprisingly.0 Figure 15: Preliminary Investigation of Results With the available training data. However.9 100.1 96.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 7.8 75.3 80.0 60. Thus.1 33.1 84.9 84.0 100. In fact.0 40. 7. if we do a comparison on the papers (Chapter 2) that uses their own representation tree. An overall precision and recall of 71.0 60.0 100.5 50.2 R 70. the result peaked in the second round of co-training.2 Round 4 P 64. Analysis of Results With the PARCELS engines working.0 R 64. Main Content is the easiest to classify with its recall constantly over 90% and precision over 70%.

We will be providing the detailed investigation of results on our PARCELS website on Sourceforge. First of all. The increasing precision reflects that co-training works well for our classification given sufficient amount of training data. We are certain that the results can be a lot better. not all the labels are available in the training set. Nevertheless. an annotation engine is already in place to handle this problem (Lai. With more labeled examples. Those labels not mentioned in Figure 15 are the ones we do not have positive examples of. ________________________________________________________________________ 28 . 7.2 Detailed Investigation of Results More labeled examples from the annotation engine will be available in near future. we are very confident that the precision and recall will be improved further. Site Content has an increasing precision and a high recall.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ On the other hand. 2004).

net 8. PARCELS has done just that. With the working system now.sourceforge. we are able to obtain the relevant segments of the web page. We have done numerous background analyses on how humans perceive web pages in real life before designing such a system. we are pretty confident PARCELS will be able to classify with accuracy and precision. This was one of the design considerations of PARCELS – to be able to handle pages from different domains. This is important since most web pages today are part of major web site. Below are some of the areas which can be improved. • Ability to handle Cross Documents Structure This will give PARCELS the ability the view web pages as clusters instead of individual pages. PARCELS is currently released under GPL on Sourceforge. there is always room for further work. extending it to other domains poses no problem at all. Inputting a News article into PARCELS will give us a Logical Structure of how the web page is designed. we can apply it to many tasks. Since the system is designed modularly. With this Logical Structure. some of which mentioned in this thesis. http://parcels.1 Future Work Nevertheless.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ 8. ________________________________________________________________________ 29 . Thus. PARCELS is easily extendable. • Ability to handle Frame Structure This will ensure we have a complete system that can parse virtually any web pages on the Web today. Conclusion We started this paper with a mission to design something that classifies segments of web pages into useful information.

________________________________________________________________________ 30 . CSS is gaining popularity and it is useful to take a look at CSS. our primary focus is still HTML tags. It would be good to have as JavaScript does affect the layout of pages in some instances (though rarely). • JavaScript support We did not handle JavaScript in this release.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ • Improved CSS support Although there is CSS support in this release.

Neistadt. pp. pp. (2003). and Ma. L. and Mitchell. Cai. M. 2003. (1997). P. D. G. In the 5 th Asia Pacific Web Conference. 2002. (2003). A flexible learning system for wrapping tables and lists in HTML documents. A. T. In the 12 th WWW Conference. D. Carnegie Mellon University. (1998). D.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ References Blum. J. and Grimm. and Douglas. 1998. In the 11 th WWW Conference. Wen. Combining labeled and unlabeled data with cotraining. pp. S.E. 207 – 214 Hurst. In COLT: Proceedings of the Workshop on Computational Learning Theory. Layout and Language: Beyond Simple Text for Information Interaction – Modeling the Table. (2002). Senior thesis. Gupta. Hurst. Washington.S. Computer Science Department.. W.R. Yu. S. Hurst. Xi’an.. W.. 2003. Kaiser. 232 – 241 DiPasquo. Extracting Content Structure for Web Pages Based on Visual Representation. In the Applied Natural Language Processing Conference. (1999). (1998). M. M.W. Using HTML formatting to aid in natural language processing on the World Wide Web. 406 – 417 Cohen. In the 2nd International Conference on Multi-model Interfaces. Budapest.Y.. Layout & Language: Preliminary experiments in assigning logical structure to table cells. S. and Jenson. 1999. Hawaii. 1998. 1997. pp 217 – 220 ________________________________________________________________________ 31 .. DOM-based content extraction of HTML documents. Morgan Kaufmann Publishers.

Ivory. (1998). (2002). Tenth European Conference on Machine Learning. W. S.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Ivory. and Ma. In Machine Learning: ECML-98..Y. Joachims.R. (2004). Honours Thesis. and Hearst. 2003. In ACM Conference on Human Factors in Computing Systems. National University of Singapore. Cai. Yu.Y. D. 367-374. pp. S. T. (2003). 2004. Lai. M. In HCI International Conference. (2003). Budapest. School of Computing. Wen. Characteristics of Web Site Designs: Reality vs.A. Recommendation. CHI Letters. In the 12 th WWW Conference. M.. Improving pseudo-relevance feedback in web information retrieval using web page segmentation. Statistical Profiles of Highly-Rated Web Sites. M. 11 – 18 ________________________________________________________________________ 32 . 137 – 142. PARCELS: PARser for Content Extraction and Logical Structure (Text Deduction). 2003. pp. 2002. Text categorization with Support Vector Machines: Learning with many relevant features.Y. pp. J.

0% 2.9% 39.3% 4.6% 9.1% 96.5% 85.2% 96. Percentage of tags embed with News Sites Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Total Tags 464 447 454 337 608 425 619 411 943 649 1358 934 1012 769 929 506 622 479 589 465 <table> 97.4% <span> 4.4% 0% 0% 0.2% 2.0% 22.3% 97.9% 2.9% 12.4% 99.4% 6.7% 96.7% 96.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Appendix A – Distribution of HTML Structural Tags The table below shows the distribution of HTML Control tags for 12 popular News sites on the Web today.2% 29.2% 98.0% 31.4% 10.7% 1.0% 96.2% 19.1% 50.4% 18.1% 48.0% 98.3% 98.2% <div> 32.4% 8.6% 98.6% 31.1% 0% 0% 6.1% 12.1% 36.1% 95.1% 98.0% 5.6% 2.4% 95.1% 97.1% 3.9% 6.7% 87.3% 3.8% 0% 0% 60.2% 49.3% 88.0% 12.0% 95.0% <layer> 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% National Geographic (Jan 2004) National Geographic (Jan 2002) Straits Times (Jan 2004) Straits Times (May 2000) CNN (Jan 2004) CNN (Jun 2000) Yahoo News (Jan 2004) Yahoo News (Oct 2000) BBC (Jan 2004) BBC (Feb 2000) ________________________________________________________________________ A1 .8% 1.

1% 1.3% 97.0% 3.1% 95.4% 98.3% 9.7% 0% 0% 5.1% 97.8% 97.com (Jan 2004) Space.7% 0% 0% 54.9% 95.9% 0% 4.4% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Slashdot (May 2000) CNet (Jan 2004) CNet (Jan 2002) Wired News (Jan 2004) Wired News (Feb 2000) The Register (Jan 2004) The Register (Feb 2000) Space.0% 77.3% 95.6% 93.0% 97.3% 96.0% 17.8% 15.7% 0.8% 98.3% 0% 0% 0% 0% 0% 0% 0% 29.4% 9.4% 97.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Slashdot (Jan 2004) Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles Index Articles 1116 2894 914 2459 661 431 744 403 402 217 857 410 786 412 679 247 426 586 616 598 1036 538 315 458 417 213 98.1% 98.2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0.1% 15.8% 97.8% 94.9% 98.0% 5.7% 13.2% 98.1% 0.4% 97.4% 12.3% 0% 0% 0.8% 99.2% 88.4% 2.5% 98.9% 9.0% 13.2% 94.7% 0% 0% 0% 0% 8.2% 95.6% 98.2% 93.6% 98.4% 98.1% 12.3% 0% 0.7% 81.3% 31.com (Mar 2000) MSNBC (Jan 2004) MSNBC (Feb 2001) Arabic News (Jan 2004) ________________________________________________________________________ A2 .3% 0% 0% 97.

4% 0% 0% 0% 0% 0% 0% Figure 16: Distribution of HTML Structure Tags ________________________________________________________________________ A3 .3% 86.PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Arabic News (Feb 2000) Index Articles 426 133 95.

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection) ________________________________________________________________________ Appendix B – Listing of Features in Feature Vector 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: Depth of current cell in the nested tables Row number of current cell in outermost table Row number of current cell in the 1st nested table Row number of current cell in the 2nd nested table Row number of current cell in the 3rd nested table Row number of current cell in the 4th nested table Column number of current cell in outermost table Column number of current cell in the 1 nested table Column number of current cell in the 2nd nested table Column number of current cell in the 3rd nested table Column number of current cell in the 4 nested table Position (1 – 9) of div Position (1 – 9) of span Position (1 – 9) of layer Percentage of words in <i> in this segment Percentage of words in <i> in this segment over all words under <i> in article Percentage of words in <b> in this segment Percentage of words in <b> in this segment over all words under <b> in article Percentage of words in <u> in this segment Percentage of words in <u> in this segment over all words under <u> in article Percentage of words in <font> in this segment Percentage of words in <font> in this segment over all words under <font> in article Number of <br> tags in this segment Percentage of <br> in this segment over all <br> in article Number of words in this segment Percentage of words in this segment over all words in article Percentage of words in <a> in this segment Number of <a> tags in this segment Percentage of <a> in this segment over all <a> in article Number of <img> tags in this segment Percentage of <img> in this segment over all <img> in article th st ________________________________________________________________________ B1 .

You're Reading a Free Preview

Download
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->