stylistic

Published by: sarvan21 on Feb 27, 2011
Copyright:Attribution Non-commercial


Compared with other existing techniques, the new approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure.

This paper also uses the VIPS algorithm described previously. The main difference is that it handles other control structures such as tables (Figure 1). We still believe that a DOM-based tree, built with an improved DOM tree parser that handles tables, layers and frames, will be more effective while preserving the generality of the web page. We compare the two approaches in Chapter 7 (Analysis of Results).

Lastly, we examine their heuristics for stylistic detection:

• In (Yu et al., 2003) and (Cai et al., 2003), the heuristics used to split content into blocks are as follows:

PARCELS: PARser for Content Extraction and Logical Structure (Stylistic Detection)


Cue         Purpose
Tag Cue     HTML tags like … are used to split the web page into blocks of text.
Color Cue   A difference in background color between two blocks of text means they belong to different blocks.
Text Cue    If one block contains HTML tags and the other does not, they belong to different blocks.
Size Cue    A difference in font size between two blocks of text means they belong to different blocks.

Figure 2: Cues for segmentation of web pages
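
The four cues in Figure 2 can be read as independent tests on a pair of adjacent pieces of content. The following is a minimal Python sketch under stated assumptions, not the VIPS or PARCELS implementation: the `Block` fields, the function names and the `split_tags` parameter are hypothetical, the set of splitting tags must be supplied by the caller because the tag examples are missing from the text above, and the text cue is interpreted as "one block contains HTML tags and the other does not".

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Block:
    """Hypothetical summary of one piece of rendered content."""
    leading_tag: str        # tag that introduces the content (lowercase)
    background_color: str   # e.g. "#ffffff"
    font_size: int          # e.g. in pixels
    contains_tags: bool     # True if the content holds nested HTML tags


def fired_cues(a: Block, b: Block, split_tags: FrozenSet[str]) -> List[str]:
    """Return the names of the cues that place a and b in different blocks."""
    cues = []
    # Tag cue: b starts with one of the caller-supplied splitting tags.
    if b.leading_tag in split_tags:
        cues.append("tag")
    # Color cue: the background colors differ.
    if a.background_color != b.background_color:
        cues.append("color")
    # Text cue (assumed reading): exactly one side contains HTML tags.
    if a.contains_tags != b.contains_tags:
        cues.append("text")
    # Size cue: the font sizes differ.
    if a.font_size != b.font_size:
        cues.append("size")
    return cues


def should_split(a: Block, b: Block, split_tags: FrozenSet[str]) -> bool:
    """Split as soon as any single cue fires (an assumed combination rule)."""
    return bool(fired_cues(a, b, split_tags))
```

For example, with the placeholder set `frozenset({"hr"})`, a heading rendered at 18 px followed by body text at 12 px on the same background fires only the size cue. Treating any single cue as sufficient is a design choice of this sketch; the source does not say how the cues are combined.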

Visual separator detection uses the following patterns:

Pattern     Purpose
Distance    Distance between the blocks on both sides of the separator
Tag         Position of tags like …
Font        Difference in formatting on both sides of the separator
Color       Difference in background color on both sides of the separator

Figure 3: Patterns for Visual Separator Detection
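
One way to picture how the patterns in Figure 3 could be combined is as a single weight per separator, where a larger weight suggests a stronger visual boundary. The Python sketch below is illustrative only: the `Side` fields, the `separator_weight` function and all bonus values are hypothetical, and the additive scoring is an assumption rather than the rule used by VIPS or PARCELS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Side:
    """Hypothetical style summary of the content on one side of a separator."""
    font_size: int
    font_weight: str
    background_color: str


def separator_weight(
    gap_px: int,
    above: Side,
    below: Side,
    at_split_tag: bool,
    tag_bonus: int = 10,    # assumed value, not from the source
    font_bonus: int = 5,    # assumed value, not from the source
    color_bonus: int = 5,   # assumed value, not from the source
) -> int:
    """Score a separator from the four patterns: distance, tag, font, color."""
    # Distance: a wider gap between the two sides gives a heavier separator.
    weight = gap_px
    # Tag: a splitting tag sits at the separator's position.
    if at_split_tag:
        weight += tag_bonus
    # Font: the formatting differs across the separator.
    if (above.font_size, above.font_weight) != (below.font_size, below.font_weight):
        weight += font_bonus
    # Color: the background color differs across the separator.
    if above.background_color != below.background_color:
        weight += color_bonus
    return weight
```

Separators could then be ranked by weight, with the heaviest ones treated as block boundaries first.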

The heuristics and detection methods used in VIPS are very useful and are included in PARCELS. We further extended the set of heuristics to suit our purposes. The full set is given in Chapter 5 (Formulation of PARCELS toolkits).

As we can see, most related work focuses on how humans interpret web pages based on their layout and stylistic properties. This is also the main focus of our stylistic engine. We therefore proceed with our analysis of the layout and stylistic properties of real web pages.

