
Web Page Segmentation Evaluation ∗

Andrés Sanoja, Université Pierre et Marie Curie, 4 place Jussieu, Paris, France (andres.sanoja@lip6.fr)
Stéphane Gançarski, Université Pierre et Marie Curie, 4 place Jussieu, Paris, France (stephane.gancarski@lip6.fr)

∗ This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

ABSTRACT

In this paper, we present a framework for evaluating segmentation algorithms for Web pages. Web page segmentation consists in dividing a Web page into coherent fragments, called blocks. Each block represents one distinct information element in the page. We define an evaluation model that includes different metrics to evaluate the quality of a segmentation obtained with a given algorithm. Those metrics compute the distance between the obtained segmentation and a manually built segmentation that serves as a ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BoM, BlockFusion, VIPS and jVIPS) on several categories (types) of Web pages. Results show that the tested algorithms usually perform rather well for text extraction, but may have serious problems for the extraction of geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.

1. INTRODUCTION

Web pages are becoming more complex than ever, as they are usually not designed manually but generated by Content Management Systems (CMS). Thus, automatically identifying different elements of Web pages, such as main content, menus, user comments or advertising, becomes difficult. A solution to this issue is given by web page segmentation. Web page segmentation refers to the process of dividing a Web page into visually and semantically coherent segments called blocks. Detecting these different blocks is a crucial step for many applications, such as mobile applications [20], information retrieval [5] and web archiving [14], among others. In the context of Web archiving, segmentation can be used to extract interesting parts to be stored. By giving relative weights to blocks according to their importance, it also allows the detection of relevant changes (changes in important blocks) between distinct versions of a page [12]. This is useful for crawling optimization, as it allows tuning crawlers so that they revisit pages with important changes more often [15]. It also helps controlling curation actions, by comparing the page version before and after the action. Several web page segmentation methods have been proposed over the last decades, as detailed in Section 2, most of them coming along with an ad hoc evaluation method.

When studying the literature about Web page segmentation, we noticed that there is no full comparative evaluation of Web page segmentation algorithms. The different approaches cannot be directly compared due to a wide diversity of goals, the lack of a common dataset, and the lack of meaningful quantitative evaluation schemes. Hence, there is a need to define common measures to better evaluate Web page segmentation algorithms. To this end, we also investigate, in Section 2, evaluation methods from the related and widely studied area of document processing systems.

In this paper, we propose a new Web page segmentation evaluation method based on a ground truth. Whichever algorithm is used to segment a web page, the output can be expressed in terms of block geometry, block content, and page layout. In our framework, these are all taken into account to evaluate the segmentation. We define a model and a method for the evaluation of web page segmentation, adapted from [17], which was designed for scanned page segmentation. Their work allows measuring the quality of a segmentation using a block correspondence graph between two segmentations (the computed one and the ground truth). Blocks are represented as rectangles, associated with the quantity of elements that these regions cover. Four representative web page segmentation algorithms were used to perform the evaluation. A dataset was built as a ground truth, with pages being manually segmented and assigned to one page type (blog, forum, ...). We also present a method for ground truth construction that eases the annotation (manual segmentation) of web pages. Each segmentation algorithm was evaluated with respect to the ground truth. Computed segmentations and ground truth segmentations are compared according to the six metrics of our evaluation method. Results show that the ranking among algorithms may vary according to the Web page type.

This paper is organized as follows. In the next section, we study the related work. Section 3 presents the segmentation algorithms that we evaluated. Section 4 describes the evaluation model. In Section 5, we describe the experimental setup and the collection used.
Section 6 presents the experiments and results. Section 7 concludes the paper.

2. RELATED WORK

In this section, we first present the different approaches for Web page segmentation, then we study the state of the art for the evaluation of segmentation correctness.

2.1 Web page segmentation

One of the first efforts in structuring the information of Web pages was the creation of wrappers for information extraction [1]. The authors intended to use those wrappers to apply existing database models and query languages to semi-structured documents (e.g. web pages). The concept of fragment is introduced by [2] as "a portion of a web page which has a distinct theme or functionality and is distinguishable from the other parts of the page". Moreover, the authors introduce the notion of "interesting fragments", those shared with other pages served from the same web site. Different terms have been used in the literature as synonyms of fragment: web elements, logical unit, block, sub-page, segment, component, coherent region, pagelet, among others.

Since the early 2000s, web page segmentation has been a very active research area. Several works were published, showing that the visual aspect of the page is a key for segmenting Web pages. For instance, [3] divides a Web page into nine segments using a decision tree based on an information gain measure and geometric features. [5] describes the Vision-based Page Segmentation (VIPS) algorithm, which computes, for each candidate block, a Degree of Coherence (DoC) using heuristic rules based on the DOM representation of a page as well as visual features. Blocks are generated when their DoC meets a predefined value (which is a parameter of the algorithm). [13] presents an adaptation of the VIPS algorithm using Java. It follows the same heuristics as the original algorithm. However, the results are not exactly equal because there are differences in the detection of vertical separators.

[8] proposes a Web page analysis method, based on support vector machine rules, to detect parts of the page that belong to high-level content blocks (header, body, footer or sidebars) and, for each of those parts, applies explicit and implicit separation detection heuristics to refine blocks.

Later on, the need for formalizing the problem to go beyond heuristic methods inspired other approaches. [7] faced the page segmentation issue from a graph-theoretic point of view. DOM nodes are organized into a complete weighted graph and the edge weights estimate the costs needed to gather the connected nodes into one block. [10] represents atomic content units of a Web page by a quantitative linguistic measure of text density and reduces the segmentation problem to solving a 1D-partitioning task. They propose an iterative block fusion algorithm (BlockFusion), applying methods adapted from computer vision. [16] presents the Block-o-Matic (BoM) algorithm. It is a hybrid approach based on the vision-based Web page segmentation approach and the document processing model from the computer vision domain. The segmentation is presented in three structures: the DOM, content and logic structures. Each one represents a different perspective of the segmentation, the logic structure being the final segmentation.

To sum up, several approaches have been developed for page segmentation, most of them described in [21]. However, a question remains: how well do these approaches correctly identify segments in modern Web pages? This question raises the issue of evaluating page segmentation methods.

2.2 Segmentation correctness evaluation

Different interpretations of correctness can be used with respect to segmentation. As defined in the literature, the correctness of an algorithm is asserted when it complies with a given specification. The problem here is that such a specification cannot be established a priori, without human judgement. Thus, we focus on evaluation approaches based on a ground truth. We also investigated the correctness issue in the related domain of scanned page segmentation, since the issue of evaluating such systems is quite similar to our problem.

Segmentation issues have been addressed for almost thirty years in the optical character recognition (OCR) domain [6]. Automatic evaluation of (scanned) page segmentation algorithms is a very well studied topic. Several authors have obtained good results in performance and accuracy evaluation, as well as in measuring quality assurance [9, 22, 4].

There are common problems in the evaluation of Web page and scanned page segmentation algorithms: the lack of a common dataset, a wide diversity of goals and applications, a lack of meaningful quantitative evaluation, and inconsistencies in the use of document models. This observation led us to closely study how segmentation is evaluated for scanned pages.

Although Web pages and scanned pages are different (pixels/colors vs. elements/text), the way they are analysed and the result of their segmentation are similar. In both cases, blocks can be organized as a hierarchy or a set of non-overlapping rectangles (Manhattan layout [19]). Given the nature of Web content, almost all the algorithms represent the final segmentation of a Web page as a Manhattan layout. It can be hierarchical [5] or non-hierarchical [7, 10]. The latter can be obtained from the former by only considering the leaves.

There is a wide range of work on automatic evaluation based on a predefined ground truth in the literature. We highlight the work of [17], which measures the quality of a page segmentation by analysing the errors in the text recognized by OCR. [17] presents a vectorial score that identifies the common classes of segmentation errors using a ground truth of annotated scanned pages. Our work is inspired by this paper.

None of the mentioned approaches for Web pages provides an evaluation method that allows direct comparison with others. Authors usually do a qualitative evaluation of their algorithms, asking human assessors to validate the segmentation. Others use analytical approaches, measuring the performance of their algorithms, for example, with cluster correlation metrics such as the Adjusted Rand Index (AdjRand) or Normalized Mutual Information (NMI). Those metrics are well suited for checking whether the segmentation preserves the textual content of pages. The problem is that they do not take the geometric properties into account.

Another important issue is that the datasets are not publicly available (or do not exist any more). Some authors provide access to their tools, but this is not the general case. An interesting experience is the work of [11]. They present a method for quantitative comparison of semantic Web page segmentation algorithms.
They also provide two datasets of annotated Web pages, publicly available (https://github.com/rkrzr/dataset-popular and https://github.com/rkrzr/dataset-random). This approach mainly uses text content comparison to perform the match between ground truth and segmentation blocks, which is not enough, since the geometry of blocks plays a key role in the segmentation.

3. SEGMENTATION ALGORITHMS

In this section we give a short description of the segmentation algorithms we evaluated.

3.1 BoM (Block-o-Matic)

BoM [16] uses the geometric aspects of a page (W) and a categorization of DOM elements to perform the segmentation. The categorization is the one specified in the HTML5 content categories. DOM elements are evaluated by these categories instead of by their tag names or their attributes. First the elements are filtered, excluding some categories, which produces a segmentation of small blocks. This fine-grained segmentation is used to find composite blocks, which form the layout of the page. Small blocks which are covered by composite blocks are associated with the latter. In a second round, the small blocks are merged following heuristic rules, producing blocks with a normalized block rectangle area greater than or equal to the granularity parameter rD. The outcome is a segmentation tree where the root node is a special composite block that represents the whole page, non-terminal nodes represent the composite blocks, and terminal nodes represent the blocks.

3.2 BF (BlockFusion)

The BlockFusion algorithm [10] uses the text density as a valuable heuristic to segment documents. The text density is calculated by taking the number of words in the text and dividing it by the number of lines, where a line is capped to 80 characters. An HTML document is first preprocessed into a list of atomic text blocks, and the density is computed for each atomic block.

Iteratively, two adjacent blocks are merged if their text densities are below a certain threshold ϑmax. The value of this threshold represents the granularity of the segmentation. The authors report that its optimal value is ϑmax ≈ 0.38 and we take it as is.
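As an illustration of this densitometric rule, the following sketch (ours, not the BoilerPipe implementation) computes the words-per-line density of a text block under the 80-character line convention and performs one fusion pass using the merge rule stated above; text_density, fuse and WRAP_WIDTH are names introduced only for the example.

```python
import math

WRAP_WIDTH = 80          # a line is capped to 80 characters
THETA_MAX = 0.38         # merge threshold reported as optimal by the authors

def text_density(text: str) -> float:
    """Words per wrapped line, the densitometric measure used by BlockFusion."""
    words = text.split()
    if not words:
        return 0.0
    lines = max(1, math.ceil(len(text) / WRAP_WIDTH))  # 80-character lines needed
    return len(words) / lines

def fuse(blocks: list[str]) -> list[str]:
    """One fusion pass: merge adjacent blocks whose densities fall below THETA_MAX."""
    merged, i = [], 0
    while i < len(blocks):
        current = blocks[i]
        while (i + 1 < len(blocks)
               and text_density(current) < THETA_MAX
               and text_density(blocks[i + 1]) < THETA_MAX):
            current = current + " " + blocks[i + 1]   # fuse with the next block
            i += 1
        merged.append(current)
        i += 1
    return merged
```

In the real algorithm this pass is repeated until no more blocks can be merged, which is what makes ϑmax act as a granularity parameter.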
3.3 VIPS (Vision-based Web Page Segmentation)

The VIPS algorithm [5] was designed to segment Web pages as a human would do it. It thus analyses the rendered version of a web page. It first builds a vision-based content structure, which analyses the page with visual cues present in the rendered page instead of the HTML source code. This structure is built by splitting a page into a 3-tuple consisting of a set of visual blocks, a set of separators, and a function that describes the relationship (shared separators) between each pair of blocks of the page. Separators are, for example, vertical and horizontal lines, images similar to lines, headers and white space. The structure is built by going top-down through the DOM tree and taking both the DOM structure and the visual information (position, color, font size) into account. Separators are detected visually by splitting the page around the visual blocks so that no separator intersects with a block; weights are then assigned to the separators according to certain predefined heuristic rules. From the visual blocks and the separators, the vision-based content structure of the page can then be assembled, using the Degree of Coherence (DoC) of each block as the granularity.

3.4 jVIPS (Java VIPS)

jVIPS [13] is another implementation of the VIPS model proposed by Cai [5]. Hence, the granularity parameter is the same as for VIPS: the DoC. jVIPS is implemented in Java using the CSSBox rendering engine. The difference between VIPS and jVIPS resides in two of the heuristic rules, the jVIPS version prohibiting the splitting of some blocks that VIPS would split. This implies that jVIPS often generates blocks as wide as the Web page width. This algorithm has been referenced and used in several projects, so it is worth including it in our evaluation.

4. EVALUATION MODEL

The goal of the evaluation model is to compare an automated segmentation of a web page W with the corresponding ground truth, in order to determine its quality. Both segmentations are organized as a non-hierarchical Manhattan layout (cf. Section 2.2); in other words, they are flat segmentations. The model is an adaptation to web pages of the one presented by [17] for scanned page segmentation evaluation. The ground truth is manually designed (we explain in Section 5.3 how it was built for the evaluated collection). The comparison focuses on block geometry and content.

Each block B is associated with its bounding rectangle (B.rect) and two values: the number of HTML elements it covers (B.htmlcover) and the text it covers (B.textcover) in the original page W. Note that in this section, all the rectangles are modelled as quadruples (x, y, h, w), where x and y are the coordinates of the origin point and h and w are the height and the width of the rectangle.

A segmentation W′ of W is defined as follows:

    W′ = (Page, granularity)

where Page is a special block that represents the whole page and granularity is a parameter that affects the size of the rectangles in the segmentation. Page is defined as follows:

    Page = (rect, htmlcover, textcover, {Block})

where {Block} is the set of blocks that form the segmentation, such that

    ∀ b ∈ {Block}, b.rect ⊂ Page.rect

The quality of a segmentation can be measured in two complementary ways:
• Block correspondence: measures how well the blocks of the computed segmentation match the ones of the ground truth.
• Text covering: measures to which extent the global content (here expressed as the number of words) of the blocks is the same as the content of the page.
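To make the notation concrete, the model of this section could be represented with the following data structures. This is only an illustrative sketch: the class names Rect, Block and Segmentation are ours, while the field names follow the notation above.

```python
from dataclasses import dataclass, field

@dataclass
class Rect:
    """Rectangle as the quadruple (x, y, h, w) used throughout this section."""
    x: float
    y: float
    h: float
    w: float

@dataclass
class Block:
    rect: Rect
    htmlcover: int      # number of HTML elements covered by the block
    textcover: int      # number of words covered by the block

@dataclass
class Segmentation:
    """W' = (Page, granularity): a special page block plus its set of blocks."""
    page: Block
    granularity: float
    blocks: list[Block] = field(default_factory=list)
```

A computed segmentation and the ground truth of the same page are then two such Segmentation instances that the measures below compare.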
4.1 Measuring block correspondence
The block correspondence indicates whether the block rectangles of a segmentation match those of the ground truth.

Consider two segmentations for a page W: a computed one W′P (denoted P in the rest of the section), and the ground truth W′G (denoted G). Figures 1(a) and (b) give an example of G and P, respectively.

To compute the block correspondence, we build a weighted bipartite graph called the block correspondence graph (BCG), as follows. As seen in Figure 1(c), the nodes of the BCG are the blocks of P and of G. An edge is added between a couple of nodes ni and nj such that the weight w(ni, nj) of the edge is equal to the number of underlying HTML elements and words in the intersection of the regions covered by the rectangles of the two corresponding blocks. If the block rectangles do not overlap in P and G, no edge is added. Thus, the algorithm that builds the BCG is the following:
Data: nodes ni ∈ G, nj ∈ P
Result: edge (ni, nj) and its weight (if applicable)
if ni.rect is contained in nj.rect then
    create edge (ni, nj);
    w(ni, nj) = ni.htmlcover + ni.textcover;
else if ni.rect contains nj.rect then
    create edge (nj, ni);
    w(ni, nj) = nj.htmlcover + nj.textcover;
else
    /* no edge is created */
    w(ni, nj) = 0;
end
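Assuming the Block and Rect structures sketched in the previous section, the construction of the BCG could be expressed as follows. This is a simplified transcription of the rule above (only nested rectangles create an edge), not the framework's actual code; contains() and build_bcg() are hypothetical helpers.

```python
def contains(outer: 'Rect', inner: 'Rect') -> bool:
    """True if the inner rectangle lies inside the outer one."""
    return (outer.x <= inner.x and outer.y <= inner.y
            and inner.x + inner.w <= outer.x + outer.w
            and inner.y + inner.h <= outer.y + outer.h)

def build_bcg(ground_truth: list['Block'], proposed: list['Block']) -> dict:
    """Weighted edges of the block correspondence graph, keyed by (i, j) node pairs."""
    edges = {}
    for i, g in enumerate(ground_truth):
        for j, p in enumerate(proposed):
            if contains(p.rect, g.rect):      # ground-truth block fits inside a P block
                edges[(i, j)] = g.htmlcover + g.textcover
            elif contains(g.rect, p.rect):    # proposed block fits inside a G block
                edges[(i, j)] = p.htmlcover + p.textcover
            # otherwise: rectangles do not nest, no edge is created
    return edges
```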
If the computed segmentation P fits perfectly with the ground-truth segmentation G, then the BCG is a perfect matching, that is, each node in the two parts of the graph has exactly one incident edge. If there are differences between the two segmentations, nodes of P or G may have multiple edges. If there is more than one edge incident to a node n in G (resp. in P), n is considered oversegmented (resp. undersegmented). Using these definitions, we can introduce several measures for evaluating the correspondence of a web page segmentation algorithm.

Intuitively, if all blocks of G are in P, the algorithm performs well. If one set of blocks of G is grouped into one block of P, or if one block of G is divided into several blocks of P, then there is an issue with respect to the granularity, but no error. We consider it a segmentation error if one block of the ground truth is not found in the computed segmentation, or if there are blocks that were "invented" by the algorithm.

The measures for block correspondence are defined as follows:

1. Total correct segmentation (Tc). The total number of one-to-one matches between P and G. A one-to-one match is defined by a couple of nodes (ni, nj), ni in P, nj in G, such that w(ni, nj) ≥ tr, where tr is a threshold that defines how well a detected block must match to be considered correct. For instance, in Fig. 1, there is an edge between node 2 and node B and another one between node 2 and node C. However, as the weight w(2, C) is less than tr and the weight w(2, B) is greater than tr, B is considered a correct block. The metric value for the example is Tc = 2. Tc is the main metric for measuring the quality of a segmentation.

2. Oversegmented blocks (Co). The number of G nodes having more than one edge. This metric measures how much a segmentation produced blocks that are too small. However, those small blocks fit inside a block of the ground truth. In the example of Fig. 1, node 6 of the ground truth is oversegmented in the proposed segmentation. The metric value for the example is Co = 2, because nodes 6 and 2 are both oversegmented.

3. Undersegmented blocks (Cu). The number of P nodes having more than one edge. The same as above, but for big blocks in which blocks of the ground truth fit. For instance, in Fig. 1, node D of the proposed segmentation is undersegmented with respect to the ground truth, and the value of the metric is Cu = 1.

4. Missed blocks (Cm). The number of G nodes that have no match with any node in P. This metric measures how many blocks of the ground truth are not detected by the segmentation. One example is node 3 in Fig. 1, and the value of the metric is Cm = 1.

5. False alarms (Cf). The number of P nodes that have no match with any node in G. This metric measures how many blocks are "invented" by the segmentation. For instance, in Fig. 1, node I has no correspondent in the ground truth, making the metric value Cf = 1.

Tc is a positive measure; Cm and Cf are negative measures. Co and Cu are "something in the middle", as they count "not too serious" errors: the found blocks could match the ground truth if they were aggregated or split. Note that the defined measures cover all the possible cases when considering the matching between G and P.

To evaluate the quality of the segmentation we define a score Cq as the total number of acceptable blocks discovered, i.e. Cq = Tc + Co + Cu.

4.2 Measuring text coverage

The intuitive idea of evaluating the covering is to know whether some content of the original page is not taken into account by the segmentation. The covering of a segmentation W′ is given by the Textcover function, which returns the proportion of the words of W that are covered by the blocks of W′:

    Textcover(W′) = ( Σ b∈{Block} b.textcover ) / Page.textcover

More complex functions can be used to measure the text coverage, but this is left for future work.

4.3 Normalization

In order to compare two segmentations, we need to normalize the rectangles and the granularities. Given a segmentation W′, its normalized version fits in an ND × ND square, where ND is a fixed value. In our experiments, we fixed this value to 100. Thus W′ has the following new property Nrect (normalized rectangle):

    W′.Page.Nrect = {0, 0, ND, ND}
Figure 1: (a) Ground-truth segmentation. (b) Proposed segmentation. (c) BCG.

Each block rectangle is then normalized according to the stretch ratio of the page, i.e.

    ∀ b ∈ W′.Blocks, b.Nrect.x = (ND × b.rect.x) / W′.Page.rect.w

The other values of the block rectangle (y, w and h) are normalized in the same way. This allows for defining the normalized area of a block as

    b.Narea = b.Nrect.h × b.Nrect.w = (b.rect.h × b.rect.w) × ND² / page_area

where

    page_area = W′.Page.rect.w × W′.Page.rect.h
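Under the same (x, y, h, w) convention and with ND = 100, this normalization could be transcribed as the following sketch; normalize_rect and normalized_area are names introduced only for the example.

```python
ND = 100  # side of the normalized square

def normalize_rect(block_rect, page_rect):
    """Scale a (x, y, h, w) block rectangle into the ND x ND square."""
    x, y, h, w = block_rect
    _, _, page_h, page_w = page_rect
    return (ND * x / page_w,
            ND * y / page_h,
            ND * h / page_h,
            ND * w / page_w)

def normalized_area(block_rect, page_rect):
    """Normalized area, equivalent to (h * w) * ND^2 / page_area."""
    _, _, nh, nw = normalize_rect(block_rect, page_rect)
    return nh * nw
```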
5. SETUP

Our evaluation framework allows running different web page segmentation algorithms on a collection of web pages and measuring their correctness, as defined in Section 4. Four algorithms are tested, adapted in such a way that it is possible to extract the page, the block geometries and the word counts. At a glance, the framework takes a URL (the page to be segmented) and a granularity, and produces one score for the covering and five for the correspondence, using the ground truth, as described in Section 4.

5.1 Adaptation of Algorithms

We adapted the implementations of the tested algorithms in order to get the information needed for the comparison: the rectangles of the page (Block.rect), the number of HTML elements (Block.htmlcover), and the word count of the whole page (Page.textcover) and of each block (Block.textcover). For the algorithms whose source code was available (BoM, BlockFusion and jVIPS), the adaptation was made on the source code. For VIPS, the adaptation was made on the output.

5.1.1 BoM

As BoM is implemented in Javascript (https://github.com/asanoja/web-segmentation-evaluation/tree/master/chrome-extensions/BOM), it was very straightforward to adapt it for evaluation by adding a new custom Javascript function. The information is extracted from the terminal blocks of the segmentation tree. Rectangles are taken from these blocks, and the values from the DOM elements associated with them.

5.1.2 BF

There is an implementation of this algorithm included in the BoilerPipe application (https://code.google.com/p/boilerpipe/). As BF is text-oriented, we had to modify the original BoilerPipe source code to get the rectangles and their content values. The strategy used was to pre-process the input page: an input page is traversed using a browser and, for each of its elements, a geometry attribute is added. This attribute contains the rectangle dimensions and the element's (recursive) word count. The outcome is a set of rectangles represented by the TextBlocks produced by the algorithm using the ARTICLE extractor of BF.

5.1.3 VIPS

There is an implementation of this algorithm in the form of a Dynamic Link Library (DLL) (http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html). As VIPS is implemented as a DLL, we chose the Microsoft Visual Basic.NET development environment to build a wrapper in order to obtain the information needed.
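The pre-processing strategy described for BF in Section 5.1.2 (annotating every element with its rectangle and recursive word count before the text-oriented algorithm runs) can be approximated with a browser-side script. The sketch below is only an illustration, not the code used in the framework: it drives Chrome through Selenium and injects a hypothetical geometry attribute into each element.

```python
from selenium import webdriver

ANNOTATE_JS = """
document.querySelectorAll('body *').forEach(function (el) {
  var r = el.getBoundingClientRect();
  var words = (el.innerText || '').trim().split(/\\s+/).filter(Boolean).length;
  // hypothetical attribute, read back later by the adapted algorithm
  el.setAttribute('geometry',
      [r.left, r.top, r.height, r.width, words].join(','));
});
"""

def annotate_page(url: str) -> str:
    """Render the page, add a geometry attribute to each element, return the HTML."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        driver.execute_script(ANNOTATE_JS)
        return driver.page_source
    finally:
        driver.quit()
```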
5.1.4 jVIPS

With jVIPS, obtaining the required data for evaluation was straightforward because the source code is publicly available (https://github.com/tpopela/vips_java). When the visual content structure is completed (the predefined granularity is met), the rectangles, HTML element counts and word counts are obtained from the corresponding blocks.

5.2 Collections, Crawling and Rendering

A segmentation repository holds the offline versions of Web pages, together with their segmentations (including the ground truth), organized in collections. Within a collection, each page is rendered with different rendering engines and at different granularity values. To each quadruple (page, render engine, algorithm, granularity) corresponds a segmentation performed on that page, rendered by that engine, using that algorithm at that granularity. For the work presented in this paper, we built the GOSH (GOogle SearcH) collection (http://www-poleia.lip6.fr/~sanojaa/BOM/inventory/) described below.

5.2.1 GOSH Collection

Web pages in this collection are selected by their "functional" type, or category. This selection is based on the categorization made by Brian Solis [18], "The Conversation Prism", which depicts the social media landscape from an ethnographic point of view. In this work, we considered the five most common of these categories, namely Blog, Forum, Picture, Enterprise, and Wiki. For each category, a set of 25 sites has been selected using Google search to find the pages with the highest PageRank. Within each of those sites, one page is crawled (https://github.com/asanoja/web-segmentation-evaluation/tree/master/dataset).

5.2.2 Rendering

Several rendering engines are used. They are encapsulated using Selenium WebDriver (http://docs.seleniumhq.org/projects/webdriver/). Although Selenium can handle several browser engines, only Chrome and Internet Explorer are used in the present work.

5.2.3 Collection post-processing

A Web page rendered with different engines may result in differences in the display. The most common case is white space between the window borders and the content. We must ensure that all renders of the same Web page have the same dimensions. For that reason, we check for the above-mentioned white space and remove it.
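As an illustration of the rendering step, a page could be loaded at a fixed window size and its rendered dimensions collected as in the following sketch; the viewport value and the render() helper are assumptions made for the example, and only the Chrome driver is shown.

```python
from selenium import webdriver

VIEWPORT = (1280, 960)   # assumed fixed window size so renders stay comparable

def render(url: str, driver_factory=webdriver.Chrome):
    """Load a page in a browser window of fixed dimensions and report its size."""
    driver = driver_factory()
    try:
        driver.set_window_size(*VIEWPORT)
        driver.get(url)
        # page dimensions after rendering; white margins can be checked against these
        width = driver.execute_script("return document.documentElement.scrollWidth;")
        height = driver.execute_script("return document.documentElement.scrollHeight;")
        return {"url": url, "width": width, "height": height,
                "html": driver.page_source}
    finally:
        driver.quit()
```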
5.3 Ground truth construction

The human assessor selects a set of elements that compose a block. Then we must deduce the bounding rectangle of the block and compute its word count. This is a time-consuming and error-prone task. To speed up the process we have developed the tool MoB (Manual design of Blocks, https://github.com/asanoja/web-segmentation-evaluation/tree/master/chrome-extensions/MOB). It assists human assessors in selecting the elements that form a block and automatically extracts all the information needed.

The ground truth was assessed by human assessors in the laboratory. However, it is planned to crawl a bigger set of pages and to include other assessors in this task.

6. EXPERIMENTS AND EVALUATION

In this section, we present the results of evaluating the four segmentation algorithms described in Section 3. The algorithms were evaluated on the GOSH collection using the measures defined in Section 4. These measures evaluate different aspects of a segmentation algorithm for a given quadruple (page, render engine, algorithm, granularity).

6.1 Setting the granularity

The accuracy of the measures directly depends on the way the ground truth is built. If the human assessors used a given granularity in the ground truth, the granularity parameter of each algorithm needs to be adjusted accordingly. In the present work, our goal is to detect blocks of medium size. We focus neither on detecting only large blocks, such as the header, menu, footer and content, nor on detecting blocks at too high a level of detail (sentences, links or single images). Instead, we focus on detecting parts of the page that represent significant pieces of information, such as a blog post, a table of contents, an image and its caption, a set of images, a forum response, and so forth. This is a more challenging task for segmentation algorithms. Thus, in the following experiments, the granularity was set so that each algorithm produces medium-size blocks.

6.2 Setting the thresholds

Setting the relative threshold tr is not so obvious, as the notion of "good block" is quite subjective. In this paper, we fixed tr to 0.1, as we observed on a significant number of examples that it corresponds to our notion of a good block. In the future, we plan to perform supervised machine learning with a large number of users to determine the right value. Each user will annotate the segmentation blocks with the corresponding block in the ground truth if (s)he thinks that the blocks sufficiently match.

Because rendering engines may produce small differences in their rendering, we introduce a geometric tolerance tt to help in the comparison of the rectangles. The value of this parameter is fixed based on experience in working with the collection and is category-dependent. In general, block rectangles do not differ by more than ±5 pixels. For the whole collection, the best value was found to be 2 pixels.

6.3 Computing block correspondence

We computed the different metrics for block correspondence, as defined in Section 4.1. Table 1 shows the scores (averages of the metrics over all the documents of the collection) obtained by the different algorithms on the global collection.
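For reference, the correspondence metrics of Section 4.1 and the text coverage of Section 4.2 can be derived from the BCG with a few counts. The sketch below assumes the edge dictionary and Block structures sketched earlier, with edge weights already on the same scale as the relative threshold tr; the handling of nodes that have both strong and weak edges (such as node 2 in Fig. 1) is one possible reading of the definitions above.

```python
from collections import Counter

def correspondence_metrics(edges: dict, n_g: int, n_p: int, t_r: float) -> dict:
    """Tc, Co, Cu, Cm, Cf and Cq from the weighted BCG edges.

    edges maps (i, j) pairs (ground-truth node i, proposed node j) to weights.
    """
    g_degree = Counter(i for (i, _) in edges)   # edges per ground-truth node
    p_degree = Counter(j for (_, j) in edges)   # edges per proposed node

    strong = {(i, j) for (i, j), w in edges.items() if w >= t_r}
    g_strong = Counter(i for (i, _) in strong)
    p_strong = Counter(j for (_, j) in strong)
    # one-to-one matches: the unique above-threshold edge of both endpoints
    tc = sum(1 for (i, j) in strong if g_strong[i] == 1 and p_strong[j] == 1)

    co = sum(1 for i in range(n_g) if g_degree[i] > 1)    # oversegmented G blocks
    cu = sum(1 for j in range(n_p) if p_degree[j] > 1)    # undersegmented P blocks
    cm = sum(1 for i in range(n_g) if g_degree[i] == 0)   # missed G blocks
    cf = sum(1 for j in range(n_p) if p_degree[j] == 0)   # false alarms in P
    return {"Tc": tc, "Co": co, "Cu": cu, "Cm": cm, "Cf": cf, "Cq": tc + co + cu}

def text_coverage(blocks, page) -> float:
    """Proportion of the page's words covered by the blocks (Section 4.2)."""
    return sum(b.textcover for b in blocks) / page.textcover if page.textcover else 0.0
```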
Algorithm   Tc     Co     Cu     Cm     Cf
BF          0.98   0.40   0.70   4.07   0.52
BoM         2.99   1.41   0.79   0.99   0.86
jVIPS       1.29   1.42   0.97   1.75   5.46
VIPS        1.36   0.96   0.89   1.88   1.73

Table 1: Correspondence metrics for the global collection

Several observations can be made:
• BoM obtains the best overall result for Tc, as it is more accurate. It produces very few serious errors (Cm, Cf) with respect to the other algorithms, but could be improved in terms of granularity, as indicated by its high values for Co and Cu.
• BF obtains the worst result for Tc, but with a low level of false alarms. In other words, BF does not detect all the correct blocks (mainly, it misses the blocks that are not located in the center of the page), but the blocks it detects are good, with a rather good granularity. This is mainly due to the fact that BF uses the text density for determining blocks. As the blocks on the sides of the page have a low text density, it is hard for BF to detect them.
• VIPS and jVIPS have comparable results in terms of correct blocks and missed blocks. However, jVIPS generates a lot of false alarms. This is due to specific heuristic rules used in jVIPS that tend to detect blocks as wide as the page width. This is suitable for blocks like headers or footers, but not for the content located in the center of the page.

In order to study the adequacy between segmentation algorithms and Web page categories, Figure 2 shows the quality of the segmentation, represented by the average values of the metric Cq described in Section 4.1, for the five Web page categories mentioned above. Each algorithm is represented by a colored bar, and the dashed lines are averages over the whole collection: the AVG line represents the average number of correct blocks, while the TAVG line represents the average number of expected blocks in the ground truth.

Figure 2: Total correct blocks by categories

We make the following observations:
• The best results are obtained for the Picture category, probably because picture pages have a regular and simple structure. This observation also holds, though attenuated, for the Enterprise category.
• The worst results are obtained for the Forum category. The reason is probably that forum pages consist of several question/answer blocks, each of them having a complex structure (including avatars, email addresses, and so on) which is not easy for algorithms to detect.
• BF does perform well for Forum. As those kinds of pages contain many text (question/response) blocks, the text density is sufficient to detect most of them, but not the blocks surrounding the main content.
• jVIPS has problems with the Picture category. Those pages do not have headers and footers, i.e. blocks that occupy the whole width of the page. Instead, they have many small blocks that jVIPS cannot detect.

6.4 Computing text coverage

We computed the text coverage as defined in Section 4.2. Table 2 gives the (rounded) values for the whole collection and for each category of pages.

Algorithm   all   forum   blog   wiki   picture   enterprise
BF          56    42      85     91     14        37
BoM         69    75      87     91     61        71
jVIPS       86    100     80     98     87        92
VIPS        95    96      95     94     95        95

Table 2: Coverage values for each algorithm

The first observation is that the coverages obtained by all the algorithms are quite high. This means that each of them is able to perform the basic task of text extraction. It appears that BF does not perform well for Picture and Enterprise. However, this is mainly due to the fact that, for those categories, BF misses a lot of blocks (as seen above), and thus misses their content.

7. CONCLUSION

In this paper, we present a framework for evaluating and comparing Web page segmentation algorithms. To the best of our knowledge, this is the first work that focuses on the intrinsic properties of a segmentation, which are document layout, content and block geometry. Existing approaches do include an evaluation, but they are driven by specific applications and thus are not generic enough to compare all the segmentation algorithms. Our approach is based on a ground truth, built thanks to a tool (MoB) we developed that substantially eases the manual design of a segmentation. Our dataset contains 125 pages, covering five categories (25 pages per category).

We present an evaluation model that defines several useful metrics for the evaluation. One metric is devoted to the text extraction task; the other ones compute how well the blocks detected by a given algorithm match the ones of the ground truth.

We use this model for evaluating and comparing four segmentation algorithms, adapted in order to fit into our framework. The results show that the algorithms perform reasonably well for extracting text from pages. With respect to geometric block detection, results depend on the category of the pages considered. For instance, VIPS performs well for the Forum and Wiki categories, while it is much weaker for the three other ones. BoM seems to give the best overall results. It outperforms the other algorithms for the Blog, Wiki and, most importantly, Forum categories.

There are many directions for future work. First, we plan to use machine learning (ML) techniques for learning the relative threshold tr.
We also plan to use ML for discovering new relevant score functions, based on the feedback of users giving manual scores to segmentations from a training set. Second, we will continue to experiment with segmentation algorithms on more pages and more page categories. Our aim is to develop a complete evaluation framework in order to help users choose the best segmentation algorithm depending on their application and on the category of pages they manipulate. Of course, as the results show that some algorithms have problems with some categories, the framework can also be used to help improve the efficiency of segmentation algorithms for those categories.

Third, we plan to evaluate the segmentation algorithms with respect to the type of task that uses the segmentation. Task types include Web entity extraction, layout detection, boilerplate detection, visualization on small-screen devices and, in the context of digital libraries, optimization of Web archive crawling and change detection between web page versions, among others. This implies defining scripts that perform the task (including calls to segmentation) and defining new ad hoc metrics for each task.

Finally, we will work on enhancing the model. We would like to include block importance in the evaluation model, so that algorithms that detect important blocks get a better score. Also, we would like to define a generic model for web page segmentation that can express all the existing approaches. This would allow for an analytic evaluation of segmentation algorithms.

8. REFERENCES

[1] Abiteboul, S.: Querying semi-structured data. In: Afrati, F.N., Kolaitis, P.G. (eds.) Database Theory - ICDT '97, 6th International Conference, Delphi, Greece, January 8-10, 1997, Proceedings. Lecture Notes in Computer Science, vol. 1186, pp. 1–18. Springer (1997)
[2] Asakawa, C., Takagi, H.: Annotation-based transcoding for nonvisual web access. In: Proceedings of the Fourth International ACM Conference on Assistive Technologies. pp. 172–179. Assets '00, ACM, New York, NY, USA (2000), http://doi.acm.org/10.1145/354324.354588
[3] Baluja, S.: Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework. In: Proceedings of the 15th International Conference on World Wide Web. pp. 33–42. ACM (2006)
[4] Breuel, T.M.: Representations and metrics for off-line handwriting segmentation. In: Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on. pp. 428–433. IEEE (2002)
[5] Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: Extracting content structure for web pages based on visual representation. In: APWeb 2003. LNCS, vol. 2642, pp. 406–417. Springer (2003)
[6] Cattoni, R., Coianiz, T., Messelodi, S., Modena, C.: Geometric layout analysis techniques for document image understanding: a review. ITC-irst Technical Report 9703(09) (1998)
[7] Chakrabarti, D., Kumar, R., Punera, K.: A graph-theoretic approach to webpage segmentation. In: Proceedings of the 17th International Conference on World Wide Web. pp. 377–386. ACM (2008)
[8] Chen, Y., Xie, X., Ma, W.Y., Zhang, H.J.: Adapting web pages for small-screen devices. IEEE Internet Computing 9(1), 50–56 (2005)
[9] Hu, J., Kashi, R., Wilfong, G.: Document image layout comparison and classification. In: ICDAR '99, Proceedings of the Fifth International Conference on Document Analysis and Recognition. pp. 285–288 (Sep 1999)
[10] Kohlschütter, C., Nejdl, W.: A densitometric approach to web page segmentation. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. pp. 1173–1182. ACM (2008)
[11] Kreuzer, R.: A Quantitative Comparison of Semantic Web Page Segmentation Algorithms. Master's thesis, Universiteit Utrecht (2013)
[12] Pehlivan, Z., Saad, M.B., Gançarski, S.: Vi-diff: Understanding web pages changes. In: DEXA (1). pp. 1–15 (2010)
[13] Popela, T.: Implementace algoritmu pro vizualni segmentaci WWW stranek (Implementation of an algorithm for visual segmentation of web pages). Master's thesis, Brno University of Technology (2012)
[14] Saad, M.B., Gançarski, S.: Using visual pages analysis for optimizing web archiving. In: Proceedings of the 2010 EDBT/ICDT Workshops. p. 43. ACM (2010)
[15] Saad, M.B., Gançarski, S.: Archiving the web using page changes patterns: a case study. Int. J. on Digital Libraries 13(1), 33–49 (2012)
[16] Sanoja, A., Gançarski, S.: Block-o-Matic: A web page segmentation framework. In: International Conference on Multimedia Computing and Systems (ICMCS'14). Marrakesh, Morocco (2014)
[17] Shafait, F., Keysers, D., Breuel, T.: Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(6), 941–954 (2008)
[18] Solis, B.: The conversation prism (2014), https://conversationprism.com/
[19] Tang, Y.Y., Suen, C.Y.: Document structures: a survey. International Journal of Pattern Recognition and Artificial Intelligence 8(05), 1081–1111 (1994)
[20] Xiao, Y., Tao, Y., Li, Q.: Web page adaptation for mobile device. In: Wireless Communications, Networking and Mobile Computing, 2008. WiCOM '08. 4th International Conference on. pp. 1–5 (2008)
[21] Yesilada, Y.: Web page segmentation: A review. Tech. rep., University of Manchester and Middle East Technical University Northern Cyprus Campus (2011)
[22] Zhang, Y., Gerbrands, J.: Objective and quantitative segmentation evaluation and comparison. Signal Processing 39(1), 43–54 (1994)