Unsupervised Learning of Tree Alignment Models for Information Extraction
Philip Zigoris, Damian Eads, and Yi Zhang
Department of Computer Science
University of California, Santa Cruz
1156 High Street
Santa Cruz, CA 95064
{zigoris,eads,yiz}@soe.ucsc.edu
Abstract
We propose an algorithm for extracting fields from HTML search results. The output of the algorithm is a database table, a data structure that better lends itself to high-level data mining and information exploitation. Our algorithm effectively combines tree and string alignment algorithms, as well as domain-specific feature extraction, to match semantically related data across search results. The applications of our approach are vast and include hidden web crawling, semantic tagging, and federated search. We build on earlier research on the use of tree alignment for information extraction. In contrast to previous approaches that rely on hand-tuned parameters, our algorithm makes use of a variant of Support Vector Machines (SVMs) to learn a parameterized, site-independent tree alignment model. This model can then be used to deduce common structural and textual elements of a set of HTML parse trees. We report some preliminary results of our system's performance on data from websites with a variety of different layouts.
1 Introduction
There is a proliferation of research in the field of Knowledge Discovery in Databases (KDD) aiming to derive high-value conclusions from information stored in a database [7, 9]. Many of the advances in the field rely on the presence of highly structured data. Unfortunately, a vast amount of content on the Internet is in the form of semi-structured HTML search results, making the data unusable to many algorithms. The field of information extraction (IE) tries to address this problem by developing tools for transforming semi-structured text into highly structured database content [17, 1, 11, 2, 18, 6, 3]. Thus, successes in IE will enable the exploitation of a broad class of KDD and Data Mining (DM) algorithms on the largest source of information in the world: the Internet.

The main intuition driving our approach to information extraction is that search results will often contain a high degree of repetition and this indirectly yields information about the structure of the data. In order to identify the repetitive elements we use a variety of parameterized tree alignment models. Simply put, a tree alignment model assigns a cost of pairing vertices from two trees. By finding a minimum cost pairing between vertices we presumably identify the common structural, and possibly textual, elements of the trees. These models are introduced formally in Section 2.

One of the major contributions of our work is an unsupervised method for learning the tree alignment parameters. With well-tuned parameters these models are resilient to structural variation in dynamic HTML returned by the web site, and a well-informed alignment model will often have too many parameters to effectively tune by hand. Specifically, we explore the use of Support Vector Machines (SVM), a popular machine learning algorithm, for learning these parameters. The details of our approach are presented in Section 3.

The other novel aspect of our work is a simple method for generating and representing schema. Rather than inducing a set of rules for processing unseen data, our method works by simply comparing new data against that which has already been seen, analogous to nearest neighbor classification. The details of this method are presented in Section 4. In Section 5 we present a preliminary evaluation of our work.
2 Tree Edit Distance
This section introduces two ideas which are central to our approach: tree alignments and tree edit distance, both originally due to Tai [15]. They are, respectively, analogous to the well-studied concepts of string alignment and string edit distance. Instead of strings, however, we are concerned with vertex labeled trees, a tree coupled with a labeling function \(l(v)\) that maps vertices to a label. In our work we study HTML parse trees where vertices are labeled with tag identifiers or free text. We will often refer to vertices labeled with text as textual vertices to distinguish them from vertices labeled with HTML tags.

Intuitively, a tree alignment is an association between vertices in two labeled trees, \(T_1\) and \(T_2\). The tree edit distance between the two trees corresponds to the minimal cost of transforming \(T_1\) into \(T_2\). The edit distance provides a measure of similarity between trees (and their subtrees) and the alignment provides a way to identify the common elements of each tree, with respect to both structure and the vertex labeling.

In this work we concern ourselves with rooted ordered labeled trees. This is a special case of labeled trees where a vertex \(r\) is identified as the root and the children of every node have a fixed ordering. We can, therefore, speak of the 'left' and 'right', as well as the \(i\)th, child of a node. Throughout this paper we will assume all trees are rooted, ordered, and labeled. With a fixed ordering and designated root, we can formally define an alignment between trees \(T_1\) and \(T_2\) as a set \(\mathcal{A} \subset T_1 \times T_2\) such that for all \((a,b), (a',b') \in \mathcal{A}\):

\[a = a' \iff b = b',\]
\[a \text{ is to the left of } a' \iff b \text{ is to the left of } b', \text{ and}\]
\[a \text{ is an ancestor of } a' \iff b \text{ is an ancestor of } b'.\]
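The three conditions can be checked mechanically. The following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that each vertex is identified by its path of child indices from the root, so `()` is the root and `(0, 1)` is the second child of the root's first child. The function names are hypothetical.

```python
# Assumption: a vertex id is a tuple of child indices leading from the root.

def is_ancestor(a, b):
    """True if vertex a is a proper ancestor of vertex b."""
    return len(a) < len(b) and b[:len(a)] == a


def is_left_of(a, b):
    """True if a lies strictly to the left of b (neither is an ancestor of the other)."""
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return False  # equal, or one path is a prefix of the other


def is_valid_alignment(pairs):
    """Check the one-to-one, left-to-right, and ancestor conditions on all pairs."""
    pairs = list(pairs)
    for a, b in pairs:
        for a2, b2 in pairs:
            if (a == a2) != (b == b2):
                return False  # some vertex is paired more than once
            if is_left_of(a, a2) != is_left_of(b, b2):
                return False  # sibling order is not preserved (a "crossover")
            if is_ancestor(a, a2) != is_ancestor(b, b2):
                return False  # ancestry is not preserved
    return True


if __name__ == "__main__":
    ok = {((), ()), ((0,), (0,)), ((1,), (1,))}
    crossed = {((0,), (1,)), ((1,), (0,))}
    print(is_valid_alignment(ok), is_valid_alignment(crossed))
```

Path-based ids keep the sketch short because both relations reduce to prefix comparisons; a real system would more likely store preorder and postorder indices on each vertex.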
In other words, an alignment is a list of pairs of nodes, one from each tree, such that each vertex is paired with at most one other vertex and there are no “crossovers”. An example of an alignment between two HTML parse trees is shown in Figure 1.

Associated with an alignment is a set of editing operations for transforming one tree into the other. Every pair \((a,b)\) in the alignment corresponds to a relabeling/copying of a vertex, an operation we denote by \(R(a,b)\). A vertex \(a \in T_1\) that does not occur in the alignment corresponds to a deletion, denoted by \(D(a)\). Similarly, a vertex \(b \in T_2\) not occurring in the alignment corresponds to an insertion, denoted by \(I(b)\).

In order to discuss tree edit distance we require a function \(c_\theta\) that assigns a cost to each operation. We will often refer to the parameter \(\theta\) as the operation costs, although the terminology is somewhat imprecise. It is natural, given \(c_\theta\), to define the cost of an alignment, \(C_\theta(\mathcal{A})\), as the sum of costs of its associated operations:

\[C_\theta(\mathcal{A}) = \sum_{(a,b) \in \mathcal{A}} c_\theta(R(a,b)) \;+ \sum_{\substack{a \in T_1 : \\ \nexists\, b \in T_2,\ (a,b) \in \mathcal{A}}} c_\theta(D(a)) \;+ \sum_{\substack{b \in T_2 : \\ \nexists\, a \in T_1,\ (a,b) \in \mathcal{A}}} c_\theta(I(b))\]
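As an illustration of this sum, the sketch below computes the cost of a given alignment. It assumes a three-parameter cost (copy, relabel, insert/delete) as a stand-in for \(c_\theta\); the dictionary keys, the vertex ids, and the example trees are all hypothetical and are not taken from the paper's data.

```python
def alignment_cost(alignment, labels1, labels2, theta):
    """Sum of operation costs for an alignment.

    alignment: set of (a, b) vertex-id pairs
    labels1, labels2: dict mapping vertex id -> label, one per tree
    theta: dict with the assumed keys "copy", "relabel", "insdel"
    """
    aligned1 = {a for a, _ in alignment}
    aligned2 = {b for _, b in alignment}
    cost = 0.0
    # R(a, b): copy if the labels agree, relabel otherwise
    for a, b in alignment:
        same = labels1[a] == labels2[b]
        cost += theta["copy"] if same else theta["relabel"]
    # D(a): vertices of the first tree that are not aligned
    cost += theta["insdel"] * sum(1 for a in labels1 if a not in aligned1)
    # I(b): vertices of the second tree that are not aligned
    cost += theta["insdel"] * sum(1 for b in labels2 if b not in aligned2)
    return cost


if __name__ == "__main__":
    # Hypothetical records, loosely inspired by the text in Figure 1.
    labels1 = {(): "td", (0,): "b", (0, 0): "Dune", (1,): "Price $9"}
    labels2 = {(): "td", (0,): "b", (0, 0): "Middlesex",
               (1,): "Price $15", (2,): "Free Shipping"}
    alignment = {((), ()), ((0,), (0,)), ((0, 0), (0, 0)), ((1,), (1,))}
    theta = {"copy": 0.0, "relabel": 1.0, "insdel": 1.0}
    print(alignment_cost(alignment, labels1, labels2, theta))  # 3.0
```

In this example two vertices are copied at no cost, two text vertices are relabeled, and "Free Shipping" is inserted, giving a total of 3.0.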
Finally, the tree edit distance between trees \(T_1\) and \(T_2\), \(d_\theta(T_1, T_2)\), is defined to be the minimum cost over all alignments. We will denote by \(\mathcal{A}_\theta(T_1, T_2)\) the minimizing alignment.

Figure 1. An example of an alignment between two HTML parse trees. The rendered HTML text is illustrated at the bottom left of each box. Vertices labeled with HTML tags are shown as ellipses and textual vertices are shown as squares.

Finding the minimum cost alignment can be done in, at best, cubic time with dynamic programming [8, 4]. However, for large trees even cubic running time can be prohibitive. To alleviate this, other work in information extraction has relied on approximate alignment algorithms such as partial tree alignment [17] and restricted top-down mappings [14]. In our work the trees were small enough that resorting to such methods was unnecessary. However, incorporating our methodology into a practical system would probably require their use.
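For small trees, the minimum over all alignments can be computed with a short memoized recursion over forests. The sketch below uses the standard rightmost-root recurrence for ordered tree edit distance; it is not the optimized algorithm of [8, 4] and not necessarily what the authors implemented. Trees are assumed to be nested tuples `(label, (child, child, ...))`, and the three cost callables are hypothetical stand-ins for \(c_\theta\).

```python
from functools import lru_cache


def tree_edit_distance(t1, t2, c_del, c_ins, c_rel):
    """Edit distance between two ordered labeled trees.

    t1, t2: nested tuples (label, children) where children is a tuple of trees
    c_del(label), c_ins(label), c_rel(label1, label2): operation costs
    """

    @lru_cache(maxsize=None)
    def d(f, g):
        # f and g are forests: tuples of trees, processed from the right.
        if not f and not g:
            return 0.0
        if not g:
            label, kids = f[-1]
            return d(f[:-1] + kids, g) + c_del(label)
        if not f:
            label, kids = g[-1]
            return d(f, g[:-1] + kids) + c_ins(label)
        (lv, kv), (lw, kw) = f[-1], g[-1]
        return min(
            # delete the rightmost root of f; its children take its place
            d(f[:-1] + kv, g) + c_del(lv),
            # insert the rightmost root of g
            d(f, g[:-1] + kw) + c_ins(lw),
            # align the two rightmost roots, then their children and the rest
            d(kv, kw) + d(f[:-1], g[:-1]) + c_rel(lv, lw),
        )

    return d((t1,), (t2,))


if __name__ == "__main__":
    unit = lambda label: 1.0
    relabel = lambda a, b: 0.0 if a == b else 1.0
    t1 = ("td", (("b", (("Dune", ()),)), ("Price $9", ())))
    t2 = ("td", (("b", (("Middlesex", ()),)), ("Price $15", ()),
                 ("Free Shipping", ())))
    print(tree_edit_distance(t1, t2, unit, unit, relabel))
```

Memoizing on whole forests is simple but memory-hungry; it is only meant for record-sized trees like those discussed here.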
2.1 Cost functions
The task of information extraction requires a high degree of specificity in the cost function. A field will typically contain different strings with similar semantics (e.g. prices, dates, ISBN). In order for the vertices in a field to align well, the cost function must assign a low cost to aligning strings with similar content. For instance, consider aligning the text “Price $4.99” with “$100”. Despite the large syntactic differences between them, both strings have a similar function (i.e. to convey the price of an item).

In order to study the sensitivity to these issues we developed three different cost functions with varying degrees of specificity. The first, referred to as the Simple cost function, is parameterized by only three numbers: \(\theta_C\) is the cost of copying any vertex, \(\theta_{ID}\) is the cost of inserting/deleting any vertex, and \(\theta_R\) is the cost of relabeling any vertex. The Simple function completely ignores the semantics of the labeling and so aligning “$100” with “$5” will have the same cost as aligning “$100” with “July 4th”.

The second cost function we developed differs from the Simple function only in the cost of aligning textual vertices. We refer to this as the string edit distance or sed cost function and it includes one extra parameter, \(\theta_S\). The cost of aligning two text labeled vertices is \(\theta_S\) times the normalized string edit distance of those two strings.

The third cost function incorporates some simple rules for determining the semantic relationship between two strings of text and so we refer to this as the Semantic cost function. Here the cost of aligning two vertices labeled with strings \(s_1\) and \(s_2\), respectively, is \(\bar{\theta}_S \cdot \phi(s_1, s_2)\) where \(\phi(\cdot,\cdot)\) is a feature vector and \(\bar{\theta}_S\) are the weights associated with each feature. The features include whether or not both strings represent a(n): street, email or web address, date, phone number, or price. In total the Semantic cost function has 24 parameters.
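The text-vertex costs of the sed and Semantic functions might be sketched as follows. Several details are assumptions made only for illustration: the edit distance is normalized by the length of the longer string (the paper does not say how it is normalized), only two of the listed feature types are shown, and the regular expressions are hypothetical stand-ins for the paper's rules.

```python
import re


def levenshtein(s, t):
    """Classic string edit distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or copy
        prev = cur
    return prev[-1]


def normalized_sed(s, t):
    """Assumed normalization: divide by the longer string's length."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


def sed_text_cost(s, t, theta_s):
    """sed cost of aligning two textual vertices."""
    return theta_s * normalized_sed(s, t)


# Hypothetical detectors for two of the feature types named in the text.
PRICE = re.compile(r"\$\s?\d+(?:\.\d{2})?")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def semantic_features(s, t):
    """1.0 for each type that both strings appear to contain, else 0.0."""
    return [float(bool(p.search(s)) and bool(p.search(t)))
            for p in (PRICE, EMAIL)]


def semantic_text_cost(s, t, weights):
    """Dot product of the weight vector with the feature vector."""
    return sum(w * f for w, f in zip(weights, semantic_features(s, t)))


if __name__ == "__main__":
    print(normalized_sed("$100", "$5"))
    print(semantic_features("Price $4.99", "$100"))
    print(semantic_features("$100", "July 4th"))
```

Whether a feature raises or lowers the alignment cost depends on the sign of its learned weight, which is why the weights are left as an input here.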
3 Learning Operation Costs with SVM\(^{struct}\)
In the previous section we presented parameterized tree alignments. Here we present an algorithm for learning these parameters from what is, effectively, unlabeled data. It is an extension of work by Tsochantaridis et al. [16] on using Support Vector Machines (SVMs) for learning structured labels. In their work they outline a very general framework that accommodates settings such as multi-class learning, grammar learning, and learning sequence alignment parameters.

Assume we are given a collection of HTML parse trees \(\mathcal{T}\) that are each labeled with their site of origin \(s \in S\), where \(S\) is the set of web sites (e.g., google.com, amazon.com, soe.ucsc.edu). In our work each parse tree \(T \in \mathcal{T}\) corresponds to one data record from a search results page. Denote by \(\mathcal{T}_s\) all trees originating from site \(s\).

The constraint we impose is that two trees from one site must be closer to one another, in terms of tree edit distance, than to a tree from another website, as illustrated in Figure 2. Note that the labeling, i.e. the site of origin, is provided by the system that fetches search results and so no manual labeling is required.

Figure 2. The points represent records from three different sites (Amazon.com, DVD.com, Godiva.com). The task is to find operation costs that preserve this grouping under tree edit distance. A represents inter-site distance and B represents site width.

Formally, we seek \(\theta\) (the cost function parameters) such that \(\forall s_1, s_2 \in S\) with \(s_1 \neq s_2\), \(\forall T_1, T_1' \in \mathcal{T}_{s_1}\), \(\forall T_2 \in \mathcal{T}_{s_2}\) the following holds:

\[d_\theta(T_1, T_1') \leq d_\theta(T_1, T_2)\]
We refer to the maximum distance between any two trees from site \(s\) as the width of \(s\). We refer to the minimum distance between a tree from site \(s_1\) and a tree from site \(s_2\) as the inter-site distance.

It may be the case that no \(\theta\) satisfies the above constraints. Accordingly, we can relax the constraints by introducing slack variables \(\xi_s\) for \(s \in S\) and penalize a solution by the sum of the slack variables. Our optimization problem becomes

\[\min_{\theta,\ \xi \geq 0} \sum_{s \in S} \xi_s\]

subject to

\[\forall s_1, s_2 \in S,\ s_1 \neq s_2,\ \forall T_1, T_1' \in \mathcal{T}_{s_1},\ \forall T_2 \in \mathcal{T}_{s_2}: \quad d_\theta(T_1, T_1') \leq d_\theta(T_1, T_2) + \xi_{s_1}\]
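For a fixed \(\theta\), the smallest slack that satisfies the relaxed constraints for a site is simply its worst violation. The sketch below computes this, along with site width and inter-site distance, for any distance function. It assumes the constraint is read as "within-site distance minus cross-site distance is at most the slack". The integers standing in for trees and the absolute difference standing in for \(d_\theta\) are purely hypothetical.

```python
from itertools import product


def site_width(trees, dist):
    """Maximum distance between any two trees from one site."""
    return max((dist(t, u) for t, u in product(trees, repeat=2)), default=0.0)


def inter_site_distance(trees1, trees2, dist):
    """Minimum distance between a tree of one site and a tree of another."""
    return min(dist(t, u) for t, u in product(trees1, trees2))


def minimal_slacks(trees_by_site, dist):
    """Smallest non-negative slack per site for a fixed distance function."""
    slacks = {}
    for s1, own in trees_by_site.items():
        worst = 0.0
        for s2, other in trees_by_site.items():
            if s2 == s1:
                continue
            for t1, t1p, t2 in product(own, own, other):
                # violation of d(T1, T1') <= d(T1, T2)
                worst = max(worst, dist(t1, t1p) - dist(t1, t2))
        slacks[s1] = worst
    return slacks


if __name__ == "__main__":
    # Toy stand-ins: "trees" are points on a line, distance is |x - y|.
    dist = lambda x, y: abs(x - y)
    sites = {"A": [0, 4], "B": [1]}
    print(site_width(sites["A"], dist))
    print(inter_site_distance(sites["A"], sites["B"], dist))
    print(minimal_slacks(sites, dist))
```

In the toy data site A is wider than its distance to site B, so A needs positive slack while B needs none.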
Note that \(\xi_s\) corresponds to the maximum inter-site distance between \(s\) and any other site.

In order to introduce a notion of margin we require that the distance function scales linearly with the parameter values. That is,

\[\forall \theta,\ \alpha > 0: \quad d_{\alpha\theta}(a,b) = \alpha\, d_\theta(a,b)\]
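To see when this property holds, suppose (as an assumption for this sketch) that every operation cost is linear in the parameters, so that \(c_{\alpha\theta}(o) = \alpha\, c_\theta(o)\) for each edit operation \(o\). Writing \(C_\theta(\mathcal{A})\) for the cost of an alignment \(\mathcal{A}\) and \(\mathrm{ops}(\mathcal{A})\) for its associated operations, the scaling factor can be pulled out of the minimum because \(\alpha > 0\):

```latex
\begin{align*}
d_{\alpha\theta}(T_1, T_2)
  &= \min_{\mathcal{A}} C_{\alpha\theta}(\mathcal{A})
   = \min_{\mathcal{A}} \sum_{o \in \mathrm{ops}(\mathcal{A})} c_{\alpha\theta}(o) \\
  &= \min_{\mathcal{A}} \alpha \sum_{o \in \mathrm{ops}(\mathcal{A})} c_{\theta}(o)
   = \alpha \min_{\mathcal{A}} C_{\theta}(\mathcal{A})
   = \alpha\, d_{\theta}(T_1, T_2)
\end{align*}
```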
All of the cost functions discussed previously satisfy this requirement. In this way we can specify a unique solution by favoring parameter settings that are small in magnitude. This has been shown to improve generalization error in the case of linear separators [5]. We balance the cost of the slack variables against the parameter magnitude with the parameter \(C\) by adding the term \(\frac{1}{2}\|\theta\|^2\) to the objective function, giving us a quadratic program. The effect of maximizing the margin is to not only maximize the inter-site distances but also minimize the widths of the sites.

The above optimization problems are difficult to solve directly, because the distance between two trees is a func-