function (i.e. to convey the price of an item).

In order to study the sensitivity to these issues we developed three different cost functions with varying degrees of specificity. The first, referred to as the Simple cost function, is parameterized by only three numbers: \(\theta_M\) is the cost of copying any vertex, \(\theta_{ID}\) is the cost of inserting/deleting any vertex, and \(\theta_R\) is the cost of relabeling any vertex. The Simple function completely ignores the semantics of the labeling, so aligning “$100” with “$5” will have the same cost as aligning “$100” with “July 4th”.

The second cost function we developed differs from the Simple function only in the cost of aligning textual vertices. We refer to this as the string edit distance or sed cost function, and it includes one extra parameter, \(\theta_S\). The cost of aligning two text-labeled vertices is \(\theta_S\) times the normalized string edit distance of those two strings.

The third cost function incorporates some simple rules for determining the semantic relationship between two strings of text, and so we refer to this as the Semantic cost function. Here the cost of aligning two vertices labeled with strings \(s_1\) and \(s_2\), respectively, is \(\bar{\theta}_S \cdot f(s_1, s_2)\), where \(f(\cdot, \cdot)\) is a feature vector and \(\bar{\theta}_S\) are the weights associated with each feature. The features include whether or not both strings represent a(n): street, email or web address, date, phone number, or price. In total the Semantic cost function has 24 parameters.
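To make the difference between the first two cost functions concrete, the following is a minimal Python sketch, not the authors' implementation. The class and function names are hypothetical. Two details are assumptions that the text does not specify: the string edit distance is normalized by the length of the longer string, and the Simple function charges the copy cost when two labels are identical and the relabel cost otherwise. The Semantic cost function is left out because its feature set is only partially described.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Unit-cost string edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def normalized_sed(a: str, b: str) -> float:
    """ASSUMPTION: normalize by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


@dataclass
class SimpleCost:
    theta_m: float   # cost of copying any vertex
    theta_id: float  # cost of inserting/deleting any vertex
    theta_r: float   # cost of relabeling any vertex

    def align(self, label_a: str, label_b: str) -> float:
        # ASSUMPTION: identical labels are a copy, anything else a relabel.
        return self.theta_m if label_a == label_b else self.theta_r


@dataclass
class SedCost(SimpleCost):
    theta_s: float = 1.0  # weight on the normalized string edit distance

    def align_text(self, text_a: str, text_b: str) -> float:
        # Only text-labeled vertices use this rule; others fall back to align().
        return self.theta_s * normalized_sed(text_a, text_b)
```

Under these assumptions the Simple function charges the same relabel cost for “$100” vs. “$5” as for “$100” vs. “July 4th”, whereas the sed function charges less for the first pair because the two strings share a prefix.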
3 Learning Operation Costs with SVM\(^{struct}\)
In the previous section we presented parameterized tree alignments. Here we present an algorithm for learning these parameters from what is, effectively, unlabeled data. It is an extension of work by Tsochantaridis et al. [16] on using Support Vector Machines (SVMs) for learning structured labels. In their work they outline a very general framework that accommodates settings such as multi-class learning, grammar learning, and learning sequence alignment parameters.

Assume we are given a collection of HTML parse trees \(\mathcal{T}\) that are each labeled with their site of origin \(s \in S\), where \(S\) is the set of web sites (e.g., google.com, amazon.com, soe.ucsc.edu). In our work each parse tree \(T \in \mathcal{T}\) corresponds to one data record from a search results page. Denote by \(\mathcal{T}_s\) all trees originating from site \(s\).

The constraint we impose is that two trees from one site must be closer to one another, in terms of tree-edit distance, than to a tree from another website, as illustrated in Figure 2. Note that the labeling, i.e. the site of origin, is provided by the system that fetches search results and so no manual labeling is required.

Formally, we seek \(\theta\) (the cost function parameters) such that for all \(s_1, s_2 \in S\) with \(s_1 \neq s_2\), all \(T_1, T_1' \in \mathcal{T}_{s_1}\), and all \(T_2 \in \mathcal{T}_{s_2}\), the following holds:

\[ d_\theta(T_1, T_1') \le d_\theta(T_1, T_2) \]

[Figure 2: clusters of points labeled Amazon.com, DVD.com, and Godiva.com, with distances marked A and B.]
Figure 2. The points represent records from three different sites. The task is to find operation costs that preserve this grouping under tree-edit distance. A represents inter-site distance and B represents site width.

We refer to the maximum distance between any two trees from site \(s \in S\) as the width of \(s\). We refer to the minimum distance between a tree from site \(s_1\) and a tree from site \(s_2\) as the inter-site distance.

It may be the case that no \(\theta\) satisfies the above constraints. Accordingly, we can relax the constraints by introducing slack variables \(\xi_s\) for \(s \in S\) and penalize a solution by the sum of the slack variables. Our optimization problem becomes

\[
\begin{aligned}
\min_{\theta,\ \xi \ge 0} \quad & \sum_{s \in S} \xi_s \\
\text{s.t.} \quad & d_\theta(T_1, T_1') - d_\theta(T_1, T_2) \le \xi_{s_1}
\qquad \forall s_1, s_2 \in S,\ s_1 \neq s_2,\ \ \forall T_1, T_1' \in \mathcal{T}_{s_1},\ \ \forall T_2 \in \mathcal{T}_{s_2}
\end{aligned}
\]

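For a fixed \(\theta\), the smallest feasible slack for each site can be read directly off the constraints: it is the largest violation \(d_\theta(T_1, T_1') - d_\theta(T_1, T_2)\) over that site's trees, floored at zero. The Python sketch below illustrates this together with the width and inter-site distance. It is only an illustration under stated assumptions: the function names are hypothetical, `dist` stands in for \(d_\theta\) with \(\theta\) already fixed, and the toy "trees" are points on a line compared by absolute difference rather than real parse trees.

```python
from itertools import permutations
from typing import Callable, Dict, List, TypeVar

Tree = TypeVar("Tree")
Dist = Callable[[Tree, Tree], float]


def site_width(trees: List[Tree], dist: Dist) -> float:
    """Maximum distance between any two trees of one site."""
    return max((dist(a, b) for a, b in permutations(trees, 2)), default=0.0)


def inter_site_distance(trees_a: List[Tree], trees_b: List[Tree], dist: Dist) -> float:
    """Minimum distance between a tree of one site and a tree of another."""
    return min(dist(a, b) for a in trees_a for b in trees_b)


def site_slacks(trees_by_site: Dict[str, List[Tree]], dist: Dist) -> Dict[str, float]:
    """Smallest xi_s >= 0 satisfying every relaxed constraint for a fixed dist."""
    slacks = {}
    for s1, own in trees_by_site.items():
        worst = 0.0  # slack variables are constrained to be non-negative
        for s2, other in trees_by_site.items():
            if s1 == s2:
                continue
            for t1, t1_prime in permutations(own, 2):
                for t2 in other:
                    worst = max(worst, dist(t1, t1_prime) - dist(t1, t2))
        slacks[s1] = worst
    return slacks


# Toy stand-in data (hypothetical): points on a line instead of parse trees.
toy_sites = {"a": [0.0, 1.0], "b": [5.0, 6.0], "c": [1.5, 4.0]}


def abs_dist(x: float, y: float) -> float:
    return abs(x - y)


toy_slacks = site_slacks(toy_sites, abs_dist)
toy_objective = sum(toy_slacks.values())
```

In the toy data, site "c" is wide and overlaps its neighbours, so it receives the largest slack; the learning problem then searches for the \(\theta\) whose induced distance makes the sum of these slacks as small as possible.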
Note that \(\xi_s\) corresponds to the maximum inter-site distance between \(s\) and any other site.

In order to introduce a notion of margin we require that the distance function scales linearly with the parameter values. That is,

\[ \forall \theta,\ \alpha > 0: \quad d_{\alpha\theta}(a, b) = \alpha\, d_\theta(a, b) \]

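One way to see when this scaling property holds is the short derivation below. It assumes that the tree-edit distance is the cost of the cheapest alignment and that the cost of any single alignment is linear in \(\theta\). The symbols \(\mathcal{Y}(a,b)\) (the set of valid alignments of \(a\) and \(b\)) and \(\Psi(a,b,y)\) (a vector recording how much each parameter contributes to alignment \(y\)) are notation introduced here for illustration only.

```latex
% Assumption: d_theta is a minimum over alignments y of a cost linear in theta.
d_\theta(a,b) \;=\; \min_{y \in \mathcal{Y}(a,b)} \theta \cdot \Psi(a,b,y)
\quad\Longrightarrow\quad
d_{\alpha\theta}(a,b)
  \;=\; \min_{y \in \mathcal{Y}(a,b)} (\alpha\theta) \cdot \Psi(a,b,y)
  \;=\; \alpha \min_{y \in \mathcal{Y}(a,b)} \theta \cdot \Psi(a,b,y)
  \;=\; \alpha\, d_\theta(a,b),
\qquad \alpha > 0 .
```

The factor \(\alpha\) can be pulled out of the minimum only because it is positive, which is why the requirement is stated for \(\alpha > 0\).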
All of the cost functions discussed previously satisfy this requirement. In this way we can specify a unique solution by favoring parameter settings that are small in magnitude. This has been shown to improve generalization error in the case of linear separators [5]. We balance the cost of the slack variables against the parameter magnitude using the parameter \(C\), adding the term \(\frac{1}{2}\|\theta\|^2\) to the objective function, which gives us a quadratic program. The effect of maximizing the margin is to not only maximize the inter-site distances but also minimize the widths of the sites.

The above optimization problems are difficult to solve directly, because the distance between two trees is a func-