

Rutcor Research Report

Logical Analysis of Numerical Data

Endre Boros^a, Peter L. Hammer^a, Toshihide Ibaraki^b, Alexander Kogan^{a,c}

RRR 04-97, February 1997

a RUTCOR, Rutgers University, P.O. Box 5062, New Brunswick, NJ 08903, U.S.A., email: {boros,hammer,kogan}@rutcor.rutgers.edu
b Department of Applied Mathematics and Physics, Graduate School of Engineering, Kyoto University, Kyoto, Japan 606, email: ibaraki@kuamp.kyoto-u.ac.jp
c Department of Accounting and Information Systems, Faculty of Management, Rutgers University, Newark, NJ 07102, U.S.A.

RUTCOR (Rutgers Center for Operations Research), Rutgers University, P.O. Box 5062, New Brunswick, New Jersey 08903-5062. Telephone: 908-445-3804, Telefax: 908-445-5472, Email: rrr@rutcor.rutgers.edu, http://rutcor.rutgers.edu/rrr

Logical Analysis of Numerical Data

Endre Boros, Peter L. Hammer, Toshihide Ibaraki, Alexander Kogan

Abstract. The "Logical Analysis of Data" (LAD) is a methodology developed since the late eighties, aimed at discovering hidden structural information in data sets. LAD was originally developed for analyzing binary data by using the theory of partially defined Boolean functions. An extension of LAD for the analysis of numerical data sets is achieved through the process of "binarization", consisting in the replacement of each numerical variable by binary "indicator" variables, each showing whether the value of the original variable is above or below a certain level. Binarization was successfully applied to the analysis of a variety of real-life data sets. This paper develops the theoretical foundations of the binarization process, studying the combinatorial optimization problems related to the minimization of the number of binary variables. To provide an algorithmic framework for the practical solution of such problems, we construct compact linear integer programming formulations of them. We develop polynomial time algorithms for some of these minimization problems, and prove NP-hardness of others.

Acknowledgements: The authors gratefully acknowledge the partial support by the Office of Naval Research (grants N00014-92-J1375 and N00014-92-J4083).

1 Introduction
1.1 Logical Analysis of Data
The study of numerical data, common to a large variety of disciplines, leads to interesting problems of combinatorial optimization, the solution of which can have an immediate impact on the understanding of the nature of the phenomena under investigation, the factors governing them, and on the ways we can influence their development.

The data we are examining here consists of collections of "observations" represented as points in $\mathbb{R}^d$. These observations can be associated to patients in a medical study, consumers in a marketing analysis, probes in a potential oil field, etc. The components of the vectors, called "attributes", stand for the results of various medical tests, financial parameters, geological measurements, etc. The observations fall into two classes: "positive" ones (e.g. healthy patients, potential buyers, oil rich areas, etc.) and "negative" ones (e.g. sick patients, etc.).

The goal of data analysis is to arrive at logical explanations distinguishing positive and negative observations. It was shown in [10, 14] that, in the case of binary data, this goal can be achieved by the use of partially defined Boolean functions.

The usefulness of logical or Boolean techniques is an important ingredient of many approaches in this field [1, 9, 11, 16, 21, 22]. The classification power of the Boolean-based methodology of the logical analysis of data (LAD) and its applicability to oncology, psychiatry, oil exploration, economic analysis, and other fields are described in [6, 15, 18]. Preliminary results reported in [6] already indicate that the LAD approach not only has high classification power for distinguishing between positive and negative examples, but is also quite successful in extracting underlying structural information from the given data, which is useful in understanding why and how phenomena occur. While the techniques of LAD are entirely different from those of the pioneering works of Mangasarian (e.g. [17]), the two approaches provide strongly complementary and mutually reinforcing results.

The methodology of LAD is extended to the case of numerical data by a process called binarization, consisting in the transformation of numerical (real valued) data to binary (0,1) ones. In this transformation we map each observation $u = (u_A, u_B, \ldots)$ of the given numerical data set to a binary vector $x(u) = (x_1, x_2, \ldots) \in \{0,1\}^n$ by defining, e.g., $x_1 = 1$ iff $u_A \ge \alpha_1$, $x_2 = 1$ iff $u_B \ge \alpha_2$, etc., in such a way that if $u$ and $v$ represent, respectively, a positive and a negative observation point, then $x(u) \ne x(v)$. The binary variables $x_i$, $i = 1, \ldots, n$, associated to the real attributes are called indicator variables, and the real parameters $\alpha_i$, $i = 1, \ldots, n$, used in the above process are called cut points.

The binarization process described above has been included in a current implementation of LAD and has been successfully used in the computational experiments reported in [6]. The object of this paper is to provide a theoretical foundation for the binarization process, with the primary focus on the combinatorial optimization problems related to the minimization of the number of cut points. The paper studies the computational complexity of these problems, develops polynomial time algorithms for some of them, and provides compact integer programming formulations for them.

1.2 Examples
The meaning of a "logical" explanation of numerical data can best be illustrated by an example. Let us consider Table 1, containing a set $S^+$ of positive observations and a set $S^-$ of negative observations having the attributes $A, B, C$. To be more specific, we may think of the phenomenon X as a headache and the attributes $A$, $B$ and $C$ as representing, say, blood pressure, body temperature, and the pulse rate, respectively (the numerical values being normalized appropriately).

Table 1: A numerical data set $(S^+, S^-)$.

  Attributes                  A    B    C
  S+: positive examples     3.5  3.8  2.8
                            2.6  1.6  5.2
                            1.0  2.1  3.8
  S-: negative examples     3.5  1.6  3.8
                            2.3  2.1  1.0

Table 2: A binarization of Table 1.

  Boolean variables         x_A  x_B  x_C
  T: true points              1    1    0
                              0    0    1
                              0    1    1
  F: false points             1    0    1
                              0    1    0

In order to extract a logical explanation from this data set, let us first introduce the following cut points
$$\alpha_A = 3.0, \quad \alpha_B = 2.0, \quad \alpha_C = 3.0$$
for the attributes $A, B, C$, respectively, and then transform the numerical attributes to binary indicator variables by defining $x_A = 1$ iff $u_A \ge \alpha_A$, $x_B = 1$ iff $u_B \ge \alpha_B$, and $x_C = 1$ iff $u_C \ge \alpha_C$ for all points $u = (u_A, u_B, u_C) \in S^+ \cup S^-$. The result of this binarization of Table 1 is given in Table 2.

It would be natural to view Table 2 as a truth table of a partially defined Boolean function (pdBf) over the Boolean variables $x_A, x_B, x_C$. In a pdBf like Table 2, the binary vectors resulting from the numerical vectors in $S^+$ are considered to be true points, and those resulting from $S^-$ to be false points. The sets of true and false points in a pdBf are denoted by $T$ and $F$, respectively. Table 2 represents a partially defined function since some Boolean vectors do not appear in the table. If all Boolean vectors do appear in a table, such a table represents a mapping $\{0,1\}^n \to \{0,1\}$, and is called a Boolean function.

Now consider a pdBf $g = (T, F)$ and a function $f$ defined over the same set of $n$ variables. If $f$ is consistent with $g$, i.e., if all true points (resp. false points) of $g$ are also true points (resp. false points) of $f$, the function $f$ is called an extension of the pdBf $g$. In general, there are many extensions $f$ of a given pdBf $g$. For example, an extension of the pdBf in Table 2 is given by the following disjunctive normal form (DNF):
$$f = x_A x_B \vee \bar{x}_A \bar{x}_B \vee x_B x_C. \qquad (1)$$
We view this as a logical explanation of Table 1; it tells us that the phenomenon X occurs if the values of attributes $A$ and $B$ are both high, or both low, or the values of attributes $B$ and $C$ are both high.

Table 3: Another binarization of Table 1.

  Boolean variables         x_B  x'_C  x''_C
  T: true points              1     0     1
                              0     1     1
                              1     0     1
  F: false points             0     0     1
                              1     0     0

Note: the first and the third vectors of T coincide.

Although exactly one cut point was introduced for each attribute in the example above, there is no reason to prohibit the introduction of more than one, or of no, cut point for some attributes. As another example, let us introduce the following cut points,
$$\alpha_B = 2.0, \quad \alpha'_C = 5.0, \quad \alpha''_C = 1.5,$$
that is, one cut point for the attribute $B$, two cut points for $C$, and none for $A$. Using two cut points for one attribute may be natural; $x'_C = x''_C = 1$ would mean "high", $x'_C = 0$ and $x''_C = 1$ would mean "intermediate", and $x'_C = x''_C = 0$ would mean "low" (note that the combination $x'_C = 1$ and $x''_C = 0$ is not feasible). The resulting pdBf is shown in Table 3, which has an extension given by the following DNF:
$$f = x_B x''_C \vee x'_C. \qquad (2)$$
The DNF (2) may be considered simpler than the DNF (1) since it has fewer terms and literals. Furthermore, (2) possesses the important feature of being positive (i.e., monotone increasing) in all its variables, i.e., changing the value of one of the variables from 0 to 1 cannot decrease the value of the function from 1 to 0, whatever the values of the other variables are. In some cases, this fact itself may be regarded as important information for the understanding of the mechanism causing the phenomenon.

1.3 Outline of the paper


The simple examples in Section 1.2 show that there are many possible ways to introduce cut points, as well as to interpret (i.e., to derive extensions of) the resulting logical data. It is frequently important to find an extension which is as simple as possible under some well-defined criterion of simplicity, or to find an extension having a special functional form (e.g., positive), or both.

The problem of checking whether a pdBf (i.e., a binary data set partitioned into two classes) has an extension in a specified class $\mathcal{C}$ has been studied in relation to the detection of cause-effect relationships [10, 14], learning and identification of a Boolean function [1, 4, 23], structural analysis of binary data [5], and others. Besides studying extensions in the class of all Boolean functions, we shall pay special attention to the important classes of monotone, Horn, threshold, linear and quadratic functions (see Section 2.1 for their definitions). The class of "decomposable" Boolean functions and decomposable extensions of pdBfs were studied in [5]. The complexity of finding extensions of a pdBf in various classes of functions was considered in [7], and extended in [8] to the more general cases in which some bits in the data vectors may be missing. These problems were shown to be polynomially solvable for all of the above six classes, as will be recalled in Section 2.1. Several classes of Boolean functions which may be particularly relevant to LAD were studied in [12].
In this paper we show first that for any numerical data set and for any of the six classes of functions above it can be checked in polynomial time whether there exists a binarization admitting an extension in the given class. We also consider the special case in which at most one cut point per numerical attribute is used in the binarization, and show that this important variant is NP-hard for all of the six classes considered above.

The main focus of this paper is on the problem of finding the minimum number of cut points (and their locations) under the constraint that the resulting pdBf has an extension in some given class $\mathcal{C}$.

In order to provide an algorithmic framework for this minimization problem, we formulate it as a linear integer programming problem for all of the six classes above. For some of the classes this formulation turns out to be a set covering problem. If the problem sizes are not too large, these formulations can be used to solve the problems exactly, and even if the sizes are large, they can serve as starting points for the development of effective heuristic algorithms. Some heuristic algorithms were already implemented in the report [6], and we are currently elaborating on them to pursue further improvements.

We analyze the restricted case of the minimization problem in which the dimension of the numerical data set is a fixed constant, and show that this restricted problem is polynomially solvable for the class of linear functions. We also prove that the minimization problem for the class of positive Boolean functions can be solved in polynomial time when the numerical data set is of dimension 2.

Finally, it is shown that the minimization problem becomes NP-hard for all other classes considered in this paper, even if the dimension of the numerical data set is restricted.

It should be remarked that, although we do not formulate them explicitly, one can easily produce the dual forms of all the results in this paper for the six classes above using the well known Boolean duality between 0 and 1, $x$ and $\bar{x}$, and $\wedge$ and $\vee$. This is accomplished by swapping the true and the false points of the pdBf's and using the dual forms of the classes above. For example, the results about $\mathcal{C}_{LINEAR}$ can be reformulated for the class of Boolean functions represented by single elementary conjunctions; the results about $\mathcal{C}_{HORN}$ can be reformulated for the class of functions represented by Horn conjunctive normal forms, etc. Note that the class $\mathcal{C}_{TH}$ (as well as $\mathcal{C}_{ALL}$, of course) is dual to itself.

2 Definitions and Preliminaries

2.1 Partially defined Boolean functions and their extensions
For a Boolean function (or simply, function) $f: \{0,1\}^n \to \{0,1\}$, the vector $x \in \{0,1\}^n$ is called a true point (resp. false point) if $f(x) = 1$ (resp. $f(x) = 0$). $T(f)$ (resp. $F(f)$) denotes the set of true points (resp. false points) of $f$; clearly, $T(f) \cap F(f) = \emptyset$ and $T(f) \cup F(f) = \{0,1\}^n$. For two functions $f$ and $h$, we write $f \le h$ if $f(x) \le h(x)$ for all $x \in \{0,1\}^n$. A function is positive (or monotone) if $x \le y$ (i.e., $x_j \le y_j$ for $j = 1, 2, \ldots, n$) implies $f(x) \le f(y)$. A function $f$ is called threshold if there exist $n+1$ real numbers $w_j$, $j = 0, 1, 2, \ldots, n$, such that
$$f(x) = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} w_j x_j \ge w_0, \\ 0 & \text{if } \sum_{j=1}^{n} w_j x_j < w_0. \end{cases} \qquad (3)$$

The Boolean variables $x_1, x_2, \ldots, x_n$ and their complements $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$ are called literals. A conjunction of literals (or simply, a term) $t = \bigwedge_{j \in P_t} x_j \bigwedge_{j \in N_t} \bar{x}_j$, with $P_t \cap N_t = \emptyset$, is called an implicant of $f$ if $t \le f$ (where $t$ is understood as the function it represents). An implicant $t$ is called prime if there is no implicant $t' = \bigwedge_{j \in P_{t'}} x_j \bigwedge_{j \in N_{t'}} \bar{x}_j$ such that $P_{t'} \subseteq P_t$, $N_{t'} \subseteq N_t$ and $P_{t'} \cup N_{t'} \ne P_t \cup N_t$. A disjunction $\bigvee_{i=1}^{m} t_i$ of terms $t_i = \bigwedge_{j \in P_{t_i}} x_j \bigwedge_{j \in N_{t_i}} \bar{x}_j$, $i = 1, 2, \ldots, m$, is called a disjunctive normal form (DNF), and it is well-known that every Boolean function can be represented by a DNF. A DNF is a $k$-DNF if $|P_{t_i} \cup N_{t_i}| \le k$ for all $i$. In particular, we call a 2-DNF quadratic and a 1-DNF linear. A DNF $\bigvee_{i=1}^{m} t_i$ is Horn if every $t_i$ satisfies $|N_{t_i}| \le 1$. A function $f$ is called Horn if it has a Horn DNF. It is known (e.g., [11]) that $f$ is Horn if and only if $F(f)$ is closed under intersection (i.e., $x, y \in F(f)$ implies $x \wedge y \in F(f)$, where $(x \wedge y)_j = x_j \wedge y_j$, $j = 1, 2, \ldots, n$).

We shall consider in this paper the following special classes of Boolean functions: the class $\mathcal{C}_{ALL}$ of all Boolean functions, the class $\mathcal{C}_P$ of positive functions, the class $\mathcal{C}_{HORN}$ of Horn functions, the class $\mathcal{C}_{TH}$ of threshold functions, the class $\mathcal{C}_{LINEAR}$ of linear (i.e., 1-DNF) functions, and the class $\mathcal{C}_{QUAD}$ of quadratic (i.e., 2-DNF) functions.
A partially defined Boolean function (pdBf, for short) is defined by a pair $(T, F)$, where $T, F \subseteq \{0,1\}^n$. A function $f$ is an extension of the pdBf $(T, F)$ if $T \subseteq T(f)$ and $F \subseteq F(f)$ hold. The following six lemmas, proved in [7, 10], give the basis for our discussion.

Lemma 2.1 A pdBf $(T, F)$ has an extension in $\mathcal{C}_{ALL}$ if and only if $T \cap F = \emptyset$. □

Lemma 2.2 A pdBf $(T, F)$ has an extension in $\mathcal{C}_P$ if and only if there is no $a \in T$ and $b \in F$ such that $a \le b$. □

Lemma 2.3 A pdBf $(T, F)$ has an extension in $\mathcal{C}_{HORN}$ if and only if there is no $a \in T$ such that $a = \bigwedge_{b \in F:\, b \ge a} b$. □

Lemma 2.4 A pdBf $(T, F)$ has an extension in $\mathcal{C}_{TH}$ if and only if the system of inequalities
$$\sum_{j=1}^{n} a_j w_j \ge w_0 \quad \text{for all } a \in T, \qquad \text{and} \qquad \sum_{j=1}^{n} b_j w_j \le w_0 - 1 \quad \text{for all } b \in F,$$
has a solution. □

Lemma 2.5 Define $I_0$ and $I_1$ for a given pdBf $(T, F)$ by
$$I_0 = \{ j \in [n] \mid \text{all } b \in F \text{ satisfy } b_j = 1 \} \quad \text{and} \quad I_1 = \{ j \in [n] \mid \text{all } b \in F \text{ satisfy } b_j = 0 \}, \qquad (4)$$
where $[n] = \{1, 2, \ldots, n\}$. Then $(T, F)$ has an extension in $\mathcal{C}_{LINEAR}$ if and only if $f = \bigvee_{j \in I_1} x_j \vee \bigvee_{j \in I_0} \bar{x}_j$ satisfies $T(f) \supseteq T$. □

Lemma 2.6 Let $I_0$ and $I_1$ be as in (4), and define $I_{00}$, $I_{01}$ and $I_{11}$ for a given pdBf $(T, F)$ by
$$\begin{aligned}
I_{00} &= \{ (i,j) \in [n] \times [n] \mid \text{every } b \in F \text{ satisfies either } b_i = 1 \text{ or } b_j = 1 \}, \\
I_{01} &= \{ (i,j) \in [n] \times [n] \mid \text{every } b \in F \text{ satisfies either } b_i = 1 \text{ or } b_j = 0 \}, \qquad (5) \\
I_{11} &= \{ (i,j) \in [n] \times [n] \mid \text{every } b \in F \text{ satisfies either } b_i = 0 \text{ or } b_j = 0 \},
\end{aligned}$$
where $[n] = \{1, 2, \ldots, n\}$. Then the pdBf $(T, F)$ has an extension in $\mathcal{C}_{QUAD}$ if and only if
$$f = \bigvee_{j \in I_1} x_j \vee \bigvee_{j \in I_0} \bar{x}_j \vee \bigvee_{(i,j) \in I_{11}} x_i x_j \vee \bigvee_{(i,j) \in I_{01}} \bar{x}_i x_j \vee \bigvee_{(i,j) \in I_{00}} \bar{x}_i \bar{x}_j$$
satisfies $T(f) \supseteq T$. □

It is easy to see that the conditions in these lemmas can be checked in time polynomial in the input length of $(T, F)$, i.e., in $n(|T| + |F|)$. For example, the condition of Lemma 2.4 can be tested by linear programming in polynomial time.
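As an illustration of such checks, the sketch below (not from the report; it assumes the pdBf is given as lists of 0/1 tuples) tests the condition of Lemma 2.2 directly and the condition of Lemma 2.4 as a linear programming feasibility problem, here via scipy.optimize.linprog.

```python
# Sketch: checking two of the extension conditions for a pdBf (T, F).
# Lemma 2.2 (positive extension) is a direct componentwise test; Lemma 2.4
# (threshold extension) is a feasibility LP in the variables (w_1, ..., w_n, w_0).
import numpy as np
from scipy.optimize import linprog

def has_positive_extension(T, F):
    # no a in T and b in F with a <= b componentwise (Lemma 2.2)
    return not any(all(ai <= bi for ai, bi in zip(a, b)) for a in T for b in F)

def has_threshold_extension(T, F):
    n = len(T[0])
    # Lemma 2.4 rewritten as A_ub z <= b_ub with z = (w_1, ..., w_n, w_0):
    #   a in T:  -(sum_j a_j w_j) + w_0 <= 0
    #   b in F:    sum_j b_j w_j  - w_0 <= -1
    A_ub = [[-float(aj) for aj in a] + [1.0] for a in T] \
         + [[float(bj) for bj in b] + [-1.0] for b in F]
    b_ub = [0.0] * len(T) + [-1.0] * len(F)
    res = linprog(np.zeros(n + 1), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (n + 1))
    return res.status == 0      # feasible, hence an extension in C_TH exists

# The pdBf of Table 3 (variables x_B, x'_C, x''_C):
T = [(1, 0, 1), (0, 1, 1), (1, 0, 1)]
F = [(0, 0, 1), (1, 0, 0)]
print(has_positive_extension(T, F), has_threshold_extension(T, F))   # True True
```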

2.2 Cut points, master pdBf and extensions


Let us now return to the numerical data shown in Table 1. We assume that each observation is a $d$-dimensional real vector (i.e., there are $d$ attributes), and denote the set of positive observations by $S^+$ and the set of negative observations by $S^-$, respectively.
For each attribute $A$, let $u_A^{(0)} > u_A^{(1)} > \cdots > u_A^{(K)}$ be all the distinct values of the $A$-component of $u \in S^+ \cup S^-$, arranged in decreasing order. We shall associate to $A$ several cut points $\alpha_{Ai}$ and the corresponding binary indicator variables $x_{Ai}$ defined by
$$x_{Ai} = \begin{cases} 1 & \text{if } u_A \ge \alpha_{Ai}, \\ 0 & \text{if } u_A < \alpha_{Ai}. \end{cases}$$
The cut points will be chosen in a way which allows us to distinguish between positive and negative observations, if possible. Clearly, it is meaningless to introduce a cut point above $u_A^{(0)}$ or below $u_A^{(K)}$, or a cut point between any values $u_A^{(i)}$ and $u_A^{(i+1)}$ realized by vectors belonging only to $S^+$ or only to $S^-$, or more than one cut point between $u_A^{(i)}$ and $u_A^{(i+1)}$. Therefore, the largest set of cut points for the attribute $A$ is obtained when one cut point $\alpha_{Ai} = (u_A^{(i)} + u_A^{(i+1)})/2$ is introduced between each pair of $u_A^{(i)}$ and $u_A^{(i+1)}$ for which there are observations $u' \in S^+$ and $u'' \in S^-$ (or $u' \in S^-$ and $u'' \in S^+$) such that $u'_A = u_A^{(i)}$ and $u''_A = u_A^{(i+1)}$. If for an attribute $A$ we introduce $K'$ cut points, then the value $u_A$ is transformed to a $K'$-dimensional Boolean vector. When we introduce the largest set of cut points for each attribute, the resulting pdBf $(T^*, F^*)$ is called the master pdBf of $(S^+, S^-)$. Table 4 is the master pdBf of the numerical data in Table 1, where
$$\alpha_{A1} = 3.0, \quad \alpha_{A2} = 2.4, \quad \alpha_{A3} = 1.5, \qquad \alpha_{B1} = 3.0, \quad \alpha_{B2} = 2.0, \qquad \alpha_{C1} = 4.0, \quad \alpha_{C2} = 3.0, \quad \alpha_{C3} = 2.0.$$
Remark that the binarizations of Table 1 introduced in Tables 2 and 3 correspond to subsets of columns of the master pdBf in Table 4.
Table 4: The master pdBf of the set of numerical data in Table 1.

  Attributes                 A                  B              C
  Variables      x_A1  x_A2  x_A3    x_B1  x_B2    x_C1  x_C2  x_C3
  T                 1     1     1       1     1       0     0     1
                    0     1     1       0     0       1     1     1
                    0     0     0       0     1       0     1     1
  F                 1     1     1       0     0       0     1     1
                    0     0     1       0     1       0     0     0
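The construction of the master set of cut points is easy to implement; the following sketch (illustrative code, not the authors' implementation) places one candidate cut point halfway between consecutive distinct values of an attribute that are realized by observations of opposite classes. The midpoints it produces differ slightly from the rounded values quoted above, but any placement inside the same open intervals induces the same master pdBf.

```python
# Sketch: the master set of cut points of (S+, S-).  For each attribute, a candidate
# cut point is placed halfway between consecutive distinct values that are realized
# by observations of opposite classes.

def master_cut_points(S_pos, S_neg):
    d = len(S_pos[0])
    cuts = {}
    for A in range(d):
        classes = {}                              # value -> set of classes realizing it
        for u in S_pos:
            classes.setdefault(u[A], set()).add('+')
        for v in S_neg:
            classes.setdefault(v[A], set()).add('-')
        values = sorted(classes, reverse=True)    # u_A^(0) > u_A^(1) > ... > u_A^(K)
        cuts[A] = [(hi + lo) / 2.0 for hi, lo in zip(values, values[1:])
                   if ('+' in classes[hi] and '-' in classes[lo])
                   or ('-' in classes[hi] and '+' in classes[lo])]
    return cuts

S_pos = [(3.5, 3.8, 2.8), (2.6, 1.6, 5.2), (1.0, 2.1, 3.8)]
S_neg = [(3.5, 1.6, 3.8), (2.3, 2.1, 1.0)]
print(master_cut_points(S_pos, S_neg))
# {0: [3.05, 2.45, 1.65], 1: [2.95, 1.85], 2: [4.5, 3.3, 1.9]}
# 3 + 2 + 3 = 8 cut points, matching the eight indicator variables of Table 4
```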

The first problem considered in this paper is to decide whether there exists a set of cut points such that the resulting pdBf has an extension in a specified class $\mathcal{C}$.

Lemma 2.7 Let $\mathcal{C}$ be one of the six classes of functions listed in Section 1. If there is a set of cut points such that the resulting pdBf has an extension in $\mathcal{C}$, then the master pdBf also has an extension in $\mathcal{C}$.

Proof. Immediate from the definitions of these classes, since adding variables (corresponding to new cut points) to binary vectors does not prevent $(T, F)$ from having an extension in $\mathcal{C}$. □

Lemma 2.7 shows that the master pdBf is the richest representation needed for finding an extension in the classes considered.

For some classes of functions, the condition for the existence of an extension for the master pdBf can be stated directly in terms of the original numerical data set.

Lemma 2.8 Let us denote the master pdBf of a numerical data set $(S^+, S^-)$ by $(T^*, F^*)$. Then,
(i) $(T^*, F^*)$ has an extension in $\mathcal{C}_{ALL}$ if and only if $S^+ \cap S^- = \emptyset$;
(ii) $(T^*, F^*)$ has an extension in $\mathcal{C}_P$ if and only if there is no pair $u \in S^+$ and $v \in S^-$ such that $u \le v$.

Proof. Immediate from Lemmas 2.1 and 2.2, and the definition of the master pdBf. □
Let us consider the class $\mathcal{C}_{TH}$, and let us recall that two subsets $S^+$ and $S^-$ of the Euclidean space $\mathbb{R}^d$ are said to be linearly separable if there exist $d+1$ real numbers $W_j$, $j = 0, 1, 2, \ldots, d$, such that
$$\sum_{j=1}^{d} W_j u_j \ge W_0 \quad \text{for all } u \in S^+ \qquad \text{and} \qquad \sum_{j=1}^{d} W_j v_j < W_0 \quad \text{for all } v \in S^-. \qquad (6)$$
Linear separability of two sets $S^+$ and $S^-$ is well-known to be equivalent to the disjointness of their convex hulls, i.e. to $\mathrm{conv}\, S^+ \cap \mathrm{conv}\, S^- = \emptyset$. Let us remark that for finite sets the integrality of the coefficients $W_0$ and $W_j$, $j = 1, \ldots, d$, can also be assumed.

One can easily see that there are examples of $(S^+, S^-)$ which are not linearly separable, but for which a corresponding pdBf $(T, F)$ does have a threshold extension. If for example
$$S^+ = \{ u^{(1)} = (1.0, 1.0) \} \quad \text{and} \quad S^- = \{ v^{(1)} = (2.0, 0.0),\ v^{(2)} = (0.0, 2.0) \},$$
then
$$u^{(1)} = \tfrac{1}{2} v^{(1)} + \tfrac{1}{2} v^{(2)},$$
showing that $S^+$ and $S^-$ are not linearly separable. However, if we introduce the cut points $\alpha_A = 0.5$ and $\alpha_B = 0.5$, then the resulting pdBf $(T, F)$,
$$T = \{ a^{(1)} = (1, 1) \} \quad \text{and} \quad F = \{ b^{(1)} = (1, 0),\ b^{(2)} = (0, 1) \},$$
has a threshold extension with coefficients $w_1 = w_2 = 1.0$ and $w_0 = 2.0$.


Let us show next that whenever the sets $S^+$ and $S^-$ are linearly separable, i.e. there exist reals $W_j$, $j = 0, 1, \ldots, d$, satisfying (6), the master pdBf $(T^*, F^*)$ always has a threshold extension. Let us recall that $u_A^{(0)} > u_A^{(1)} > \cdots > u_A^{(K_A)}$ denote the distinct values of attribute $A$ (for $A = 1, \ldots, d$) occurring in $S^+ \cup S^-$, and that the binary values of the indicator variables
$$x_{Ai} = \begin{cases} 1 & \text{if } u_A > u_A^{(i)}, \\ 0 & \text{otherwise}, \end{cases}$$
for $i = 1, \ldots, K_A$ correspond to $u \in S^+ \cup S^-$ in the master pdBf $(T^*, F^*)$. Hence we have
$$\sum_{A=1}^{d} W_A u_A = \sum_{A=1}^{d} W_A u_A^{(K_A)} + \sum_{A=1}^{d} \sum_{i=1}^{K_A} x_{Ai} W_A \left( u_A^{(i-1)} - u_A^{(i)} \right).$$
Thus the coefficients $w_0 = W_0 - \sum_{A=1}^{d} W_A u_A^{(K_A)}$ and $w_{Ai} = W_A \left( u_A^{(i-1)} - u_A^{(i)} \right)$ for $A = 1, \ldots, d$ and $i = 1, \ldots, K_A$ define a threshold extension of the master pdBf $(T^*, F^*)$.

Theorem 2.1 A numerical data set $(S^+, S^-)$ is linearly separable only if its master pdBf $(T^*, F^*)$ has a threshold extension. □
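The coefficient construction used in this argument can be sketched as follows (illustrative code; the separating weights $W_0, W_1, \ldots, W_d$ of (6) are assumed to be given).

```python
# Sketch: from a linear separator (W_0, W_1, ..., W_d) of (S+, S-) as in (6), build the
# threshold weights of the master pdBf given by the identity above:
#   w_0    = W_0 - sum_A W_A * u_A^(K_A)     (u_A^(K_A) is the smallest value of A)
#   w_{Ai} = W_A * (u_A^(i-1) - u_A^(i)),    i = 1, ..., K_A,
# where w_{Ai} is the weight of the indicator variable x_{Ai} = [u_A > u_A^(i)].

def threshold_weights(W0, W, S_pos, S_neg):
    d = len(W)
    values = [sorted({p[A] for p in S_pos + S_neg}, reverse=True) for A in range(d)]
    w0 = W0 - sum(W[A] * values[A][-1] for A in range(d))
    w = {(A, i): W[A] * (values[A][i - 1] - values[A][i])
         for A in range(d) for i in range(1, len(values[A]))}
    return w0, w
```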
Given a numerical data set $(S^+, S^-)$, we can first construct its master pdBf $(T^*, F^*)$, and then check whether it has an extension in the class $\mathcal{C}$. This can be done by testing the conditions in the corresponding lemma that describes when an extension in this class exists. The whole task can be carried out in time polynomial in the input length $d(|S^+| + |S^-|)$, since the construction of the master pdBf $(T^*, F^*)$ can be done in $d(|S^+| + |S^-|)^2$ time, and checking the conditions of Lemmas 2.1-2.6 can also be done in time polynomial in $n(|T^*| + |F^*|)$ ($\le d(|S^+| + |S^-|)^2$). This establishes the following theorem.

Theorem 2.2 It can be checked in polynomial time whether the master pdBf $(T^*, F^*)$ of a numerical data set $(S^+, S^-)$ has an extension in any of the six classes of functions listed in Section 1. □

3 Single Cut Point for Each Attribute

In some cases, it is important to restrict the number of cut points introduced for each attribute. As a special case, we consider the following problem.

SINGLE-CP($\mathcal{C}$)
Input: A numerical data set $(S^+, S^-)$.
Question: Is there a set of cut points, with at most one cut point per attribute, such that the resulting pdBf $(T, F)$ has an extension in $\mathcal{C}$?

In this section we show that the problem SINGLE-CP($\mathcal{C}$) is NP-complete for all six classes introduced in Section 1.

Theorem 3.1 The problem SINGLE-CP($\mathcal{C}$) is NP-complete for each of the classes $\mathcal{C}_{ALL}$, $\mathcal{C}_{HORN}$, $\mathcal{C}_{TH}$, $\mathcal{C}_{LINEAR}$, and $\mathcal{C}_{QUAD}$.

Proof. It is clear that SINGLE-CP($\mathcal{C}$) is in NP for all the classes listed in the theorem statement. In order to prove that SINGLE-CP($\mathcal{C}$) is NP-hard, we shall reduce to it the 3SAT problem, a well-known NP-complete problem.

3SAT
Input: A set of clauses $C_i$, $i = 1, 2, \ldots, m$, each containing at most three literals from $\{X_1, \bar{X}_1, \ldots, X_n, \bar{X}_n\}$.
Question: Is there an assignment of values to the Boolean variables $X_1, X_2, \ldots, X_n$ such that all the clauses are satisfied?

We shall associate to a given set of clauses $C_i$, $i = 1, 2, \ldots, m$, a numerical data set $(S^+, S^-)$, involving $n$ attributes and $m+1$ input vectors, such that SINGLE-CP($\mathcal{C}$) has a solution for this $(S^+, S^-)$ if and only if the answer of 3SAT is YES. The set $S^-$ will consist of a single negative point $(1, 1, \ldots, 1)$, while the set $S^+$ will include a positive point $u^{(i)}$ for every clause $C_i = \bigvee_{j \in P_i} X_j \vee \bigvee_{j \in N_i} \bar{X}_j$, defined by
$$u_j^{(i)} = \begin{cases} 2 & \text{if } j \in P_i, \\ 0 & \text{if } j \in N_i, \\ 1 & \text{if } j \notin P_i \cup N_i. \end{cases}$$
For example, for the clauses $C_1 = (X_2 \vee \bar{X}_3 \vee \bar{X}_5)$, $C_2 = (\bar{X}_1 \vee X_4)$, $C_3 = (X_1 \vee \bar{X}_2 \vee X_5)$, and $C_4 = (X_3 \vee \bar{X}_4 \vee \bar{X}_5)$, the corresponding numerical data set $(S^+, S^-)$ will be

  S+: positive examples    1  2  0  1  0
                           0  1  1  2  1
                           2  0  1  1  2
                           1  1  2  0  0
  S-: negative examples    1  1  1  1  1

To solve SINGLE-CP($\mathcal{C}_{ALL}$), we shall introduce at most one cut point per attribute in such a way that the resulting pdBf $(T, F)$ has an extension in $\mathcal{C}_{ALL}$ (i.e. $T \cap F = \emptyset$). Clearly, for each attribute $j$, the cut point $\alpha_j$ must be either 0.5 or 1.5. Let us associate to the cut points the following binary assignment of the Boolean variables $X_j$:
$$X_j = \begin{cases} 1 & \text{if } \alpha_j = 1.5, \\ 0 & \text{if } \alpha_j = 0.5. \end{cases} \qquad (7)$$
With this definition, there is an assignment that satisfies all clauses $C_i$ if and only if the associated binarization produces a consistent pdBf $(T, F)$ with $T = \{ x(u^{(i)}) \mid u^{(i)} \in S^+ \}$ and $F = \{ x(v) \mid v = (1, 1, \ldots, 1) \}$. Indeed, point $u^{(i)} \in S^+$ can be separated from point $v \in S^-$ either by setting $\alpha_j = 1.5$ for some $j \in P_i$, or by setting $\alpha_j = 0.5$ for some $j \in N_i$, i.e. exactly when the associated binary assignment satisfies clause $C_i$.

Note that if the pdBf $(T, F)$ is consistent, then it has an extension represented by the DNF $\bigvee_{j=1}^{n} X_j^{\sigma_j}$, where $X_j^{\sigma_j} = X_j$ if $\alpha_j = 1.5$ and $X_j^{\sigma_j} = \bar{X}_j$ if $\alpha_j = 0.5$. Clearly, this function is linear, and hence it is also quadratic, Horn, and threshold. Therefore, $(S^+, S^-)$ has an extension in $\mathcal{C}_{ALL}$ if and only if it has an extension in all of the classes $\mathcal{C}_{HORN}$, $\mathcal{C}_{TH}$, $\mathcal{C}_{LINEAR}$, and $\mathcal{C}_{QUAD}$. □
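The reduction used in this proof is straightforward to generate; the sketch below (illustrative encoding: a clause is a list of signed variable indices) reproduces the example above.

```python
# Sketch: building the SINGLE-CP reduction of Theorem 3.1 from a 3SAT instance.
# A clause is a list of nonzero integers: +j stands for X_j, -j for its negation.

def reduction(clauses, n):
    S_pos = []
    for clause in clauses:
        u = [1] * n                       # attributes not occurring in the clause
        for lit in clause:
            u[abs(lit) - 1] = 2 if lit > 0 else 0
        S_pos.append(tuple(u))
    S_neg = [tuple([1] * n)]              # the single negative point (1, ..., 1)
    return S_pos, S_neg

# The example of the proof: C1 = (X2 v -X3 v -X5), C2 = (-X1 v X4),
# C3 = (X1 v -X2 v X5), C4 = (X3 v -X4 v -X5).
clauses = [[2, -3, -5], [-1, 4], [1, -2, 5], [3, -4, -5]]
S_pos, S_neg = reduction(clauses, n=5)
print(S_pos)   # [(1, 2, 0, 1, 0), (0, 1, 1, 2, 1), (2, 0, 1, 1, 2), (1, 1, 2, 0, 0)]
print(S_neg)   # [(1, 1, 1, 1, 1)]
```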

Theorem 3.2 The problem SINGLE-CP($\mathcal{C}_P$) is NP-complete.

Proof. It is clear that SINGLE-CP($\mathcal{C}_P$) is in NP. We shall prove the NP-hardness by a reduction from 3SAT, as defined in the proof of Theorem 3.1.

To accomplish this, we shall associate to a given set of clauses $C_i$, $i = 1, \ldots, m$, a numerical data set $(S^+, S^-)$, involving $n$ real attributes and $2m$ input vectors. Let us choose subsets $A_i \subseteq [n]$ of size $|A_i| = \lfloor n/2 \rfloor$ and $B_i = [n] \setminus A_i$, for $i = 1, \ldots, m$, such that $|A_i \cap B_k| \ge 7$ if $i \ne k$, and $|A_i \cap B_k| = 0$ if $i = k$. It is easy to see that such subsets exist if $n$ is large enough, e.g. when $m \le \lfloor n/14 \rfloor$. Furthermore, let us denote by $I_i$ the set of indices of the variables in the clause $C_i$, $i = 1, \ldots, m$, and define $S^+ = \{ a^{(i)} \mid i = 1, \ldots, m \}$ and $S^- = \{ b^{(i)} \mid i = 1, \ldots, m \}$ by
$$a_j^{(i)} = \begin{cases} +1 & \text{if } X_j \in C_i \text{ or } j \in A_i \setminus I_i, \\ 0 & \text{if } \bar{X}_j \in C_i, \\ -1 & \text{if } j \in B_i \setminus I_i, \end{cases} \qquad \text{and} \qquad b_j^{(i)} = \begin{cases} +1 & \text{if } j \in A_i \setminus I_i, \\ 0 & \text{if } X_j \in C_i, \\ -1 & \text{if } \bar{X}_j \in C_i \text{ or } j \in B_i \setminus I_i. \end{cases}$$

We can observe that, since each attribute can take only three distinct values, $+1$, $0$, $-1$, there are only two possible cut points, $\alpha_j = 0.5$ and $\alpha_j = -0.5$, for each $j$. Therefore, let us associate an assignment to the Boolean variables $X_j$ by
$$X_j = \begin{cases} 1 & \text{if } \alpha_j = 0.5, \\ 0 & \text{if } \alpha_j = -0.5. \end{cases} \qquad (8)$$
We claim that the cut points $\alpha_j$, $j = 1, \ldots, n$, define a pdBf which has a positive extension if and only if the associated assignment satisfies all clauses $C_i$, $i = 1, \ldots, m$. Let us remark first that for $i \ne k$, $a^{(i)}$ and $b^{(k)}$ are always positively separated in any $j \in (A_i \cap B_k) \setminus (I_i \cup I_k)$ (i.e., $a_j^{(i)} > \alpha_j > b_j^{(k)}$), regardless of the value of $\alpha_j$. Since $|I_i \cup I_k| \le 6$ and $|A_i \cap B_k| \ge 7$ for $i \ne k$, such an index $j$ always exists. We can also notice that the vectors $a^{(i)}$ and $b^{(i)}$ have the same components in all $j \notin I_i$. Hence, to separate them, there must be a $j \in I_i$ such that $X_j \in C_i$ and $\alpha_j = 0.5$, or $\bar{X}_j \in C_i$ and $\alpha_j = -0.5$. In other words, $a^{(i)}$ and $b^{(i)}$ will be separated positively if and only if the corresponding assignment (8) satisfies the clause $C_i$. This proves the claim, and concludes the proof. □

4 Minimization of the Number of Cut Points

Given a numerical data set $(S^+, S^-)$, we have shown above that the corresponding master pdBf $(T^*, F^*)$ can be used to check in polynomial time the existence of an extension in a given class $\mathcal{C}$. If such extensions exist, the next fundamental problem is to find a "most compact" extension. In this paper, the number of cut points will be used as a measure of compactness. Therefore, we shall study the problem

MIN-CP($\mathcal{C}$)
Input: A numerical data set $(S^+, S^-)$ such that its master pdBf has an extension in $\mathcal{C}$.
Output: A minimum cardinality set of cut points such that the corresponding pdBf $(T, F)$ has an extension in $\mathcal{C}$.

It will be seen in Section 4.1 that the problem MIN-CP($\mathcal{C}$) is NP-hard for all the six classes. Motivated by this, we shall also consider in Section 5 and in the Appendix the special cases in which the dimension $d$ of $S^+$ and $S^-$ is fixed to 2, to 3, and to an arbitrary constant $k$ (positive integer). These problems will be denoted as MIN-2CP($\mathcal{C}$), MIN-3CP($\mathcal{C}$), and MIN-kCP($\mathcal{C}$), respectively. It will be seen that even in these special cases only the problems MIN-2CP($\mathcal{C}_P$) and MIN-kCP($\mathcal{C}_{LINEAR}$) are polynomially solvable.

4.1 NP-hardness results

In this subsection we show that the problem MIN-CP($\mathcal{C}$) is NP-hard for all the six classes considered in Section 1. To prove this, we shall use a reduction from the minimum set covering problem, MIN-COVER, which is known to be NP-hard [13], and which is defined as follows.

MIN-COVER
Input: An $m \times d$ matrix $A = \{a_{ij}\}$, where $a_{ij} = 0$ or $1$ for all $i \in [m]$ and $j \in [d]$, satisfying $\sum_{j=1}^{d} a_{ij} \ge 1$ for all $i \in [m]$.
Output: A set of columns $J$ of $A$ such that $\sum_{j \in J} a_{ij} \ge 1$ for all $i \in [m]$, and the size $|J|$ is minimal.

Theorem 4.1 The problem MIN-CP($\mathcal{C}$) is NP-hard for all of the six classes defined in Section 2.

Proof. Given a MIN-COVER problem, let us define a numerical data set $(S^+, S^-)$ by
$$S^+ = \{ (a_{11}, a_{12}, \ldots, a_{1d}), \ldots, (a_{m1}, a_{m2}, \ldots, a_{md}) \} \quad \text{and} \quad S^- = \{ (0, 0, \ldots, 0) \}.$$
Although the data are already binary, it is necessary to introduce a cut point $\alpha_j = 0.5$ if the attribute $j$ is to be used as a Boolean variable $x_j$ in the pdBf $(T, F)$ obtained after binarization. Let $J$ be the set of indices $j$ for which the cut points $\alpha_j = 0.5$ are introduced. Then
$$T = \{ (a_{1j},\, j \in J), (a_{2j},\, j \in J), \ldots, (a_{mj},\, j \in J) \} \quad \text{and} \quad F = \{ (0, 0, \ldots, 0) \},$$
where each vector has dimension $|J|$. By Lemma 2.1, this pdBf $(T, F)$ has an extension in $\mathcal{C}_{ALL}$ if and only if $\sum_{j \in J} a_{ij} \ge 1$ holds for all $i \in [m]$. Since the number of cut points is $|J|$, $(T, F)$ is a desired pdBf if and only if $J$ is a solution to MIN-COVER. This proves that MIN-COVER is reduced to MIN-CP($\mathcal{C}_{ALL}$), and hence MIN-CP($\mathcal{C}_{ALL}$) is NP-hard.

Furthermore, notice that, if $J$ allows extensions in $\mathcal{C}_{ALL}$, one of them can be represented as $f = \bigvee_{j \in J} x_j$, which is positive, Horn, threshold and linear (hence quadratic). This means that MIN-CP($\mathcal{C}$) is NP-hard for all the six classes $\mathcal{C}_{ALL}$, $\mathcal{C}_P$, $\mathcal{C}_{HORN}$, $\mathcal{C}_{TH}$, $\mathcal{C}_{LINEAR}$, and $\mathcal{C}_{QUAD}$. □

As a remark, we note that the above construction uses at most one cut point $\alpha_j = 0.5$ for each attribute $j$ of $(S^+, S^-)$. Therefore, MIN-CP($\mathcal{C}$) remains NP-hard for all the six classes even after imposing the additional constraint of using at most one cut point for each attribute.

4.2 Integer programming formulations

In this subsection, we formulate MIN-CP($\mathcal{C}$) as an integer programming problem for all the classes $\mathcal{C}$ under consideration. In the first part we formulate the MIN-CP($\mathcal{C}$) problems for the classes $\mathcal{C}_{ALL}$, $\mathcal{C}_P$, and $\mathcal{C}_{LINEAR}$ as MIN-COVER problems (see also [10]). In the second part we give general integer programming formulations of MIN-CP($\mathcal{C}$) for $\mathcal{C}_{HORN}$, $\mathcal{C}_{QUAD}$ and $\mathcal{C}_{TH}$.

The integer programming formulations given in this subsection provide a good starting point for developing practical algorithms for the MIN-CP($\mathcal{C}$) problems. Although MIN-COVER is known to be NP-hard, various exact and heuristic algorithms have been developed for this problem, since it is encountered frequently in many operations research applications [2]. If problem sizes are not too large, there is a good chance of finding an exact solution. Heuristic algorithms can be used to solve large size problems, or to solve moderate size problems faster.
Let us first consider $\mathcal{C}_{ALL}$. Given a numerical data set $(S^+, S^-)$, we first construct the master pdBf $(T^*, F^*)$ as in Section 2.2. Since each indicator variable in the master pdBf corresponds to a cut point, solving MIN-CP($\mathcal{C}_{ALL}$) is equivalent to finding a minimum subset of the indicator variables which can still separate the true vectors in $T^*$ from the false vectors in $F^*$. To model this as MIN-COVER, let $y_j$ be a decision variable, $j = 1, \ldots, n$, describing whether the indicator variable $x_j$ of the master pdBf (and hence the cut point associated with it) is chosen ($y_j = 1$) or not ($y_j = 0$) in a minimum subset. For every pair of $u \in S^+$ and $v \in S^-$ let $x(u) \in T^*$ and $x(v) \in F^*$ respectively denote the $n$-dimensional Boolean vectors obtained from $u$ and $v$. For every $j = 1, \ldots, n$ let us define
$$a_j^{(u,v)} = \begin{cases} 1 & \text{if } x_j(u) \ne x_j(v), \\ 0 & \text{otherwise}. \end{cases} \qquad (9)$$
Consider now the following MIN-COVER problem.
$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{n} y_j \\
\text{subject to} \quad & \sum_{j=1}^{n} a_j^{(u,v)} y_j \ge 1, \quad \text{for all pairs } (u,v) \text{ such that } u \in S^+ \text{ and } v \in S^-, \qquad (10) \\
& y_j \in \{0, 1\}, \quad j = 1, 2, \ldots, n.
\end{aligned}$$
Here each constraint $\sum_j a_j^{(u,v)} y_j \ge 1$ guarantees that $u \in S^+$ and $v \in S^-$ are transformed into distinct Boolean vectors using only the chosen variables $x_j$. If a solution $y$ is feasible (i.e., satisfies all the constraints), then the resulting pdBf $(T, F)$ satisfies the condition of Lemma 2.1, and if $y$ is optimal, it gives the minimum number of cut points necessary to satisfy Lemma 2.1. Let $n$ be the number of cut points introduced to define the master pdBf. This MIN-COVER problem has $n$ variables and $|S^+| \cdot |S^-|$ constraints.

We extend now the same idea to the case of $\mathcal{C}_P$. This can be accomplished by simply changing the definition of $a_j^{(u,v)}$ as follows:
$$a_j^{(u,v)} = \begin{cases} 1 & \text{if } x_j(u) = 1 \text{ and } x_j(v) = 0, \\ 0 & \text{otherwise}. \end{cases}$$
The constraint $\sum_j a_j^{(u,v)} y_j \ge 1$ now guarantees that, for each pair of $u \in S^+$ and $v \in S^-$, at least one $j$ having the property $x_j(u) > x_j(v)$ is selected. Therefore, the resulting $(T, F)$, constructed from $(T^*, F^*)$ by using the coordinates $j$ for which $y_j = 1$, satisfies the condition in Lemma 2.2. This MIN-COVER formulation for $\mathcal{C}_P$ has the same number of variables and constraints as the above formulation for $\mathcal{C}_{ALL}$.
As the third case of a MIN-COVER formulation, we consider $\mathcal{C}_{LINEAR}$. In this case, Lemma 2.5 states that a cut point $\alpha_{Ai}$ for the attribute $A$ of the given data set $(S^+, S^-)$ is useful only if the resulting indicator variable $x_{Ai}$ belongs to one of the sets $I_0$ or $I_1$ defined in Lemma 2.5. This is possible if and only if either (i) $v_A \ge \alpha_{Ai}$ for all $v \in S^-$, or (ii) $v_A < \alpha_{Ai}$ for all $v \in S^-$.

Therefore, without loss of generality we can restrict our attention to the following two cut points for each attribute $A$:
$$\alpha_{A0} = \min\{ v_A \mid v \in S^- \} - \epsilon_0/2 \qquad \text{and} \qquad \alpha_{A1} = \max\{ v_A \mid v \in S^- \} + \epsilon_1/2; \qquad (11)$$
here $\epsilon_0 > 0$ is the gap between $\min\{ v_A \mid v \in S^- \}$ and the largest $u_A$, $u \in S^+$, which is smaller than $\min\{ v_A \mid v \in S^- \}$ (if there is no such $u_A$, the cut point $\alpha_{A0}$ is not needed), and $\epsilon_1 > 0$ is defined in a similar way. The situation is illustrated in Figure 1 for the two dimensional case, where $x_{Ai}$ and $x_{Bj}$ are the indicator variables associated with the cut points $\alpha_{Ai}$ and $\alpha_{Bj}$, respectively.

Figure 1: Possible cut points for $\mathcal{C}_{LINEAR}$.

The problem MIN-COVER can now be constructed for $\mathcal{C}_{LINEAR}$ in the same manner as the formulation for $\mathcal{C}_{ALL}$. The only difference from the case $\mathcal{C}_{ALL}$ is that here each $y_j$ corresponds to $\alpha_{A0}$ or to $\alpha_{A1}$ for some attribute $A$. A solution $y$ of MIN-COVER gives the 1-DNF
$$f = \left( \bigvee_{y_{A1} = 1} x_{A1} \right) \vee \left( \bigvee_{y_{A0} = 1} \bar{x}_{A0} \right),$$
which is an extension of $(T, F)$. This MIN-COVER problem has only $2d$ variables, where $d$ is the dimension of the vectors in $(S^+, S^-)$. The number of constraints remains the same as in the above two cases.
We shall now present integer programming formulations of MIN-CP($\mathcal{C}$) for $\mathcal{C}_{QUAD}$, $\mathcal{C}_{HORN}$ and $\mathcal{C}_{TH}$. The problem MIN-CP($\mathcal{C}_{QUAD}$) can be stated as an integer program which resembles MIN-COVER. For each linear or quadratic term $t_k$ which corresponds to an element in $I_0, I_1, I_{00}, I_{01}, I_{11}$ of Lemma 2.6, let us introduce a decision variable $z_k \in \{0, 1\}$ ($k = 1, 2, \ldots, K$). Clearly, $K \le n + 3\binom{n}{2}$, unless $F^* = \emptyset$. Let us denote the degree of $t_k$ by $d_k$ (here, $d_k$ is either 1 or 2), and define for each $u \in S^+$
$$a_k^{(u)} = \begin{cases} 1 & \text{if } t_k(x(u)) = 1, \\ 0 & \text{otherwise}, \end{cases} \qquad \text{and} \qquad b_j^{(k)} = \begin{cases} 1 & \text{if either } x_j \text{ or } \bar{x}_j \text{ is involved in } t_k, \\ 0 & \text{otherwise}. \end{cases} \qquad (12)$$
Note that by definition $\sum_{j=1}^{n} b_j^{(k)} = d_k$ for all $k = 1, 2, \ldots, K$.

Similarly to the previously discussed cases, we introduce a decision variable $y_j$ for each indicator variable $x_j$ in the master pdBf (and hence for the cut point associated with it) to describe whether $x_j$ is selected ($y_j = 1$) or not ($y_j = 0$). Then an integer program for MIN-CP($\mathcal{C}_{QUAD}$) is the following:
$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{n} y_j \\
\text{subject to} \quad & \sum_{k=1}^{K} a_k^{(u)} z_k \ge 1, \quad \text{for all } u \in S^+ \\
& \sum_{j=1}^{n} b_j^{(k)} y_j \ge d_k z_k, \quad \text{for all } k = 1, \ldots, K \qquad (13) \\
& y_j \in \{0, 1\}, \quad j = 1, 2, \ldots, n \\
& z_k \in \{0, 1\}, \quad k = 1, 2, \ldots, K.
\end{aligned}$$
The constraint $\sum_{k=1}^{K} a_k^{(u)} z_k \ge 1$ guarantees that at least one quadratic (or linear) term $t_k$ with $t_k(x(u)) = 1$ is chosen for every $u \in S^+$. The constraint $\sum_{j=1}^{n} b_j^{(k)} y_j \ge d_k z_k$ becomes redundant if $z_k = 0$, but if $z_k = 1$ it enforces $y_j = 1$ whenever $b_j^{(k)} = 1$ (as a result of $\sum_{j=1}^{n} b_j^{(k)} = d_k$). This guarantees that all the indicator variables $x_j$ in each chosen term $t_k$ (i.e., with $z_k = 1$) are selected. Therefore, any feasible solution of (13) defines a pdBf that satisfies the conditions of Lemma 2.6, and the optimum value of the objective function gives the minimum number of cut points necessary to satisfy Lemma 2.6. The integer program (13) has at most $2n + 3\binom{n}{2}$ variables and at most $|S^+| + n + 3\binom{n}{2}$ constraints.

It is straightforward to see how the above integer program can be modified for finding the minimum number of cut points such that the resulting pdBf has an extension in the class of $d$-DNFs, for any fixed $d$.
The integer program (13) can be modified for solving MIN-CP($\mathcal{C}_{HORN}$), based on the following observation. For $u \in S^+$, let $J_u = \{ j \mid x_j(u) = 1,\ j = 1, 2, \ldots, n \}$. Then a Horn extension exists if and only if, for every $u \in S^+$, at least one of the Horn terms $\bigwedge_{j \in J_u} x_j$ and $\bar{x}_i \bigwedge_{j \in J_u} x_j$ for $i \in \{1, 2, \ldots, n\} \setminus J_u$ takes the value 0 on all points $x(v)$, $v \in S^-$. Therefore, a Horn DNF (if any) can be constructed by taking the disjunction of a number of such terms. It should be noted that some literals may be removed from the selected terms in order to decrease the number of cut points, as long as the resulting DNF still represents a Horn extension of $(S^+, S^-)$. To achieve this goal, let $t_k$, $k = 1, 2, \ldots, K$, denote all the terms of the above type. Clearly, $K \le n|S^+|$. Let us introduce decision variables $z_k$ ($k = 1, 2, \ldots, K$) for such terms, and define
$$a_k^{(u)} = \begin{cases} 1 & \text{if } t_k(x(u)) = 1, \\ 0 & \text{otherwise}, \end{cases} \qquad \text{and} \qquad b_j^{(v,k)} = \begin{cases} 1 & \text{if } x_j(v) = 0 \text{ and } t_k \text{ contains the literal } x_j, \\ & \quad \text{or } x_j(v) = 1 \text{ and } t_k \text{ contains } \bar{x}_j, \\ 0 & \text{otherwise}. \end{cases}$$

Let us further introduce a decision variable $y_j$ to describe whether the Boolean variable $x_j$ in the master pdBf (and hence the cut point associated with it) is chosen ($y_j = 1$) or not ($y_j = 0$). The resulting integer program for the solution of MIN-CP($\mathcal{C}_{HORN}$) becomes
$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{n} y_j \\
\text{subject to} \quad & \sum_{k=1}^{K} a_k^{(u)} z_k \ge 1, \quad \text{for all } u \in S^+ \\
& \sum_{j=1}^{n} b_j^{(v,k)} y_j \ge z_k, \quad \text{for all } k = 1, 2, \ldots, K \text{ and for all } v \in S^- \qquad (14) \\
& y_j \in \{0, 1\}, \quad j = 1, 2, \ldots, n \\
& z_k \in \{0, 1\}, \quad k = 1, 2, \ldots, K.
\end{aligned}$$
The constraint $\sum_{k=1}^{K} a_k^{(u)} z_k \ge 1$ guarantees that for every $u \in S^+$ there is at least one selected Horn term $t_k$ which satisfies $t_k(x(u)) = 1$. The constraint $\sum_{j=1}^{n} b_j^{(v,k)} y_j \ge z_k$ guarantees that, for every selected term $t_k$ and every $v \in S^-$, at least one indicator variable $x_j$ involved in this term will also be selected, guaranteeing that this term takes the value 0 on $v$. Therefore, any feasible solution of (14) defines a pdBf which satisfies the conditions of Lemma 2.3, and if $y$ is optimal, the optimum value gives the minimum number of cut points necessary to satisfy Lemma 2.3. The integer linear program (14) has at most $n + n|S^+|$ variables and at most $|S^+| + n|S^+| \cdot |S^-|$ constraints.
The integer program for the solution of MIN-CP($\mathcal{C}_{TH}$) is based on Lemma 2.4, which requires the following linear system to have a solution:
$$\sum_{j=1}^{n} a_j w_j \ge w_0 \quad \text{for all } a \in T^*, \qquad \sum_{j=1}^{n} b_j w_j \le w_0 - 1 \quad \text{for all } b \in F^*. \qquad (15)$$
For the development that follows, we need to establish an upper bound $M$ on the absolute values of the $w$'s. Since the number of variables in (15) is $n+1$ and since all the coefficients are $+1$, $-1$, or $0$, each component of a solution (if any) can be represented as the ratio of two determinants of dimension $n+1$ with elements $+1$, $-1$, or $0$ (see, e.g., [3]). Based on this, we can use the Hadamard inequality to show that $M$ satisfies
$$M \le n^{n/2}. \qquad (16)$$
Introducing now a decision variable $y_j$ to describe whether the Boolean variable $x_j$ in the master pdBf (and hence the cut point associated with it) is chosen ($y_j = 1$) or not ($y_j = 0$), a mixed integer program for MIN-CP($\mathcal{C}_{TH}$) can be formulated as follows:
$$\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{n} y_j \\
\text{subject to} \quad & \sum_{j=1}^{n} a_j w_j \ge w_0, \quad \text{for all } a \in T^* \\
& \sum_{j=1}^{n} b_j w_j \le w_0 - 1, \quad \text{for all } b \in F^* \qquad (17) \\
& w_j \ge -M y_j, \quad j = 1, 2, \ldots, n \\
& w_j \le M y_j, \quad j = 1, 2, \ldots, n \\
& y_j \in \{0, 1\}, \quad j = 1, 2, \ldots, n,
\end{aligned}$$
where $M$ satisfies (16).
If $y_j = 0$, the constraints $w_j \ge -M y_j$ and $w_j \le M y_j$ force $w_j = 0$, while if $y_j = 1$, these constraints have no influence on the existence of a solution. The first two groups of constraints in (17) are the separation conditions of Lemma 2.4. Therefore, any feasible solution of (17) defines a pdBf that satisfies the conditions of Lemma 2.4, and if $y$ is optimal, it gives the minimum number of cut points necessary to satisfy Lemma 2.4. The mixed integer linear program (17) has $n$ binary variables, $n+1$ real variables, and $2n + |S^+| + |S^-|$ constraints.

Finally, note that all the integer programs formulated in this subsection can be easily modified for the case of a single cut point for each attribute. Indeed, if the set $J_A$ denotes all the indices of cut points for the attribute $A$, then it is sufficient to add to each problem the following $d$ constraints:
$$\sum_{j \in J_A} y_j \le 1, \quad \text{for all attributes } A.$$

5 Low dimensional data sets

5.1 Polynomially solvable cases

Although MIN-CP($\mathcal{C}$) is NP-hard for all the classes under consideration (as shown in Theorem 4.1), we point out in this subsection that in certain cases it can be solved in polynomial time if the dimension $d$ of the given data set is restricted. Two such cases are MIN-kCP($\mathcal{C}_{LINEAR}$) and MIN-2CP($\mathcal{C}_P$).

We shall first consider the problem MIN-kCP($\mathcal{C}_{LINEAR}$). As described in Section 4.2, this problem can be formulated as a MIN-COVER problem having $2d$ variables and $|S^+| \cdot |S^-|$ constraints. Since $d$ is limited by a constant $k$, in order to solve this problem we can generate all the possible vectors $y \in \{0, 1\}^{2k}$ and find a feasible vector that minimizes $\sum_j y_j$. Since the number of possible Boolean vectors is bounded by the constant $2^{2k}$, the whole process can be performed in polynomial time, thus proving the next theorem.

Theorem 5.1 If the dimension $d$ of the given data set $(S^+, S^-)$ is limited by a constant $k$, then the problem MIN-kCP($\mathcal{C}_{LINEAR}$) can be solved in polynomial time. □
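For fixed dimension this enumeration is immediate to implement. A sketch is given below; it assumes that the 0/1 cover matrix of the MIN-COVER formulation (one row per pair $(u, v)$, one column per candidate cut point) has already been built.

```python
# Sketch: brute-force solution of MIN-kCP(C_LINEAR) for fixed dimension d.
# cover[r][j] = 1 iff the j-th candidate cut point separates the r-th pair (u, v)
# in the sense of Lemma 2.5; there are at most 2d columns.
from itertools import combinations

def min_linear_cuts(cover):
    m = len(cover)
    n = len(cover[0]) if m else 0
    for size in range(n + 1):                 # at most 2^(2d) subsets in total
        for J in combinations(range(n), size):
            if all(any(row[j] for j in J) for row in cover):
                return list(J)                # a minimum feasible set of columns
    return None                               # no feasible cover exists
```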

Before studying the problem MIN-2CP($\mathcal{C}_P$), let us consider a data set $(S^+, S^-)$ and its extension in the two-dimensional space (i.e., $d = 2$), where the two attributes are denoted $A$ and $B$. The positive and negative observations are represented in the 2-dimensional plane as shown in Figure 2(a). We draw vertical lines corresponding to the cut points introduced for $A$, and horizontal lines for $B$, thus creating a number of rectangular regions. A minimal rectangular region (not containing other rectangular regions) is called a cell, and each cell corresponds to a unique Boolean vector resulting from the binarization using the cut points (see Figure 2(b)). The resulting pdBf $(T, F)$ is obtained by defining $T$ as the set of all the Boolean vectors corresponding to the cells containing positive observations, and defining $F$ as the set of all the Boolean vectors corresponding to the cells containing negative observations.

Recall that the master pdBf of a numerical data set $(S^+, S^-)$ has a positive extension if and only if no pair of $u \in S^+$ and $v \in S^-$ satisfies $u \le v$ (by Lemma 2.8). We say that $(S^+, S^-)$ is positively oriented if this condition is satisfied. For example, the data set $(S^+, S^-)$ in Figure 3 is positively oriented.

Let us call a set of cut points (or the associated pdBf $(T, F)$) consistent if no resulting cell contains both positive and negative observations. Lemma 2.1 states that there is an extension in $\mathcal{C}_{ALL}$ if and only if the given set of cut points is consistent. The same statement also holds for the class $\mathcal{C}_P$ if the data set $(S^+, S^-)$ is positively oriented. So the problem here is how to introduce a minimum number of vertical and horizontal lines in order to achieve consistency.

We shall call a pair of positive and negative points in the 2-dimensional plane critical if the minimum rectangle containing them contains no other point of $S^+ \cup S^-$. The broken line segments in Figure 2 indicate all critical pairs in this data set. We say that a line (vertical or horizontal) separates a critical pair if the two points of the pair are located on opposite sides of the line. The following lemma is a restatement of the above argument.

Lemma 5.1 Given a 2-dimensional positively oriented data set $(S^+, S^-)$, let $(T, F)$ denote the pdBf resulting from a set of cut points. Then the following three conditions are equivalent:
(i) $(T, F)$ has an extension in $\mathcal{C}_P$;
(ii) the lines of the cut points separate all critical pairs;
(iii) the set of cut points is consistent. □

Figure 2: A 2-dimensional data set. (a) Positive and negative examples, and critical pairs (indicated by broken line segments); (b) the Boolean vectors $(x_{A1}, x_{A2}, x_{A3}, x_{B1}, x_{B2})$ associated with the cells determined by the cut points $\alpha_{A1}, \alpha_{A2}, \alpha_{A3}$ and $\alpha_{B1}, \alpha_{B2}$.



Figure 3: An example of a positively oriented data set $(S^+, S^-)$, with the extreme positive and negative points marked and the corner points $u^{(1)}, \ldots, u^{(5)}$ and $v^{(1)}, \ldots, v^{(6)}$ labeled (here $v^{(3)} = v^{(4)}$, $v^{(5)} = v^{(6)}$, and $u^{(4)} = u^{(5)}$).
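Condition (iii) of Lemma 5.1 can be tested directly: the sketch below (our illustration) maps every observation to its cell with respect to a given set of vertical and horizontal cut points and checks that no cell is mixed.

```python
# Sketch: checking whether a set of cut points is consistent (Lemma 5.1(iii)) for a
# 2-dimensional data set: no cell may contain both positive and negative observations.
from bisect import bisect_right

def is_consistent(S_pos, S_neg, cuts_A, cuts_B):
    cuts_A, cuts_B = sorted(cuts_A), sorted(cuts_B)
    def cell(p):
        # cell index = number of cut points <= coordinate (convention: x = 1 iff value >= cut)
        return (bisect_right(cuts_A, p[0]), bisect_right(cuts_B, p[1]))
    return {cell(u) for u in S_pos}.isdisjoint({cell(v) for v in S_neg})
```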

We shall now describe a polynomial time procedure to solve MIN-2CP($\mathcal{C}_P$).

Let $u', u'' \in S^+$ and $v', v'' \in S^-$. The two pairs $(u', v')$ and $(u'', v'')$ will be called independent if there is no cut point that separates both of these pairs (i.e., none of the vertical or horizontal lines separates both $(u', v')$ and $(u'', v'')$). In other words, the open interval $(v'_A, u'_A)$ does not intersect the interval $(v''_A, u''_A)$, and the interval $(v'_B, u'_B)$ does not intersect the interval $(v''_B, u''_B)$. For example, the positively oriented $(S^+, S^-)$ in Figure 4 contains five pairwise independent pairs, indicated by solid line segments. It is clear that the number of cut points needed for the pdBf $(T, F)$ to be consistent is at least the maximum number of pairwise independent pairs in $(S^+, S^-)$, because no line can separate more than one pair of a family of pairwise independent pairs. The polynomial time algorithm to be given below is a by-product of the following minimax result.

Theorem 5.2 If a given data set $(S^+, S^-)$ is positively oriented, then the maximum number of pairwise independent pairs in $(S^+, S^-)$ equals the minimum number of cut points needed to produce a consistent pdBf $(T, F)$.

After introducing some more notations and definitions, we shall prove this statement by actually presenting a construction of a consistent separation that involves as many cut points as the number of pairwise independent pairs.

Let us call a positive point $u \in S^+$ extreme if there is no distinct positive point $u' \in S^+$ such that $u' < u$. Let us denote by $S^+_{ex}$ the set of extreme positive points in $S^+$. Similarly, a negative point $v \in S^-$ is called extreme if there is no distinct negative point $v' \in S^-$ such that $v' > v$. Let us denote by $S^-_{ex}$ the set of extreme negative points in $S^-$. The significance of extreme points consists in the fact that any separation of all critical pairs of extreme points will at the same time be a separation of all critical pairs in Lemma 5.1(ii). In Figure 3, there are five extreme positive points and five extreme negative points, which are indicated as double circles.

For a point $w$, let us denote by $NW(w)$ the region of all 2-dimensional points located to the "north-west" of $w$. More precisely,
$$NW(w) = \{ w' \mid w'_A \le w_A,\ w'_B \ge w_B \}.$$
Similarly,
$$NE(w) = \{ w' \mid w'_A \ge w_A,\ w'_B \ge w_B \}, \quad SW(w) = \{ w' \mid w'_A \le w_A,\ w'_B \le w_B \}, \quad SE(w) = \{ w' \mid w'_A \ge w_A,\ w'_B \le w_B \}.$$
We say that a point $w'$ precedes another point $w''$ if $w' \in NW(w'')$, and denote this relation by $w' \preceq w''$. Since any two extreme points $u', u'' \in S^+_{ex}$ satisfy either $u' \in NW(u'')$ or $u'' \in NW(u')$, this precedence relation is a linear order on $S^+_{ex}$. Similarly, this relation is also a linear order on $S^-_{ex}$.


We shall now describe a method to construct a set of extreme points which will be used to find pairwise independent pairs for Theorem 5.2, and then to find cut points that separate them. Let $u^{(1)}$ be the $\preceq$-smallest element in $S^+_{ex}$, and $v^{(1)}$ be the $\preceq$-smallest element in $S^-_{ex}$. Recursively, let $u^{(i+1)}$ be the $\preceq$-smallest element in $SE(v^{(i)}) \cap S^+_{ex}$, and $v^{(i+1)}$ be the $\preceq$-smallest element in $SE(u^{(i)}) \cap S^-_{ex}$. In this process, if $SE(v^{(k)}) \cap S^+_{ex}$ or $SE(u^{(k)}) \cap S^-_{ex}$ is empty, then the corresponding $(k+1)$-th point is not defined, and the construction thereafter is stopped. Let us call all the constructed points $u^{(i)}$ and $v^{(i)}$, $i = 1, 2, \ldots$, corner positive and corner negative points, respectively. Note that $SE(u^{(i+1)}) \subseteq SE(u^{(i)})$ and $SE(v^{(i+1)}) \subseteq SE(v^{(i)})$ hold for all $i$. Figure 3 contains corner positive points $u^{(1)}, u^{(2)}, \ldots, u^{(5)}$ and corner negative points $v^{(1)}, v^{(2)}, \ldots, v^{(6)}$, for which $v^{(3)} = v^{(4)}$, $v^{(5)} = v^{(6)}$ and $u^{(4)} = u^{(5)}$ hold.

Assume that the positive corner points $u^{(1)}, u^{(2)}, \ldots, u^{(k_u)}$ and the negative corner points $v^{(1)}, v^{(2)}, \ldots, v^{(k_v)}$ are constructed in the above process. Since the construction alternates between positive and negative points, it is easy to see that $k_u$ and $k_v$ differ by at most one. Let us denote
$$k = \min\{ k_u, k_v \}.$$
Clearly, $k \ge 1$ holds if $S^+ \ne \emptyset$ and $S^- \ne \emptyset$. It is important to notice that the above construction rule implies that the pairs $(u^{(i)}, v^{(i)})$ and $(u^{(j)}, v^{(j)})$ are independent for any $i \ne j$. Therefore, we thus find $k$ pairwise independent pairs $(u^{(i)}, v^{(i)})$, $i = 1, 2, \ldots, k$ (represented in the example of Figure 3 by solid line segments).
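The construction of the corner points can be sketched as follows (illustrative code, assuming a positively oriented 2-dimensional data set given as lists of $(A, B)$ pairs; the $\preceq$-smallest element is obtained by sorting on increasing $A$ and, for equal $A$, decreasing $B$). From the returned lists, the staircase cut points $\alpha_A^{(i)}$ and $\alpha_B^{(i)}$ can then be chosen as in the proof of Theorem 5.2 below.

```python
# Sketch: extreme points, corner points, and the k independent pairs (u^(i), v^(i))
# for a positively oriented 2-dimensional data set; points are (A, B) pairs.

def extreme_pos(S_pos):
    # positive points u with no distinct positive point u' <= u componentwise
    return [u for u in S_pos
            if not any(v != u and v[0] <= u[0] and v[1] <= u[1] for v in S_pos)]

def extreme_neg(S_neg):
    # negative points v with no distinct negative point v' >= v componentwise
    return [v for v in S_neg
            if not any(w != v and w[0] >= v[0] and w[1] >= v[1] for w in S_neg)]

def nw_min(points):
    # the "precedes"-smallest point: smallest A and, among those, largest B
    return min(points, key=lambda p: (p[0], -p[1])) if points else None

def corner_points(S_pos, S_neg):
    Sp, Sn = extreme_pos(S_pos), extreme_neg(S_neg)
    U, V = [], []
    u, v = nw_min(Sp), nw_min(Sn)
    while u is not None and v is not None:
        U.append(u)
        V.append(v)
        u_next = nw_min([p for p in Sp if p[0] >= v[0] and p[1] <= v[1]])  # SE(v) within S+_ex
        v_next = nw_min([q for q in Sn if q[0] >= u[0] and q[1] <= u[1]])  # SE(u) within S-_ex
        u, v = u_next, v_next
    return U, V   # the pairs (U[i], V[i]), i = 1, ..., k, are pairwise independent
```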
For a pair $(u^{(i)}, v^{(i)})$, the two points $u^{(i)}$ and $v^{(i)}$ may be either $\preceq$-comparable or not. If they are not $\preceq$-comparable, then $v^{(i)} \le u^{(i)}$. On the other hand, if they are $\preceq$-comparable, the subsequent pairs have a very special structure, as shown in the next lemma.

Lemma 5.2 If there exists an $i$ such that $u^{(i)} \preceq v^{(i)}$, then $v^{(i)} = v^{(i+1)}$ and $v^{(i+1)} \preceq u^{(i+1)}$. Similarly, if $v^{(i)} \preceq u^{(i)}$, then $u^{(i)} = u^{(i+1)}$ and $u^{(i+1)} \preceq v^{(i+1)}$.

Proof. We consider only the case of $u^{(i)} \preceq v^{(i)}$, since the other case can be proved in a similar way. In this case we have $v^{(i)} \in SE(u^{(i)})$. Moreover, by definition, $v^{(i)}$ is the $\preceq$-smallest element in $SE(u^{(i)}) \cap S^-_{ex}$, since if there were a $\preceq$-smaller element $v'$ in $SE(u^{(i)}) \cap S^-_{ex}$, then $v'$ would satisfy $v' \in SE(u^{(i)}) \cap S^-_{ex} \subseteq SE(u^{(i-1)}) \cap S^-_{ex}$ and $v'$ would play the role of $v^{(i)}$, a contradiction. Therefore, $v^{(i+1)} = v^{(i)}$. If $u^{(i+1)}$ exists, then $u^{(i+1)} \in SE(v^{(i)}) \cap S^+_{ex}$ holds by definition, and it follows that $v^{(i+1)} = v^{(i)} \preceq u^{(i+1)}$. □
As a result of Lemma 5.2, if there is a pair $(u^{(i)}, v^{(i)})$ such that $u^{(i)} \preceq v^{(i)}$, then
$$u^{(i)} \preceq v^{(i)} = v^{(i+1)} \preceq u^{(i+1)} = u^{(i+2)} \preceq \cdots \preceq v^{(k_v-1)} = v^{(k_v)} \quad (\text{or } u^{(k_u-1)} = u^{(k_u)}).$$
A similar property holds also for the case of $v^{(i)} \preceq u^{(i)}$.

In the example of Figure 3, the five independent pairs of corner points satisfy $v^{(1)} \preceq u^{(1)}$, $v^{(2)} \preceq u^{(2)}$, and $u^{(3)} \preceq v^{(3)} = v^{(4)} \preceq u^{(4)} = u^{(5)} \preceq v^{(5)} = v^{(6)}$.

Let
$$\epsilon_A = \min\{ |w'_A - w''_A| \,:\, w', w'' \in S^+ \cup S^-,\ w'_A \ne w''_A \}, \qquad \epsilon_B = \min\{ |w'_B - w''_B| \,:\, w', w'' \in S^+ \cup S^-,\ w'_B \ne w''_B \}.$$
It can be seen that if $w, w' \in S^+ \cup S^-$, then $w'_A \notin (w_A - \tfrac{1}{2}\epsilon_A,\, w_A)$ and $w'_B \notin (w_B,\, w_B + \tfrac{1}{2}\epsilon_B)$. For the subsequent construction we shall introduce the following potential cut points $\alpha_A = \alpha_A^{(i)}$ and $\alpha_B = \alpha_B^{(i)}$:
$$\alpha_A^{(i)} = u_A^{(i)} - \tfrac{1}{2}\epsilon_A, \qquad \alpha_B^{(i)} = v_B^{(i)} + \tfrac{1}{2}\epsilon_B.$$
We are now ready to turn to
Proof of Theorem 5.2. We have seen that the above procedure constructs k pairwise
independent pairs. We shall prove the theorem by constructing the same number k of cut
points, which separate all critical pairs.
The selection of the cut points starts from the very last (i.e., the k-th) pair u^(k) and v^(k). Let us consider first the case in which there exists an i ≤ k such that the points u^(i) and v^(i) are ⪯-comparable. Then the corner points u^(k) and v^(k) are ⪯-comparable by Lemma 5.2. If u^(k) ⪯ v^(k), then we choose β^(k) as the first cut point, and continue choosing cut points alternating between cut points on B and on A (β^(k), α^(k-1), β^(k-2), ...) until the first pair (u^(1), v^(1)) is separated. If, on the other hand, v^(k) ⪯ u^(k), then we choose alternately α^(k), β^(k-1), α^(k-2), ... until the first pair (u^(1), v^(1)) is separated. These alternating cut points form a staircase, as shown in Figure 4. To show that this staircase separates all critical pairs in (S⁺, S⁻), it is sufficient to prove that there is no positive point to the south-west of this staircase, and no negative point to the north-east of it.

Let us consider in detail the case of u^(k) ⪯ v^(k), since the case of v^(k) ⪯ u^(k) can be treated in a similar way. We show first that there is no negative point to the north-east of the staircase. For the area NE(α^(k-1), β^(k)), this amounts to showing that there is no negative point in NE(u^(k-1)_A, v^(k)_B) (except for v^(k)). Obviously,

NE(u^(k-1)_A, v^(k)_B) = NE(u^(k-1)) ∪ NE(v^(k)) ∪ (SE(u^(k-1)) ∩ NW(v^(k))).

The area NE(u^(k-1)) does not contain a negative point, since any w ∈ NE(u^(k-1)) satisfies w ≥ u^(k-1) and the data set (S⁺, S⁻) is positively oriented. The area NE(v^(k)) does not contain any negative point, since v^(k) is an extreme negative point. Finally, the area SE(u^(k-1)) ∩ NW(v^(k)) does not contain any negative point, since v^(k) is the ⪯-smallest negative point in SE(u^(k-1)). In a similar way, it can be shown that there is no negative point in NE(α^(k-3), β^(k-2)), and so on. In Figure 4, broken lines illustrate how NE(u^(2)_A, v^(3)_B) can be represented as NE(u^(2)) ∪ NE(v^(3)) ∪ (SE(u^(2)) ∩ NW(v^(3))).

If k is even, the argument is complete, since the north-east area of the staircase is the union of the k/2 north-east regions NE(α^(2j-1), β^(2j)), j = 1, 2, ..., k/2.

Figure 4: Separation of the independent pairs in the previous data set.

If k is odd, then it remains to be shown that there is no negative point with B ≥ β^(1), or equivalently, that there is no negative point in the region with B > v^(1)_B; this region is contained in NE(v^(1)) ∪ NW(v^(1)). Indeed, NE(v^(1)) contains no negative point other than v^(1) itself, since v^(1) is an extreme negative point. Also, NW(v^(1)) contains no negative point, since v^(1) is the first corner negative point.

Let us now discuss why there is no positive point to the south-west of the staircase. Notice first that there is no positive point with B ≤ β^(k) (equivalently, B ≤ v^(k)_B). Indeed, SW(v^(k)) does not contain any positive point, since (S⁺, S⁻) is positively oriented, and SE(v^(k)) does not contain any positive point, since u^(k) (⪯ v^(k)) is the last corner positive point. In order to show that there is no positive point in SW(α^(k-1), β^(k-2)) (equivalently, in SW(u^(k-1)_A, v^(k-2)_B)), we need the following representation:

SW(u^(k-1)_A, v^(k-2)_B) = SW(u^(k-1)) ∪ SW(v^(k-2)) ∪ (SE(v^(k-2)) ∩ NW(u^(k-1))).

The area SW(u^(k-1)) does not contain any positive point other than u^(k-1), since u^(k-1) is an extreme positive point. The area SW(v^(k-2)) does not contain any positive point, since (S⁺, S⁻) is positively oriented. Finally, the area SE(v^(k-2)) ∩ NW(u^(k-1)) does not contain any positive point, since u^(k-1) is the ⪯-smallest positive point in SE(v^(k-2)). In a similar way, it can be shown that there is no positive point in SW(α^(k-3), β^(k-4)), and so on.

If k is odd, the argument is complete, since the south-west area of the staircase is the union of the (k − 1)/2 south-west regions SW(α^(2j), β^(2j-1)) and of the first half-plane defined by B ≤ v^(k)_B.

If k is even, then it remains to be shown that there is no positive point with A ≤ α^(1) (equivalently, A < u^(1)_A). Indeed, the area SW(u^(1)) contains no positive point other than u^(1), since u^(1) is an extreme positive point; and NW(u^(1)) contains no positive point, since u^(1) is the first corner positive point. This completes the analysis of the case u^(k) ⪯ v^(k).

Finally, let us consider the case in which there is no i ≤ k such that the points u^(i) and v^(i) are ⪯-comparable. Let us analyze first the case ku = kv. There are two ways of constructing the separating staircase using k cut points: either β^(k), α^(k-1), β^(k-2), ... or α^(k), β^(k-1), α^(k-2), .... In either case, an analysis similar to the above shows that there is no negative point to the north-east of the staircase, and no positive point to the south-west of it. Let us consider now the case kv = ku + 1 (hence k = ku). In this case v^(k+1) ≠ v^(k), and the staircase is constructed by choosing the cut points β^(k), α^(k-1), β^(k-2), .... Notice that in this case there is no positive point with B ≤ β^(k) (equivalently, B ≤ v^(k)_B). Indeed, SW(v^(k)) contains no positive point (by the assumption that (S⁺, S⁻) is positively oriented), and SE(v^(k)) contains no positive point (since u^(k) is the last corner positive point). The rest of the argument proceeds as above. (Note that the construction of the staircase using the cut points α^(k), β^(k-1), α^(k-2), ... would not work, since the point v^(k+1) would be located to the north-east of that staircase.) The last case, ku = kv + 1 (hence k = kv), is similar to the above, and we omit its proof. Figure 5 shows the structure with four independent corner pairs when ku = 4 and kv = 5. It also shows the staircase constructed as in the proof. The broken lines in the figure indicate how SW(u^(3)_A, v^(2)_B) is represented as SW(u^(3)) ∪ SW(v^(2)) ∪ (SE(v^(2)) ∩ NW(u^(3))). □


Finally, we have arrived at the next main theorem of this subsection.

Theorem 5.3 The problem MIN-2CP(C_P) can be solved in polynomial time.

Proof. Immediate from the construction in the proof of Theorem 5.2, which can obviously be completed in polynomial time. In fact, it is not difficult to implement the algorithm so that the running time is O(d·|S⁺ ∪ S⁻| log |S⁺ ∪ S⁻|). □
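To make the construction above concrete, the following Python sketch illustrates the cut-point selection used in the proofs of Theorems 5.2 and 5.3. It is only an illustration under explicit assumptions: the independent corner pairs (u^(i), v^(i)) are assumed to be already computed (as described earlier in this section), only the ⪯-comparable case treated in detail in the proof is handled, and all identifiers (staircase_cuts, half_gap, and so on) are ours rather than the paper's.

    def staircase_cuts(pairs, points):
        """Simplified sketch of the cut-point selection in the proof of
        Theorem 5.2.  `pairs` is the list of independent corner pairs
        ((u_A, u_B), (v_A, v_B)) in the order used in the text, assumed to
        be precomputed; `points` is S+ u S-.  Only the case in which the
        last pair is comparable is handled; the tie-breaking rules for the
        incomparable case are omitted."""
        def half_gap(c):
            # half of the minimum positive gap between distinct c-coordinates
            vals = sorted({p[c] for p in points})
            return min(b - a for a, b in zip(vals, vals[1:])) / 2.0

        eps_a2, eps_b2 = half_gap(0), half_gap(1)
        alpha = lambda i: pairs[i][0][0] - eps_a2   # vertical cut   alpha^(i+1)
        beta  = lambda i: pairs[i][1][1] + eps_b2   # horizontal cut beta^(i+1)

        (ua, ub), (va, vb) = pairs[-1]
        use_beta = ua <= va and ub >= vb            # v^(k) lies in SE(u^(k))
        cuts = []
        for i in range(len(pairs) - 1, -1, -1):     # pairs k, k-1, ..., 1
            cuts.append(("B", beta(i)) if use_beta else ("A", alpha(i)))
            use_beta = not use_beta                 # alternate down the staircase
        return cuts

Sorting the coordinates once dominates the work, which is consistent with the O(d·|S⁺ ∪ S⁻| log |S⁺ ∪ S⁻|) bound mentioned in the proof of Theorem 5.3.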

5.2 NP-hard cases

5.2.1 NP-hardness of C_ALL for 2-dimensional data sets

We shall show in this and the next subsection that MIN-CP(C) is NP-hard for C_ALL, C_HORN, C_TH and C_QUAD with d = 2, and for C_P with d = 3. The proofs will be obtained by a reduction from the following problem, which is known to be NP-hard [13].

Figure 5: Separation of independent pairs without ⪯-comparability.

MAX-2SAT
Input: A set of m clauses, each containing two literals chosen from {X1, X̄1, X2, X̄2, ..., Xn, X̄n}.
Output: An assignment to the Boolean variables X1, X2, ..., Xn such that the number of satisfied clauses is maximized. (A clause is satisfied if at least one of its literals is given the value 1 by the assignment.)
To prove these results, we have to introduce some terminology, to describe some building
blocks, and to prove some lemmas.
Construction of a grid structure: As a first step of our construction, we prepare a grid structure, as shown in Fig. 6, by placing positive and negative examples along the left-most and the bottom straight edges. It is immediate to see that (1) the critical pairs in this structure are only the contingent pairs of positive and negative examples, and (2) one vertical or horizontal line (i.e., cut point) is required to separate each pair. Clearly, the resulting structure is consistent. It can also be easily seen that, by changing the locations of positive and negative examples, the grid structure can be controlled so that the various other grid structures in the subsequent discussion can be built in a similar manner. For simplicity, we represent each pair of double lines by a single line in most of the subsequent figures. We shall place some additional examples in the cells created by the grid. Since examples in different cells of the grid are already separated by the grid lines, we can restrict our discussion to those critical pairs which are contained in single cells. This will greatly simplify the argument.

In the following discussion, we shall count only those lines which are introduced to separate the additional examples, since all the grid lines are forced and their number can be considered to be fixed.

Figure 6: Construction of a grid structure.

Squares for variables Xj: For each variable Xj of a problem instance of MAX-2SAT, we prepare an m × m square of cells as shown in Fig. 7(a), where the vertical and horizontal lines are the grid lines created in the previous step, and there is a pair of positive and negative examples in each cell.

To separate all critical pairs in each Xj-square, we need at least m vertical and/or horizontal lines of new cut points. It is not difficult to see that there are only two possible ways to realize this with m lines: either m vertical lines or m horizontal lines, as illustrated in Fig. 7(b). We shall interpret the former type as representing the assignment Xj = 1, and the latter one as Xj = 0.
Grid structure for MAX-2SAT: For a given instance of MAX-2SAT, we prepare the grid structure shown in Fig. 8.

(a) Xj-square. (b) Two types of separations by m lines (Xj = 1 and Xj = 0).

Figure 7: Xj-squares and their separation.



Figure 8: Grid structure for MAX-2SAT.



The total size of this structure is (nm + 2m) × (nm + 2m), if we take a cell in the Xj-squares as a unit. The entire area is divided into four sub-areas: the lower-left area contains the n Xj-squares arranged diagonally; the upper-right area, of size 2m × 2m, contains m squares of size 2 × 2, called Ci-squares, which are also arranged diagonally; and the remaining two areas, the upper-left and the lower-right ones, will represent the connection between the clauses Ci and the literals Xj, X̄j. The last two sub-areas will be called the C/X area and the C/X̄ area, respectively, for reasons that will become clear later.

Figure 9: The square for a clause Ci.

Variables, clauses and their connection: Fig. 9 shows a unit cell in an Xj-square and a Ci-square. A Ci-square has four unit areas, and a positive example is placed in the upper-right corner of the upper-right unit, while a negative example is placed in the lower-left corner of the lower-left unit.

To represent the fact that each clause Ci contains two literals, we place critical pairs in the unit areas at the following locations:

(m(j − 1) + i, mn + 2m − 2(i − 1))        if Xj is the first literal in Ci,
(m(j − 1) + i, mn + 2m − 2(i − 1) − 1)    if Xj is the second literal in Ci,
(mn + 2i, mn − m(j − 1) − (i − 1))        if X̄j is the first literal in Ci,
(mn + 2i − 1, mn − m(j − 1) − (i − 1))    if X̄j is the second literal in Ci,

where the location of a unit area is described by the location of its upper-right corner in the A-B plane. We call the unit areas containing these critical pairs connectors, and use names such as Ci-connectors (if they are related to Ci), Xj-connectors (if they are related to Xj), X̄j-connectors (if they are related to X̄j), and C/X (resp., C/X̄) connectors (if they are in the C/X (resp., C/X̄) area). These rules may be best illustrated by an example such as Fig. 10, in which the two clauses C1 = (X1 ∨ X̄2) and C2 = (X2 ∨ X̄3) are represented.
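The connector locations can be computed mechanically from the four formulas above. The following Python fragment is a small illustration; the clause encoding and all identifiers are ours, not part of the paper. Applied to the two clauses of Fig. 10 it yields the four connector cells listed in the final comment.

    def connector_cells(clauses, n, m):
        """Upper-right corners of the connector cells, following the four
        formulas above.  A clause is a pair of literals; a literal is
        (j, True) for X_j and (j, False) for its complement.  Clause index i
        and variable index j are 1-based, as in the text."""
        cells = []
        for i, clause in enumerate(clauses, start=1):
            for pos, (j, positive) in enumerate(clause):  # pos = 0: first literal
                if positive:   # X_j in C_i: connector in the C/X area
                    cells.append((m * (j - 1) + i,
                                  m * n + 2 * m - 2 * (i - 1) - pos))
                else:          # complemented X_j in C_i: connector in the C/X-bar area
                    cells.append((m * n + 2 * i - pos,
                                  m * n - m * (j - 1) - (i - 1)))
        return cells

    # Example of Fig. 10: C1 = (X1 v ~X2), C2 = (X2 v ~X3), with n = 3, m = 2.
    print(connector_cells([[(1, True), (2, False)],
                           [(2, True), (3, False)]], n=3, m=2))
    # [(1, 10), (7, 4), (4, 8), (9, 1)]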

Figure 10: Construction for C1 = (X1 ∨ X̄2) and C2 = (X2 ∨ X̄3) with n = 3 and m = 2.

Separation of Ci-squares and connectors: Now we consider how to separate all critical pairs in Ci-squares and connectors. As each clause Ci contains exactly two literals, there are basically three types of connector locations, as shown in Fig. 11, depending on the polarity of the literals. We consider only type (c) of Fig. 11 in the following discussion, since the other cases can be treated analogously.

Recall that, for each Xj-square, we need m lines, either all vertical (representing Xj = 1) or all horizontal (representing Xj = 0). If Xj = 1, all Xj-connectors are also separated by the lines used for the Xj-square, while if Xj = 0, all X̄j-connectors are separated. Therefore, if the current assignment satisfies a clause Ci = (Xj ∨ X̄l), then we need to separate at most one connector, corresponding to Xj ∈ Ci or X̄l ∈ Ci, and the Ci-square. As shown in Fig. 12(a), (b) or (c), this can be done by introducing only one additional line v or h. However, if Ci is not satisfied by the current assignment, we require two lines v and h to separate them (see Fig. 12(d)).
Summing up these lines, we see that mn + 2m − k lines are sufficient to separate all critical pairs if there is an assignment to X1, X2, ..., Xn such that k of the m clauses C1, C2, ..., Cm are satisfied, where mn lines are used to separate all Xj-squares and 2m − k lines are used to separate the rest.

To show that the converse of the above argument also holds, assume that mn + 2m − k lines can separate all critical pairs. Since each Xj-square requires at least m lines for separation, we can conclude that mn lines are used for them, which determines an assignment to X1, X2, ..., Xn in the manner described above. But these mn lines do not separate the Ci-squares. Analyzing the locations of the connectors, we can then show that (1) each Ci-square requires at least one line for its separation, but such a line is useless for separating other Ci-squares, and (2) a Ci-square and its related connectors require at least two lines if none of the connectors associated with it is separated by the first mn lines. This means that there are at least k clauses Ci, each of which has at least one connector separated by the first mn lines. This implies that the above assignment satisfies at least k clauses.

Consequently, there is an assignment to X1, X2, ..., Xn that satisfies k clauses if and only if all the critical pairs can be separated by mn + 2m − k lines. Since maximizing k is equivalent to minimizing mn + 2m − k, this proves that MAX-2SAT is reduced to MIN-2CP(C_ALL), and establishes the next theorem.

Theorem 5.4 MIN-2CP(C_ALL) is NP-hard. □



Figure 11: Three types of connectors for a clause Ci.



(a) Xj = 1, Xl = 0;  (b) Xj = 1, Xl = 1;  (c) Xj = 0, Xl = 0;  (d) Xj = 0, Xl = 1.

Figure 12: Four cases of separation of Ci = (Xj ∨ X̄l).



5.2.2 NP-hardness of other classes with d = 2 or d = 3

Modifying the proof given in the previous subsection, we show in this subsection that MIN-CP(C) is NP-hard for C_HORN, C_TH and C_QUAD with d = 2, and for C_P with d = 3. Let us first consider C_HORN. From the closure property of F(f) under intersection, and from Lemma 2.3, we can obtain the next lemma for the case of d = 2.

Lemma 5.3 Given a 2-dimensional data set (S⁺, S⁻) on attributes A and B, the pdBf (T, F) resulting from a set of cut points has an extension in C_HORN if and only if the following two conditions hold (see Fig. 2).
(i) (T, F) is consistent.
(ii) For any subset V ⊆ S⁻, the cell containing the point ∧_{v∈V} v does not contain any positive example u ∈ S⁺, where v ∧ v' for v, v' ∈ R² denotes the point (min{v_A, v'_A}, min{v_B, v'_B}).

Proof. Condition (i) is already contained in Lemma 5.1 for C_ALL. Let x(u) denote the Boolean vector obtained from u ∈ S⁺ ∪ S⁻ using the introduced cut points (see Fig. 2(b)). Then, if condition (ii) fails to hold, we have

∧_{v∈V} x(v) = x(u)

for some V ⊆ S⁻ and u ∈ S⁺. Since x(v) ∈ F for all v ∈ V and x(u) ∈ T, and x(v) ≥ x(u) in the above situation, this is equivalent to the condition in Lemma 2.3. These observations prove the lemma. □
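As an illustration of how Lemma 5.3 can be checked computationally, the following Python sketch tests conditions (i) and (ii) for a given 2-dimensional data set and a given set of cut points. All identifiers are ours. Note that in two dimensions the componentwise minimum over any subset V ⊆ S⁻ is already attained by a pair of points of V, so it suffices to examine pairs of negative examples.

    from itertools import combinations
    from bisect import bisect_right

    def cell(w, cuts_a, cuts_b):
        """Cell of a 2-D point w = (w_A, w_B): on each attribute, the number
        of cut points that w meets or exceeds (x_i(w) = 1 iff w >= cut_i)."""
        return (bisect_right(cuts_a, w[0]), bisect_right(cuts_b, w[1]))

    def has_horn_extension(S_pos, S_neg, cuts_a, cuts_b):
        """Check conditions (i) and (ii) of Lemma 5.3 for a 2-D data set."""
        cuts_a, cuts_b = sorted(cuts_a), sorted(cuts_b)
        pos_cells = {cell(u, cuts_a, cuts_b) for u in S_pos}
        neg_cells = {cell(v, cuts_a, cuts_b) for v in S_neg}
        if pos_cells & neg_cells:                      # (i) consistency fails
            return False
        for v, w in combinations(S_neg, 2):            # (ii) AND-closure of S-
            meet = (min(v[0], w[0]), min(v[1], w[1]))  # the point v ^ w
            if cell(meet, cuts_a, cuts_b) in pos_cells:
                return False
        return True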
Now let us examine all the new points generated by the operations ∧_{v∈V} v for V ⊆ S⁻ in the construction used in the proof of Theorem 5.4. First, the negative examples used to build the grid (see Fig. 6) can generate only one new point, at the very lower-left corner, which violates neither condition (i) nor condition (ii). To see the effect of the other negative examples, we illustrate in Fig. 13 all the points generated by such operations (marked in the figure) for the case of C1 = (X1 ∨ X̄2) and C2 = (X2 ∨ X̄3) corresponding to Fig. 10. It is straightforward to observe that conditions (i) and (ii) hold if the set of cut points separates all critical pairs in Fig. 10. This can be generalized to the other cases, though we omit the details. In other words, the pdBf (T, F) obtained in the construction for Theorem 5.4 has an extension in C_HORN, which establishes the next theorem.

Theorem 5.5 MIN-2CP(C_HORN) is NP-hard. □



Figure 13: The points generated by the operations ∧_{v∈V} v, V ⊆ S⁻.

Now we turn to the class C_TH. For a given 2-dimensional data set (S⁺, S⁻), assume that a consistent set of cut points is at hand. Call a cell positive (resp., negative) if it contains a positive example (resp., a negative example). The property of consistency implies that there is no cell which is positive and negative at the same time. A sequence of cells Q = q1, q2, ..., qk is called an alternating cycle if
(i) q1 = qk,
(ii) positive and negative cells alternate in Q, and
(iii) the edges connecting qi and qi+1, for i = 1, 2, ..., k − 1, alternate between vertical and horizontal.
An example of an alternating cycle is shown in Fig. 14.

Lemma 5.4 Given a 2-dimensional data set (S⁺, S⁻), the pdBf (T, F) resulting from a set of cut points has an extension in C_TH if and only if the following two conditions hold.
(i) (T, F) is consistent.
(ii) There is no alternating cycle.

Figure 14: An alternating cycle.

Proof. Let the sets of cut points in coordinates A and B be α_1, α_2, ..., α_p and β_1, β_2, ..., β_r, respectively, and let x_A(u) = (x_{A1}(u), x_{A2}(u), ..., x_{Ap}(u)) and x_B(u) = (x_{B1}(u), x_{B2}(u), ..., x_{Br}(u)), where x_{Ai}(u) = 1 if u_A ≥ α_i and 0 otherwise; similarly for x_{Bj}. By the asummability condition mentioned in Section 2.2, (T, F) has an extension in C_TH if and only if there is no set of coefficients λ_u ≥ 0, u ∈ S⁺, and λ_v ≥ 0, v ∈ S⁻, such that

Σ_{u∈S⁺} λ_u = Σ_{v∈S⁻} λ_v > 0,
Σ_{u∈S⁺} λ_u (x_A(u), x_B(u)) = Σ_{v∈S⁻} λ_v (x_A(v), x_B(v)).    (18)

We first prove the only-if part. Condition (i) follows from Lemma 5.1. To show condition (ii), assume that an alternating cycle Q = q1, q2, ..., qk exists, and let U⁺ (resp., U⁻) be the set of positive examples (resp., negative examples) obtained by selecting exactly one example from each participating cell. Then

|U⁺| = |U⁻|,
Σ_{u∈U⁺} (x_A(u), x_B(u)) = Σ_{v∈U⁻} (x_A(v), x_B(v))

hold, since Q is alternating. Therefore, by (18), (T, F) does not have an extension in C_TH, a contradiction.

To prove the if-part, assume that (T, F) does not have an extension in C_TH. If it does not have an extension in C_ALL, then condition (i) is violated. If it has an extension in C_ALL but not in C_TH, there are λ_u ≥ 0, u ∈ S⁺, and λ_v ≥ 0, v ∈ S⁻, satisfying (18). We point out that, by the definition of binarization as explained in Section 2.2, x_A(u) is equal to one of the p-dimensional vectors (0···00), (0···01), (0···011), ..., (1···111), and x_B(v) is equal to one of the r-dimensional vectors (0···00), (0···01), (0···011), ..., (1···111). Note that the p vectors (resp., r vectors) (0···01), (0···011), ..., (1···111) among these form a basis of the p-dimensional (resp., r-dimensional) linear space, and that (00···00) does not contribute to the sum in (18). Linear algebra then tells us that

Σ_{u∈S⁺, x_A(u)=a} λ_u x_A(u) = Σ_{v∈S⁻, x_A(v)=a} λ_v x_A(v)    (19)

holds for every a = (0···00), (0···01), (0···011), ..., (1···111), and that similar equations hold for x_B.

Now let us start with a positive cell containing some u ∈ S⁺ with λ_u > 0. Then, by (19), there must be an example v ∈ S⁻ such that x_A(v) = x_A(u) (but x_B(v) ≠ x_B(u)) and λ_v > 0. This edge from the cell of u to the cell of v is horizontal. We then apply a similar argument to the negative cell of v, and find a positive cell containing u' ∈ S⁺ such that x_B(u') = x_B(v) and λ_{u'} > 0. This implies that there is a vertical edge from the negative cell of v to the positive cell of u'. Repeating this, we grow an alternating path; but, since the number of points with non-zero coefficients in relation (18) is finite, a cell must be repeated at some step, thus forming an alternating cycle. We conclude, therefore, that there is an alternating cycle, and condition (ii) is violated. □
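For completeness, condition (18) can also be tested directly by linear programming: the binarized pdBf has an extension in C_TH exactly when the binary vectors of T and F are linearly separable. The sketch below relies on scipy's linprog (an assumption of ours, not a tool used in the paper); the margin-1 formulation and all identifiers are ours.

    import numpy as np
    from scipy.optimize import linprog

    def has_threshold_extension(T, F):
        """Feasibility check: find weights w and a threshold t with
        w.x >= t + 1 for every x in T and w.x <= t - 1 for every x in F.
        Feasible  <=>  (T, F) has an extension in C_TH  <=>  no coefficients
        as in (18) exist."""
        T, F = np.atleast_2d(T), np.atleast_2d(F)
        n = T.shape[1]
        # variables (w_1, ..., w_n, t); all constraints written as A_ub x <= b_ub
        A_ub = np.vstack([np.hstack([-T, np.ones((len(T), 1))]),    # -w.x + t <= -1
                          np.hstack([ F, -np.ones((len(F), 1))])])  #  w.x - t <= -1
        b_ub = -np.ones(len(T) + len(F))
        res = linprog(np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (n + 1), method="highs")
        return res.status == 0   # status 0: feasible; status 2: infeasible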
Next we show that any set of cut points producing an extension in C_ALL in the construction used in the proof of Theorem 5.4 satisfies the above two conditions. Condition (i) is immediate from Lemma 5.1. To show that condition (ii) is also satisfied, we proceed step by step.

1. If there is an alternating cycle Q, it does not use the examples introduced to build the grid (as in Fig. 6). Indeed, if such a Q existed, it would have to use a horizontal or vertical edge from one of the positive examples among them, perpendicular to the array of the examples; but this is impossible, since there is no other example in the row or column of such an edge.

2. The alternating cycle Q cannot use a cell in the Xj-squares. To prove this, assume without loss of generality that an Xj-square is separated by m horizontal lines. Then any row of cells of this Xj-square contains only positive examples or only negative examples, implying that the row cannot produce a horizontal edge in Q. (Note that, in this case, there may be a column containing both positive and negative examples.)

3. The alternating cycle Q cannot use a cell in the C/X or C/X̄ area. This is because Q would then contain an edge connecting that cell to a cell in some Xj-square, which was ruled out in 2.

4. As a result of 1–3, the alternating cycle Q must consist only of cells in the Ci-squares. However, this is obviously impossible, since the cells in this area are diagonally arranged.

This proves that the resulting pdBf (T, F) has an extension in C_TH, and establishes the next theorem.
Theorem 5.6 MIN-2CP(C_TH) is NP-hard. □

Our next target is C_QUAD. First note that each region represented in the 2-dimensional plane by a quadratic term (a product of two literals of the binarized variables, such as X_{Ai}X_{Aj} or X_{Ai}X_{Bj}, possibly complemented) is one of those illustrated in Fig. 15. It is also allowed to use the regions of linear terms, such as those illustrated in Fig. 1. Call such a region a positive quadratic region (PQR) if it contains at least one positive example but no negative example. We say that such a PQR covers the positive examples in it. Then the next lemma tells us when an extension in C_QUAD exists.

Figure 15: Regions represented by quadratic terms.

Lemma 5.5 Given a 2-dimensional data set (S⁺, S⁻), the pdBf (T, F) resulting from a set of cut points has an extension in C_QUAD if and only if the following two conditions hold.
(i) (T, F) is consistent.
(ii) For any positive example, there is a PQR that covers it.

Proof. Immediate from the definitions and Lemma 5.1. □
To prove the NP-hardness of MIN-2CP(C_QUAD), we introduce a slightly modified version of MAX-2SAT.

MAX-2COMPSAT
Input: A set of m clauses, each containing two literals chosen from {X1, X̄1, X2, X̄2, ..., Xn, X̄n}, such that one is uncomplemented and the other is complemented.
Output: An assignment to the Boolean variables X1, X2, ..., Xn such that the number of completely satisfied clauses is maximized. (A clause is completely satisfied if both of its literals are given the value 1 by the assignment.)

This problem is known to be NP-hard, because the NP-hard problem MAX-CUT for graphs can easily be reduced to MAX-2COMPSAT.
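One way to see this (a sketch of ours; the paper leaves the reduction implicit) is to replace every edge {u, v} of the graph by the two clauses (X_u ∨ X̄_v) and (X_v ∨ X̄_u): an assignment completely satisfies exactly one of the two clauses when the edge is cut and neither when it is not, so the maximum number of completely satisfied clauses equals the maximum cut.

    def max2compsat_from_maxcut(edges):
        """Clauses of a MAX-2COMPSAT instance built from the edge list of a
        graph; a literal is encoded as (vertex, True) for X_v and
        (vertex, False) for its complement.  (One possible reduction,
        sketched by us, not taken from the paper.)"""
        clauses = []
        for u, v in edges:
            clauses.append([(u, True), (v, False)])   # (X_u v ~X_v)
            clauses.append([(v, True), (u, False)])   # (X_v v ~X_u)
        return clauses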
Now, given an instance of MAX-2COMPSAT, we modify slightly the construction used in the proof of Theorem 5.4. We explain this modification using the example of C1 = (X1 ∨ X̄2) and C2 = (X2 ∨ X̄3), illustrated in Fig. 16. Here the Xj-squares and the connectors in the C/X and C/X̄ areas have the same structure as before. However, the locations of the connectors are changed: the upper-right corner of the connector corresponding to Xj ∈ Ci is located at (m(j − 1) + i, mn + 2m − 2(i − 1)), and the connector corresponding to X̄j ∈ Ci is located at (mn + 2i, mn − m(j − 1) − (i − 1)). In addition, each Ci-square is substantially different from the previous one, and is shown in Fig. 17.

To separate the Xj-squares, we again use either all vertical lines or all horizontal lines, representing Xj = 1 or Xj = 0, respectively. However, to have PQRs that contain all positive examples in the Xj-squares, while avoiding the negative examples used to build the grid, we need double lines for each column or row. This is illustrated in Fig. 18. It is also easily seen that if Xj = 1 (resp., Xj = 0) is chosen, all Xj-connectors (resp., X̄j-connectors) are also separated by such double lines.

Finally, we consider how to separate the Ci-squares and provide PQRs for the positive examples in them. For a clause Ci = (Xj ∨ X̄l), this is illustrated in Fig. 19 for the three cases in which none, one, and two of the literals of Ci are satisfied by the assignment to X1, X2, ..., Xn (and the corresponding cells denoted Xj and Xl in the C/X and C/X̄ regions are separated as discussed above).

Figure 16: Construction for C_QUAD with C1 = (X1 ∨ X̄2) and C2 = (X2 ∨ X̄3).

Figure 17: Ci-square for class C_QUAD.

We can see that each case exhibits a minimum number of lines: six lines are required if at most one of the two literals of Ci is satisfied, but four lines are necessary and sufficient if both are satisfied.

Therefore, the minimum number of lines such that the resulting pdBf (T, F) has an extension in C_QUAD provides a solution to MAX-2COMPSAT. This shows that MAX-2COMPSAT is reduced to MIN-2CP(C_QUAD), and establishes the next theorem.
Theorem 5.7 MIN-2CP(C_QUAD) is NP-hard. □

As the final case in this section, we deal with MIN-3CP(C_P) in the 3-dimensional space with coordinates A, B and C. We call the plane obtained by fixing A (resp., B, C) to a constant an A-plane (resp., B-plane, C-plane).

In this case, we build the required grid not in the C-plane with C = 0, but in the triangular region T0 that lies in the plane defined by the three points (K, 0, 0), (0, K, 0), (0, 0, K), which is illustrated in Fig. 20. To draw the lines of the grid, we place positive and negative examples in the C-plane with C = 0 along the straight line segment connecting (K, 0, 0) and (0, K, 0), in the manner of Fig. 21. Although we omit the details, it is clear that the desired grid structure can be built in the square area indicated by bold borders.

(a) Xj = 1; (b) Xj = 0.

Figure 18: Separation of an Xj-square.

Also note that all cells created in this way are consistent with respect to the examples used for building the grid, and furthermore, no pair of a positive example u and a negative example v satisfies v ≥ u. The lines in Fig. 21 represent A-planes and B-planes perpendicular to the C-plane, which intersect the triangle T0 of Fig. 20 and project the grid structure of Fig. 21 onto T0.

Then, using the grid built on T0, we construct on T0 the structure used in the proof of Theorem 5.4. Fig. 22 shows how this is realized for the example case of Fig. 10, where shaded regions represent the Xj-squares, the Ci-squares and the connectors in the C/X and C/X̄ areas. In this construction, however, we shrink the size of each critical pair (i.e., we place the positive and the negative example of the pair closer together), as illustrated in Fig. 23. Although not exhibited, the critical pairs in the connectors are also modified in the same manner. We emphasize that all these examples are placed on T0.

Next, for every negative example in the construction (including those used to build the grid), we introduce a new positive example and place it right above the corresponding negative example (in the direction of increasing C-coordinate). This necessitates the introduction of a C-plane to separate each critical pair created in this process. These C-planes intersect T0 and produce horizontal lines, which are illustrated in Figs. 22 and 23 by broken lines. Note that the C-coordinate values decrease if we move upward in these figures, and that these broken lines separate none of the critical pairs which were already present before this modification.

(a) None of Xj and Xl is separated; (b) one of Xj and Xl (say Xl) is separated; (c) both Xj and Xl are separated.

Figure 19: Covering the Ci-square together with the associated connectors (Ci = (Xj ∨ X̄l)).

Figure 20: Triangle T0 in the 3-dimensional space.

Figure 21: Building the grid structure.



Figure 22: Construction of the structure of Fig. 10 on T0.

(a) Xj-square; (b) Ci-square (m = 2).

Figure 23: Locations of critical pairs.



Now it is not difficult to observe that the structure constructed on T0 in the 3-dimensional space is the same as the one constructed for the proof of Theorem 5.4 in the 2-dimensional space. Therefore, the previous proof can also be applied to this case in exactly the same manner, and we conclude that minimizing the number of lines needed to separate all critical pairs is equivalent to solving the given instance of MAX-2SAT.

Finally, we point out that no pair of a positive example u and a negative example v in this construction satisfies v ≥ u in R³, because they are located on T0 (except for those used to build the grid in Fig. 21, and those placed above negative examples, for which the proof is also straightforward). If a set of cut points is introduced to separate all the remaining critical pairs, then it can be shown from the construction that the Boolean vectors x(u) and x(v) of any pair of a positive example u and a negative example v do not satisfy x(v) ≥ x(u). (The cut points for the broken lines in Figures 22 and 23 were introduced to guarantee this property.) Then Lemma 2.2 tells us that the resulting pdBf (T, F) has an extension in C_P, which establishes the next theorem.

Theorem 5.8 MIN-3CP(C_P) is NP-hard. □

References
[1] D. Angluin, Queries and concept learning, Machine Learning, 2 (1988) 319-342.
[2] E. Balas and A. Ho, Set covering algorithms using cutting planes, heuristics, and sub-
gradient optimization: A computational study, Mathematical Programming, 12 (1980)
37-60.
[3] R. Bellman, Introduction to Matrix Analysis , McGraw-Hill, New York, 1960.
[4] J. C. Bioch and T. Ibaraki, Complexity of identification and dualization of positive Boolean functions, Information and Computation, 123 (1995) 50-63.
[5] E. Boros, V. Gurvich, P.L. Hammer, T. Ibaraki, and A. Kogan, Decompositions of partially defined Boolean functions, Discrete Applied Mathematics, 62 (1995) 51-75.
[6] E. Boros, P. L. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz and I. Muchnik, An imple-
mentation of logical analysis of data, Technical Report RRR 22-96, RUTCOR, Rutgers
University, 1996.

[7] E. Boros, T. Ibaraki and K. Makino, Error-free and best-fit extensions of partially defined Boolean functions, Technical Report RRR 14-95, RUTCOR, Rutgers University, 1995.
[8] E. Boros, T. Ibaraki and K. Makino, Extensions of partially defined Boolean functions with missing data, Technical Report RRR 06-96, RUTCOR, Rutgers University, 1996.
[9] P. Clark and T. Niblett, The CN2 induction algorithm, Machine Learning, 3 (1989)
261-283.
[10] Y. Crama, P. L. Hammer and T. Ibaraki, Cause-effect relationships and partially defined Boolean functions, Annals of Operations Research, 16 (1988) 299-326.
[11] R. Dechter and J. Pearl, Structure identification in relational data, Artificial Intelligence, 58 (1992) 237-270.
[12] O. Ekin, P.L. Hammer and A. Kogan, On connected Boolean functions, Technical Re-
port RRR 36-96, RUTCOR, Rutgers University, 1996.
[13] M. R. Garey and D. S. Johnson, Computers and Intractability, Freeman, New York,
1979.
[14] P. L. Hammer, Partially defined Boolean functions and cause-effect relationships, in International Conference on Multi-Attribute Decision Making Via OR-Based Expert Systems, University of Passau, Passau, Germany, April 1986.
[15] A. B. Hammer, P. L. Hammer and I. Muchnik, Logical analysis of Chinese productivity
patterns, RUTCOR Report RR 1-96, RUTCOR, Rutgers University, 1996.
[16] S. J. Hong, R-MINI: An iterative approach for generating minimal rules from examples,
IEEE Trans. on Knowledge and Data Engineering, to appear.
[17] O.L. Mangasarian, Mathematical programming in machine learning, In G. Di Pillo and
F. Giannessi, eds., Nonlinear Optimization and Applications, pp. 283-295, New York,
1996 (Plenum Publishing).
[18] I. Muchnik and L. Yampolsky, A pilot study of the concurrent validity of logical analysis
of data, as applied to two Beck inventories. Technical Report RTR 02-95, RUTCOR,
Rutgers University, 1995.
[19] S. Muroga, Threshold Logic and Its Applications, Wiley-Interscience, 1971.

[20] P. M. Murphy and D. W. Aha, UCI repository of machine learning databases: Machine
readable data repository, Department of Computer Science, University of California,
Irvine, 1994.
[21] J. R. Quinlan, Induction of decision trees, Machine Learning, 1 (1986) 81-106.
[22] J. R. Quinlan and R. Rivest, Inferring decision trees using minimum description length
principle, Information and Computation, 80 (1989) 227-248.
[23] L. G. Valiant, A theory of the learnable, Communications of the ACM, 27 (1984) 1134-
1142.

