p24 CreatingBalancedAndConnectedCl

J Syst Sci Syst Eng (Dec 2010) 19(4): 453-480 ISSN: 1004-3756 (Paper) 1861-9576 (Online)
DOI: 10.1007/s11518-010-5150-x CN11-2983/N
CREATING BALANCED AND CONNECTED CLUSTERS TO IMPROVE

SERVICE DELIVERY ROUTES IN LOGISTICS PLANNING
Buyang CAO1 Fred GLOVER2

1
Esri, Inc., 380 New York Street, CA 92373, USA
bcao@esri.com
2
OptTek Systems, Inc., 2241 17th Street, Boulder, CO 80302, USA
glover@opttek.com ( )
Abstract
A challenging problem in real world logistics applications consists in planning service territories
for customer deliveries, in contexts where customers must be clustered into groups that satisfy various
conditions such as balance and connectivity. In this paper we propose new algorithms for producing
such clusters based upon special procedures for exploiting Thiessen polygons. Our methods are able to
handle multiple criteria for balancing the clusters, such as the number of customers in each cluster, the
service revenue in each cluster, or the delivery/pickup quantity in each cluster. Computational results
demonstrate the efficacy of our new procedures, which are able to assist users to plan service personal
service territories and vehicle routes more efficiently.
Keywords: Clustering, K-means, logistics, routing, Thiessen polygon
1. Introduction VRP/TW is an NP hard problem, and it has

In practical applications of routing and attracted a lot of research interest. A
distribution, a service or logistics provider comprehensive survey of VRP/TW and of
continually faces the challenge of providing the various algorithms for solving this type of
best service to customers. Typically this problem appears in Braysy & Gendreau (2005).
challenge is cast in the framework of Because of the complexity of a VRP/TW, it
determining the most effective way to pick up is very difficult if not impossible to achieve a
and deliver freight or services from customers to satisfactory solution to many real world
other customers within a specified area, subject instances of the problem within a reasonable
to limitations on resources such as service computational time. It is not practical to solve a
personnel and vehicles (drivers). This VRP/TW to optimality; especially under the
optimization problem can advantageously be frequently encountered conditions where
modeled as a vehicle routing problem with time response times (the time window within which
windows (VRP/TW), yielding a formulation that an answer is needed) is a critical concern. A
has been studied by many researchers. A variety of different solution strategies have been
© Systems Engineering Society of China & Springer-Verlag Berlin Heidelberg 2010

Cao and Glover: Creating Balanced and Connected Clusters to Improve Service Delivery Routes
454 J Syst Sci Syst Eng
devised in order to solve the VRP/TW more the business practice; in particular,
effectively. Most of these algorithms employ a customary rules require that a person who
cluster-first and route-second strategy in order services a given sub-area must not service a
to address real application problems effectively. customer inside of another sub-area. The
Another commonly encountered challenge in creation of connected sub-areas has another
solving these problems arises from the fact that, advantage. If the sub-areas have clean
due to the limited availability of resources, a boundaries, then it will be much easier for
service or logistics provider is generally unable the system user to adjust the sub-areas
to service all customers within a service territory. created by the computer in order to meet
Therefore, it can become highly desirable to special business needs that had not
divide the entire area into several subareas previously been envisioned or that are
according to the business practices for providing difficult to embody within the model
services to these subareas. For instance, a given framework.
subarea may be serviced on certain days of the z The sub-areas should be balanced. Balance
week or by a certain sets of drivers (vehicles). criteria can be expressed as bounds on

The problem then arises: how to create these differences in the numbers of customers in
subareas to meet the business logic. different sub-areas, or in the service value of
The problem we address has the following different sub-areas. These balance criteria
characteristics. For a given service territory, are motivated, for example, by the desire to
there are thousands of customers, and each reduce fluctuations in daily service revenue
customer has an associated service value and in the requirements for personnel to
(delivery/pick up quantity, service revenue, or service the sub-areas.
service time). The service resource is limited. In the following discussion we will use the
Therefore, not all customers can be serviced on term cluster to represent a sub-area. Expressed
a single day. We are required to develop a more generally, the problem consists in
decision-support system to help the user create determining how to group or cluster the
sub-areas within this territory so that each customers, which may abstractly be treated as
sub-area can be serviced on a single day. The points within the service territory, while
following criteria must be considered while honoring the rules imposed by business practice
building these sub-areas: such as restrictions on total service capacity
z Each sub-area should be as compact as (total working hours combined with vehicle
possible so that the expected service cost capacities including weights and volumes),
and (especially) the travel time and distance balanced work load (service levels or
to service the sub-areas will be minimal. delivery/pickup quantities), non-overlapped
z Each sub-area must be connected, which subareas, etc. A crucial goal in creating such a
means no customer of a sub-area can lie clustering problem is to form well separated
within the geographic boundaries of another point groups (clusters) such that the average
sub-area. This requirement is imposed by travel time or distance within each cluster is
J Syst Sci Syst Eng 455
minimal, subject to assuring that the real total between two locations if their straight line link
travel time or distance to service a cluster will was intersected by geographic obstacles such as
be minimum as well. rivers, mountains, highways, etc. Fan
Clustering problems, particularly these represented obstacles by superimposing convex
having a mix of both spatial and non-spatial or concave polygons upon the underlying
attributes, can be found in variety of applications. geographic data. By this means it was
General clustering algorithms can be found in anticipated that distances between locations
various applications such as logistics industry, would be able to take geographic features into
imagery processing, data mining etc. For account more realistically.
instance, Humair & Willems (2006) proposed an GIS (geographic information system)
algorithm to identify clusters (or clusters of technology is often utilized in such applications
commonality) in supply chain networks, whose due to its ability to supply vital information such
purpose is to provide more efficient structure to as geographic feature data, street network
solve certain optimization problems such as information, speed limits on street segment and
optimizing safety stock levels and locations. lengths of street segments, and to keep track of
Barreto et al. (2007) presented clustering restrictions such as vehicle heights, weights, and
methodology for solving capacitated volumes that need to be considered by
location-routing problems to find locations to set optimization procedures. For example,
up warehouses or service centers in order to Estivill-Castro & Lee (2001) combine data
solve the resultant VRP problems more mining and GIS as a means to consider
effectively. geographic obstacles such as hills or rivers. The
In the context of multi-depot vehicle routing authors devise a clustering algorithm utilizing a
problems with time windows, Dondo & Cerda Voronoi diagram to set up a topological structure
(2007) proposed a cluster-based approach to for a given set of points as a basis for retrieving
give a basis for generating tighter routes. In this spatial information related to various definitions
procedure, all customer locations are clustered of neighbors, and report their method to be
and the clusters in turn are assigned to vehicles. successful for handling the presence of obstacles.
The approach accounts for vehicle capacities, Kwon et al. (2007) proposed a Tabu Search
time windows and idle time. The goal is to build algorithm to solve capacitated vehicle routing
a cluster yielding a low average travel distance problems, using a Voronoi diagram to narrow the
for each location. search space during the solution process.
Fan (2009) models a site selection problem However, the results did not find that the
as a point clustering problem (and proposes a contribution of the Voronoi diagram to
hybrid algorithm combining K-means clustering narrowing the search space was sufficient to beat
and simulated annealing to solve it. In his the existing benchmark results.
method, the distance measurement is based on Within a practical setting, Blakeley et al.
Euclidean distances. However, he modified this (2001) report the application of GIS and
measure to incorporate an “obstructed distance” optimization technologies in technician
scheduling and dispatching for Schindler problem is to produce a minimum weight

Elevator Corporation. The application system solution that accounts for the vertex weight
improved Schindler service routes, increased balancing constraint. The authors developed an
profitability, and saved over $1million annually. algorithm called OPOSSUM (Optimal
Zhang et al. (2007) use related analysis to study Partitioning of Space Similarities Using Metis
the problem of creating geographic (Karypis & Kumar 1998)) using the Metis
non-overlapping clusters, where geographic approach proposed by Karypis and Kumar as the
areas containing special features of interest are multi-objective graph partitioning engine.
defined in order to facilitate decision making; Huff (1963) proposed an interesting model
for instance, identifying two subareas one of for retail trade area analysis that determines how
which contains low-income households while customers within a region should be assigned to
the other contains high-income households, in a given set of service centers (e.g., stores).
order to establish different policies for these Huff’s model specifies the probability of
subareas. In this case the authors develop an assigning a customer to a store as follows:
algorithm utilizing a spatial distance measure
Aαj Dij− β
capable of considering non-spatial attributes and Pij = n
geographic non-overlapping constraints ∑ Aαj Dij− β
simultaneously. j =1
Strehl & Ghosh (2002) developed an where:

interesting algorithm to address real-life z Aj is a measure of attractiveness of store j
data-mining problem found in retail-industry (such as square footage).

and some web applications, in which data reside z Dij is the distance from i to j.
in a very high dimensional space. Their z α is an attractiveness parameter estimated
approach introduces a similarity relationship from empirical observations.

defined on each pair of data samples. After z β is the distance decay parameter estimated
similarities are computed, the problem is from empirical observations.

transformed to one over the similarity domain z n is the total number of stores.
and the original high-dimensional space is no For a store that has a larger attractiveness
longer needed. The goal is to cluster data measure and lies closer to a customer, the
samples into k groups so that data samples for preceding formulation assures that the customer
different clusters have similar characteristics. will have a larger probability of being assigned
The authors formulated this problem as a to the store. However, if a store is far away from
vertex-weighted graph partitioning problem, a customer, then the power of attracting this
where each vertex (data sample) is assigned a customer fades. Therefore, larger stores that are
weight representing its importance and each pair closer to the area where most customers are
of vertices is connected by an undirected edge located will generally have a higher probability
whose weight is determined by their similarity of getting customers.
measurement. The objective of this partitioning It is important to note, however, that none of
the foregoing applications includes balanced clustering problems. Computational

consideration of designing clusters capable of results documenting the effectiveness of our new
meeting balancing criteria, as embodied in the procedures are presented in section 4. Finally,
need to assure balanced service levels and we summarize our findings and present some
delivery/pickup quantities and non-overlapped conclusions in section 5.
clusters simultaneously. The present work
undertakes to address these crucial concerns. 2. Problem Description
In particular, we face the challenge of
developing optimization algorithms capable of 2.1 Conventions and Terminology
exploiting GIS technology to create balanced We formulate our problem by reference to a
and connected clusters, where each cluster can graph G = (N, E) where N is a set of nodes that
be treated as a service territory. We especially are to be clustered, and E is a set of edges
seek to produce a flexible design that will permit joining pairs of these nodes. In our present
the criteria for balancing the clusters to embrace routing application, the sets N and E are derived
a variety of options, such as those based on the from Thiessen polygons. Figure 1 displays an
number of customers in each cluster, the service illustrative set of points and Figure 2 identifies
revenue in each cluster, or the delivery/pickup the Thiessen polygons based upon these points.
quantity in each cluster.
In sum, our methodology undertakes to
provide the following contributions to handling
service territory planning and design issues in
the logistics and service industry:
z A new framework that enables users in
logistics and service industry to plan and

design their service areas more efficiently.
z A capability to address issues of creating
geographically non-overlapped and balanced

clusters simultaneously. Figure 1 Customer locations (points)
z Methodology to effectively maintain the
connectivity of clusters during the cluster The motivation for creating the Thiessen
creation and improvement processes by polygons is as follows. We imagine each
drawing on topologic relationships derived polygon to be represented by a node of the graph
from Thiessen (Voronoi) Polygons. G. If two polygons touch (i.e., share a common
z Flexibility and scope to handle multiple boundary or at least one vertex of a boundary),
clustering objectives. then the two nodes corresponding to these
In the following exposition, Section 2 polygons are linked by creating an edge of the
describes the clustering problem in more detail, graph that joins them. Nodes joined by edges are
and section 3 presents the algorithms to solve called adjacent.
predefined number of clusters to be created.

There are two important variations to this
objective.
Variation 1: According to a specified level
of priority, create Ω in relation to a distance
measure so that each cluster C ∈ Ω is composed
of nodes that are closer to other nodes of the
same cluster than they are to nodes of other
Figure 2 Thiessen polygons for the given point set
clusters.)
The adjacency relationship is used to Variation 2: Create Ω in relation to a
overcome some shortcomings of common specified center point so that each cluster C ∈ Ω
clustering algorithms such as K-means that are lies in its own region relative to this point, where
susceptible to generating clusters that overlap, the regions resemble slices of a pie passing
i.e., where some nodes of a cluster are through this center.
completely embedded in other clusters. Zhang et The motivation underlying the Problem
al. (2007) listed a sufficient condition to Objective and the two preceding variations is to
guarantee non-overlapping clusters, however, produce clusters that give a foundation for
this condition usually is not met in real world creating routes that can be served by different
settings. Therefore, extra effort is needed to vehicles and on different days. The second
ensure the resultant clusters are connected, i.e., variation refers to routes that all start from the
they are not overlapping. same center point. Each of the variations is
In our problem, each node p ∈ N contains a assumed to be compatible with the initially
specified capacity, or weight, w(p) and the stated objective and with each other.
weight w(C) of a cluster C is defined by w(C) = The following requirement also has a critical
∑(w(p): p ∈ C). We allow for the possibility that relevance to routing problems.
a node may have more than one capacity or
weight. A collection Ω of clusters is understood 2.3 Connectivity Requirement
to be complete (and feasible) if its nodes create a Each cluster must be connected, i.e., every
partition of the set N, i.e., the node sets of the two distinct nodes in the cluster must be joined
clusters C ∈ Ω are pairwise disjoint and their by a path formed of edges in E (and their
union is N. associated nodes in N).
For the following we represent the node set
2.2 Problem Objective N by writing N = {pi: i ∈ I} where I = {1, … , n}
The goal of our problem, roughly stated, is to is the index set for the nodes pi in N.
create a clustering set Ω so that each cluster C ∈
Ω contains approximately the same number of 3. Solution Methodology
nodes and each cluster has approximately the K-means (MacQueen 1967) is one of the
same weight, where |Ω| = k and k is the simplest unsupervised learning algorithms for
solving the point clustering problem. The show that the outcome obtained by the neural
algorithm attempts to minimize the value of a network is generally better than that obtained by
squared error function defined as: the original K-means approach. It should be
pointed out that this finding contradicts some
V = ∑∑ (| xi − X k 2
+ yi − Yk |2 )
published studies claiming that neural networks
where |xi – Xk| 2+ |yi – Yk|2 is a chosen distance are not superior to the original K-means
measure between a node pi, i ∈ I, represented by approach.
its xi and yi coordinates and the centroid of In the following algorithm description
cluster Ck if this node is contained in this cluster sections, we use Euclidean distance as the
(Here we consider points lying on a 2-dimension distance measurement in the algorithms.
plane, where each node’s location is presented However, this is not a limitation. Our algorithms
by its x and y coordinates). can readily use alternative distance measures,
The K-means algorithm consists of the which may be employed to address geographic
following steps: 1) Start with K randomly obstacles such as rivers, bays, and mountains.
selected points as the centroids for all clusters Relevant examples include:
respectively; 2) Assign each node to the cluster z the distance calculation method suggested by
whose centroid is closest to the node; 3) Fan (2009), or

Re-compute the centroids for all clusters that z the distance computed by using GIS system
receive new nodes; 4) Repeat steps 2) and 3) that utilizes the underlying street network.
until no centroid will be changed. The procedure When a distance between two locations is
creates the clustering result in which the computed based upon the real street network,
objective function V cannot be improved further any geographic obstacles that may be present are
by performing steps 2) and 3). automatically taken into account while
The K-means algorithm is relatively easy to computing the distance. The distance computed
implement and reasonably effective for many in this manner has an important effect on the
types of applications. However, in its original clusters produced by the clustering algorithm
form, the method is unable to handle the goals of proposed in this paper. The following example
creating balanced and non-overlapping clusters. demonstrates this. Figure 3 displays the stops to
In the following sections we propose algorithms be clustered and the area where these stops are
that can address these issues that cannot be located. As seen from the picture, a river divides
resolved by applying the K-means algorithm the region into two sections. Figure 4 depicts the
directly. result obtained by utilizing Euclidean distance
The limitations of the original K-means and Figure 5 shows the result obtained by the
algorithm are demonstrated by Hruschka & same solver based upon the real street distance.
Natter (1999), who present an interesting The clusters based on Euclidean distance
comparison of performance between K-means create some undesirable outcomes, including
and a feedforward neural network in finding situations where clusters cross the river, and
market segmentation (structure). Their results where some stops cannot be reached at all. Such
outcomes are to be expected because the

Euclidean distance doesn’t consider the
underlying geographic characteristics.
Figure 5 Result obtained based on real street

distance
Figure 3 Stops to be clustered 3.1 Modified K-means Algorithm

We first describe a preliminary version of a
balanced and connected cluster algorithm called
the modified K-means algorithm for creating the
clustering set Ω. The purpose of this method is
to build an initial Ω which is then enhanced by
an improving procedure described in the
subsequent sections. The modified K-means
algorithm is designed to insure that the
connectivity requirement will be fulfilled, which
is not insured by the ordinary K-means
Figure 4 Result obtained based upon Euclidean
algorithm. The modified K-means is also able to
distance
quickly create an initial solution for our more
By contrast, Figure 5 demonstrates clearly advanced procedure discussed below.
that as long as the real street distance is Let pi and pj, i, j ∈ I, be two nodes that
employed, the proposed clustering solver is able belong to two different clusters Cr and Cs
to consider geographic obstacles effectively. Of respectively. If pi and pj connect by an edge (pi,
course some stops are left un-clustered because pj) ∈ E, then pi and pj are called boundary nodes,
they are not reachable via streets. Hence in the and the move pi → Cs, which reassigns pi to the
following discussion we will denote the distance cluster Cs (dropping it from Cluster Cr) is called
between two nodes pi and pj by d(pi, pj) without an admissible move. Note that a given boundary
explicitly indicating the form of the distance node pi of Cr may have more than one
measure d, understanding that real street admissible move; that is, there may be an edge
distances may be used in conditions where they (pi, pq) ∈ E joining pi to a node pq that belongs
are appropriate. to a cluster Ct different from Cs. Furthermore, an
admissible move may not be a feasible move in closest to the centroid of the nodes in N, and
the sense of preserving connectivity. create the initial form of the first cluster by
We consider the following two balancing setting C1 = {i*}, together with setting Io =
scenarios. {i*}.
Let Navg denote the “average” number of For h = 2 to k:
nodes in a cluster, where Navg = the nearest Select a node pi*, i* ∈ I – Io whose
integer neighbor of n/k and k is the number of minimum distance from the nodes pj, j ∈ Io
clusters to be created. For any boundary node pi is maximum; i.e.,
in a cluster Cr, define the balance value of an i* = arg max (Min(d(pi, pj): j ∈ Io): i ∈ I –
admissible move pi → Cs to be: (1) 2 if | Cr | > I o) .
Navg and | Cs | < Navg, (2) 1 if (1) is not true but | Let Ch = {i*} and Io := Io ∪ {i*}.
Cr | > | Cs | + 1; (3) 0 if | Cr | = | Cs | + 1, and (4) Endfor
−1 if | Cr | ≤ | Cs |. Denote the centroid of cluster Ch by ch, for h
Define the weight W(C) of a cluster C to be ∈ K = {1, …, k}.
the sum of the weights in C, and let Wavg denote 1. (Assignment Step) For each node pi, i ∈ I
the overall average cluster weight to be the sum assign node i to the cluster Ch* whose
of all weights divided by the number of clusters centroid ch is closest to pi; i.e., select h* = arg
k. Then, for cluster C, we define the absolute min(d(pi, ch): h ∈ K) and set Ch* := Ch* + pi.
value weight measure WM(C) = |W(C) – Wavg|. (If ties exist in identifying h*, select h* to be
Note this value is 0 if the weight of C equals the the tied value of h such that Ch contains the
“perfect” weight Wavg, and otherwise WM(C) is a fewest number of nodes.)
positive value indicating how much W(C) differs Identify the new centroid ch of each resulting
from this perfect weight. Finally, define the cluster Ch.
weight value of the move pi → Cs to be WM(Cr) If no centroid ch changes, then stop.
+ WM(Cs) – WM(Cr – pi) – WM(Cs + pi), where Otherwise, repeat the Assignment Step.
Cr – pi is the cluster that results from dropping pi 2. (Improving Method) Improve the initial
from Cr and Cs + pi is the cluster that results by solution:
adding pi to Cs. This weight value is the The goal is to select an admissible move
improvement (if positive) or the deterioration (if having a largest non-negative balance value
negative) in achieving the target weights Wavg and, subject to this, having a maximum
for the two clusters that are changed. weight value. Define a move to be improving
Modified K-means algorithm outline: if it has either (1) a positive balance value or
0. (Starting Method) Generate an initial (2) a non-negative balance value and a
collection Ω = {C1, …, Ck} of k clusters: positive weight value.
Select k initial “seed points” by the following Make improving moves until no more remain,
rule, where Io denotes the index set of nodes or the predefined number of iterations has
currently selected. been reached. Save the current best solution.
First choose a node pi*, i* ∈ I, that lies Special Provision: Each time an improving
move is made, it must be checked for track of the best cluster assignment for each
feasibility by a “Feasibility Checking node, which will change only if the evaluation
Method” subsequently described. If a chosen of assigning this node to the new cluster
move is infeasible (i.e., if it destroys the qualifies it as the new best, or if the previous
connectivity of a new cluster created) then a best cluster for the node happens to be the
“Feasibility Preservation Method” is cluster that has changed. By this means, each
employed that modifies the choice of an successive step of evaluations and comparisons
admissible move. can be performed very rapidly.
3. (Optional step) Generation of new solution The second main divergence from the
space: classical K-means approach is to change the
Select a new set of seed points and repeat evaluation rule that only accounts for the
steps 0 to 2. distances from nodes to cluster centroids. We
Enhanced assignment method for step 1 of propose to replace this evaluation by a “Min
the modified K-means algorithm Worst Deviation Rule,” which we describe in the
We now introduce an enhanced assignment form where it is joined with the one-at-a-time
algorithm to replace the approach of Step 1 of rule, as follows.
the preceding method, by incorporating a Min Worst Deviation Rule: During the
strategy that further diverges from the classical process of constructing the clusters, each cluster
K-means method in two important ways. is to receive a weight as close as possible to a
First, instead of assigning each node to the target weight given by Target(C) = |C|AvgWeight,
cluster whose centroid is closest and updating where AvgWeight = the sum of all node weights
the centroids after all nodes have been assigned divided by n (the number of nodes). Upon
as the original K-means algorithm does, for each adding a node to a given cluster C to produce a
node we evaluate the assignments for all clusters new cluster C+, the number of elements in C+
and then make only the assignment with the will be given by |C+| = |C| + 1. We apply a
highest evaluation. The new cluster thus choice criterion that selects a node to add to C
produced is immediately updated before making that will make the new weight W(C+) of C+ as
any further assignments, thereby identifying a close as possible to Target(C+). Upon identifying
changed centroid for this new cluster that can a “best node” pi to add to each cluster Ci (i =
change the evaluations previously generated. 1, …,k) by this criterion, then we identify the
This one-at-a-time process of changing specific cluster Cq such that adding pq to Cq will
clusters could require greater execution time, but minimize the worst deviation of any present
it allows a more responsive mechanism for cluster from its target weight. Cq will often be
evaluating assignments. Moreover, the execution the cluster that already has the worst deviation
time can be greatly reduced by accounting for from its target weight, and then we will just pick
the fact new evaluations need only be carried out the best possible pq to add to it.
in relation to the cluster that changes. We can We observe that the strategic ideas embodied
use an efficient update derived from keeping in this rule can be applied in other settings by
changing the definition of Target(C) to reflect identify the node and edge sets respectively that
the objectives of alternative contexts. belong to the set of all clusters Ω.
The Min Worst Deviation Rule can be We specifically make use of the following
applied to Step 1 of the Modified K-means definitions.
algorithm as follows: Adjacent(p) = {q: (p, q) ∈ E}
Once the seed points are selected, then we
By common terminology, we say q is
choose a node p to add to a chosen cluster C at
adjacent to p if q ∈ Adjacent(p), hence if p and q
each step based upon: first, the added node p
are joined by an edge. (By symmetry, q ∈
must be adjacent to a node that already belongs
Adjacent(p) evidently implies p ∈ Adjacent(q).)
to the cluster C. Second, pick p and C so that the
clusters are as close as possible to being Outside = N – N(Ω)
balanced by weight at each step, employing the Hence Outside is the set of nodes that do not
criterion of the Min Worst Deviation rule. This belong to any cluster.
constructive method is additionally controlled to Together with the foregoing, we use the
prevent the number of nodes in any cluster from notation T(C) to refer to a selected spanning tree
growing too large. contained in cluster C, and denote the root of T
However, our proposed modifications of the = T(C) by r. The purpose of utilizing a spanning
K-means algorithm are not yet sufficient to tree is to gain an efficient data structure for the
handle our goal of creating balanced and solver implementation, effective evaluation of
connected clusters. A number of subtle solutions’ qualities, and feasibilities. (The
considerations must be taken into account to particular cluster C that r is associated with will
achieve this goal, which we address by always be clear from the context, so we do not
identifying a more comprehensive algorithm that have to write r = r(C).) In the following
takes our preceding observations to a discussion, r is the node selected to be the
significantly more advanced level. centroid of cluster C in the solution procedure.
Key Observations:
3.2 Advanced Balanced and Connected 1) The Connectivity Requirement is
Cluster Algorithm equivalent to saying that it is possible to identify
Our advanced Balanced and Connected a spanning tree T(C) associated with each cluster
(B&C) cluster algorithm consists of a C.
Construction Phase and an Improvement Phase. 2) The tree T(C) can be oriented so that each
We begin by describing the Construction Phase node p of the tree (hence every node of C)
which contains ideas that are fundamental for except for the root r, has a unique predecessor
setting the stage for the Improvement Phase. which we will denote by pred(p). A trace of
Basic notation and definitions predecessors, that starts with a node p and
For any subgraph S of G, let N(S) and E(S) iteratively sets p:= pred(p) identifies the unique
respectively denote the node and edge sets of S. path in the tree from the initial node p to the root.
For example, we write N(Ω) and E(Ω) to Each node on a predecessor path is distinct from
all other nodes on the path (understanding that ambiguous by referring to more than one tree
the path ends with a root r, where by convention T(C). Consequently, a single distance function
pred(r) = –1). Distance(p) can be used for all trees. Given that
3) When C gains a new node q during a Distance(p) = 0 identifies p as a root node r, we
constructive process, it also gains all edges employ the convention that Distance(p) = – 1 if
connecting q to other nodes of C. However, T(C) p does not belong to any cluster C at the present
gains exactly one of these edges, together with state of construction.
the node q. (Any of these edges can be added to We make use of the distance function to
T(C) to create a new tree, but we will choose a decide which nodes and edges should be added
specific edge based on a rule described to clusters during the construction phase. For
subsequently.) this, we first need to identify the cluster nodes to
Let length(p, q) denote the distance measure which new edges (leading outside the cluster)
between nodes p and q, which is assumed can be permissibly added.
positive if p ≠ q. For a tree T = T(C), we define: A node p ∈ C is called extensible if there
exists an edge (p, q) joining p to some outside
Distance(p) = distance measure of the path
node q (i.e., q ∈ Outside, and more particularly,
from the root r to p; i.e., the sum of the
q ∈ Outside∩Adjacent(p)).
quantities length(p, q) over all edges (p, q)
This terminology is motivated by the
∈ E that lie on the path
observation that we can select the node p and
By convention, Distance(p) = 0 if p is the add the edge (p, q) to “extend” cluster C
root node r. In the following description we will (causing it to contain an additional node). As
refer to Distance(p) as the Distance Function. emphasized in Key Observation (1), the fact that
We point out that if the lengths of all edges we identify the edge (p, q) is important to assure
in the graph are 1, then the Distance function that the cluster is connected. As a natural
corresponds to the Depth Function. Our counterpart of this terminology, a cluster C is
discussion, however, will be based upon the called extensible if it contains at least one
Distance Function and all propositions and extensible node.
observations are applicable to the special case We denote the set of extensible nodes in C
where the Depth Function is used in place of the by Extensible(C) and the set of extensible nodes
Distance Function. in all clusters (hence all extensible nodes)
It is important to keep in mind that the simply by Extensible without mention of a
distance function Distance(p) is defined relative particular cluster C.
to the specific spanning tree T = T(C) associated Analogous to the definition of extensible
with cluster C, and that different trees (or nodes, which belong to clusters, we also
different roots) within the cluster would produce consider reachable nodes, which do not belong
different distance functions. Moreover, since all to any cluster (i.e., which are outside nodes) and
the clusters C ∈ Ω are node-disjoint, there is no which are joined by edges to extensible nodes.
danger that the definition of Distance(p) will be Hence, in particular, we consider the set of
nodes reachable from a given extensible node p Property if and only if T(C+) is generated by the
by Min Distance Linking Rule.
Reachable(p) = {q ∈ Outside: (p, q) ∈ E} Additional rules for generating T(C)
For each node p ∈ Extensible, we define:
Similarly, we define the set of nodes
reachable from an extensible cluster C by BestNeighbor(p) = {q ∈ Outside |
min(length(p, r), (p, r) ∈ E and r ∈
Reachable(C) = {q ∈ Reachable(p): p ∈
Outside)},
Extensible(C)}.
i.e., the outside node closest to p. The number of
Generating a specific tree T(C) associated
elements in BestNeighbor(p) for any p can be 1
with C
or more.
By making use of the preceding definitions,
We identify a set of “best” reachable nodes,
we specify the following rule for generating the
which consist of nodes that can be reached by
tree T(C), which will be employed at each
edges from extensible nodes of C.
iteration of a constructive method.
BestReachable(C) = {q ∈ BestNeighbor(p),
Min Distance Linking Rule. Given the choice
for any p ∈ Extensible(C)}.
of an extensible cluster C, and the choice of any
given node q ∈ Reachable(C) to create a new The reason for calling this a “best” set refers
cluster C+ by adding q to C, create an extension to the fact that we will always select nodes from
of the tree T(C) to produce a new tree T(C+) by this set as a basis for extending C to create a
adding the specific edge (p, q) ∈ E, where the new cluster.
node p ∈ Extensible(C) is chosen to satisfy Finally, it is convenient to identify the index
Distance(q) = Min(Distance(h) + length(h, q): i of a cluster C = Ci (i = 1,…,k) such that p ∈ Ci
for all h ∈ Extensible(C) and (h, q) ∈ E). by setting Cluster(p) = i. By convention,
The significance of this rule is demonstrated Cluster(p) = 0 if p does not yet belong to any of
as follows. the clusters under construction.
Define a minimum path to be a path between Best Reachable Choice Rule: Let C* denote
two nodes having the minimum sum of length(p, the cluster that is chosen to be extended. Then
q), where p and q are on the path and (p, q) ∈ E. choose a node q to add to the cluster C* by
We say a tree T(C) has the Min Path requiring that q ∈ BestReachable(C*).
Property if the predecessor path from every Associated with q, identify an edge (p, q) such
node p in T(C) to the root r is a minimum path in that p ∈ Extensible(C*), and complete the
the subgraph for cluster C. Then we can make process of adding q to C* by setting Distance(q)
the following observation. = Depth(p) + length(p, q), Cluster(q) = Cluster(p)
Proposition 1 Assume T(C) has the Min Path and pred(q) = p, hence adding node q and edge
Property, and consider any node q ∈ (p, q) to the tree T(C*) (and yielding |Ci| := |Ci| +
Reachable(C) that is added to C to produce a 1 where i = Cluster(p)).
new cluster C+. Then the new tree T(C+) This rule is motivated by the fact that it will
associated with C+ will satisfy the Min Path cause paths starting from the centroid of a
cluster generated due to adding new nodes to given in a natural way by the fact that our first
increase by as little as possible. Based upon the priority is to create clusters that are balanced by
property of BestReachable, it is clear to cardinality. To exploit this objective, define
understand that by this process of adding a node MinCardinality = Min(|C|: C is extensible)
q that creates the shortest path from the centroid
PreferredClusters = {C is extensible: |C| =
in C, we avoid growing trees with long paths,
MinCardinality}
and favor bushy trees whose terminal nodes are
Preferred Cluster Rule: Apply the Best
relatively close to the root. It can also be seen
Reachable Choice Rule by selecting C* to
that the foregoing rule also implicitly embodies
satisfy C* ∈ PreferredClusters.
the Min Dist Linking Rule within it. Hence, as a
This rule means that we will not build up the
result of Proposition 1 we may state
size of a larger extensible cluster before building
Proposition 2 If the Best Reachable Choice Rule
up the size of a smaller one.
is applied at every iteration of a constructive
Identifying the Chosen Cluster C* by the
method (starting from initial clusters that consist
Second Priority of Weight: We seek to include
of a single node) then every tree T(C) that is
the influence of weight balance in this choice.
generated will satisfy the Min Path Property.
We assume that more than one option exists for
The basis for this rule is further supported by
creating a new tree by the preceding choice rules,
the following definition and observation:
so that there is latitude to choose among these
We define a cluster C to be compact if every
options in a way that favors producing clusters
extensible node p in C satisfies min(Distance(q))
with balanced weights.
= Distance(p) + length(p, q) for some q in
As earlier, we consider a target value for the
Reachable(C).
weight of a cluster C given by Target(C) =
Proposition 3 All clusters will be compact at
|C|AvgWeight, where AvgWeight is the average
each iteration of a constructive process if and
of all node weights. Then we identify the
only if the Best Reachable Choice Rule is used.
(absolute value) amount by which the current
This compactness property has the following
weight W(C) of C deviates from meeting this
motivation. While the same cluster can have a
target:
variety of different trees associated with it, if we
maintain the tree compact the cluster C itself CurrentDeviation = |W(C) – Target(C)|
will tend to be as “compact” as possible. Similarly, the new deviation that results from
Multiple criteria priorities considerations creating a new cluster C+, is given by
To complete the Construction Phase of the
NewDeviation = |W(C+) – Target(C+)|
B&C method, it remains to identify the specific
Then the improvement in the weight balance
cluster C* that will be chosen as a basis for
objective from a choice that produces a new
applying the Best Reachable Choice Rule
cluster C+ is given by the quantity
described above.
Identifying the Chosen Cluster C* by the WeightImprovement = CurrentDeviation –
First Priority of Balance: Our choice of C* is NewDeviation,
where a negative value represents a deterioration. 3.3 Construction Phase Incorporating a

Thus, when more than one option exists to Shortest Path Distance Measure
extend a cluster C = C* to produce a new cluster We introduce ShortestDist(ri: p) to denote the
+
C (which means either that there is more than shortest path distance from the root ri of cluster
one choice for C* in the Preferred Cluster Rule, Ci to node p. As mentioned earlier, this distance
or that there is more than one choice for a node can take any form appropriate to the problem at
hand, whether a Euclidean measure or a street
q that can be added to a given C* by the Best
network measure that accounts for geographic
Reachable Choice Rule), we select the option
obstacles.
that yields the largest value of
This distance will be calculated only for
WeightImprovement.
specific cluster and node pairs, so that
Giving Increased Priority to the Weight
ShortestDist(ri:p) will be known for every node
Criterion: Greater latitude for using the
p ∈ N(Ci). Similarly the distance ShortestDist(ri:
WeightImprovement criterion can result by q) will be calculated for q ∈ Reachable(Ci),
slightly revising the definition of where in the following, for convenience, we
BestNeighbor(p). Instead of denote this latter set by Reachable[i].
BestNeighbor(p) = {q ∈ Outside | The value ShortestDist(ri:p) will be accurate
min(length(p, r), (p, r) ∈ E and r ∈ for p ∈ N(Ci), but ShortestDist(ri: q) at first will
Outside)} be an “estimate” for q ∈ Reachable[i].
Subsequently, this estimate will be verified as
we may use the alternative definition
accurate when node q is transferred from
BestExtensible(C) = {q ∈ Outside | length(p,
Reachable[i] to N(Ci).
r) < minLen + Δ, (p, r) ∈ E and r ∈ We will also make use of a predecessor list
Outside)} pred(q) and a special additional predecessor list
where minLen is the minimal edge length pred(i:q), where the latter list is a list “parallel”
among all edges (p, r), p is an extensible node to Reachable[i], so that each q ∈ Reachable[i]
and r ∈ Outside. The quantity Δ represents a identifies a second node p = pred(i:q) where p ∈
selected nonnegative value. N(Ci), and (p, q) ∈ E.
More choices will exist for applying the Construction phase based on shortest
WeightImprovement criterion as Δ increases, distance values
0. Select an initial seed node for each Cluster Ci,
hence potentially improving the weight balance
i = 1,…,k to become the root node ri of the
at the risk of impairing the cardinality balance of
cluster, hence N(Ci) = ri and E(Ci) = ∅. At
the solution ultimately produced. This allows the
the same time, create Reachable[i]=
method to be adapted to handle situations where
Adjacent(ri). i = 1,…,k.
the cardinality balance objective does not
For each q ∈ Reachable[i], set
completely dominate the weight balance
ShortestDist(ri:q) = length(ri,q) and set
objective.
pred(i:q) = ri. Finally, for each i = 1,…,k, we
define the value MinReachableDist(i) and balancing objectives. (That is, we use these rules
identify the set BestReachable(i) based upon to choose the tied option that gives the largest
the definitions given above. value of WeightImprovement and to incorporate
1. Choose the cluster index i* to identify a the Revised Targets in successive iterations of
cluster Ci* from the set of PreferredClusters, the method.)
and choose the node q* so that q* ∈ Increasing the emphasis on the distance
BestReachable(i*). Then remove node q* criterion
from Outside (hence removing it from To handle those applications where it is
Reachable[i*] and also removing it from all desirable to place more emphasis to the distance
sets Reachable[i] that contain q*), and add q* criterion in creating balanced clusters, we
to N(Ci*). Accompanying this, set pred(q*) = proceed as follows.
p, where p ∈ N(Ci) and (p, q) ∈ E. We First, we choose among the candidates for
complete the updating of Reachable[i*] Ci* in the set of PreferredClusters by identifying
according to its definition (by forming its a node q* ∈ BestReachable(i) and defining
union with the set Reachable(q*)) in the next NextShortest(i) = Min(ShortestDist(ri:j): j ∈
step. BestReachable(i) – {q*}).
2. Define TrialDistance(h) = ShortestDist(ri*:q*) Then we define DifDistance(i) =
+ length(q*,h), (q*,h) ∈ E. NextShortest(i) – ShortestDist(ri:q*) and pick
For each h ∈ Reachable(q*): if h ∉ Ci* ∈ PreferredClusters so that DifDistance(i*) =
Reachable[i*] then update the shortest Max(DifDistance(i): Ci ∈ PreferredClusters).
distance value by setting ShortestDist(ri*:h) = Thus, we minimize the regret of not picking a
TrialDistance(h), setting pred(i*:h) = q* and particular shortest distance node whose next
adding h to Reachable[i*]. But if h ∈ shortest distance exceeds the shortest distance
Reachable[i*], then update the shortest by the greatest amount.
distance value only if TrialDistance(h) < To apply this type of distance criterion even
ShortestDist(ri*:h) (and add h to more strongly, we enlarge the set of
Reachable[i*]). PreferredClusters by redefining it to be
Taking advantage of ties PreferredClusters = {Ci is extensible: |Ci| ≤
We note there are multiple places in the MinCardinality + α}
foregoing method where ties may occur in the where α is a selected nonnegative number.
choice rules, as in selecting the cluster i* and in Selecting α to be large allows the distance
choosing the node q* once i* has been selected. criterion dominate the cardinality balancing
(In addition, there may even be ties possible in criterion entirely.
selecting the node p* = pred(i*:q*), though we
have treated this predecessor node as unique for 3.4 Overall Structure of the B&C Cluster
simplicity.) The above discussion of weights and Method
targets in multiple criteria considerations gives The overall structure of the B&C Cluster
rules for breaking these ties to handle weight Method, which includes an Improvement Phase
to augment the fundamental Construction Phase, be handled in two main ways.

can now be sketched as follows. After Procedure 1 When the method selects new
expressing the method in outline form, we seed nodes at some iteration of Step 3, identify
subsequently describe the Improvement Phase in the best set of clusters produced since the last
detail. time that seed nodes were generated, and specify
0. Choose initial seed nodes as roots and initial the new seed nodes to be k nodes from N that are
Target(Ci) values for the clusters Ci, i =1,…,k. (respectively) closest to each of the centers of
Then, until a termination condition is reached, gravity of the k clusters.
execute the following: Procedure 1 corresponds to the one used to
1. Apply the Advanced Construction Phase. generate new seed nodes in the Modified
2. Apply the Improvement Phase. K-means Algorithm, except that the k clusters
3. Until a stopping criterion is met, choose new selected are produced by a different method.
seed (root) nodes and/or new target values
Procedure 2 Identify a best set of k clusters
and return to Step 1.
as in Procedure 1. For each such cluster C,
The Construction Phase of Step 1 and the
determine the shortest paths between all pairs of
Improvement Phase of Step 2 can be coordinated
nodes in C using the distance measure that
in a different fashion by requiring the
assigns the length of each edge. Then select a
Improvement Phase to invariably follow the
seed node r for C that minimizes the quantity
Construction Phase. For example, the
Improvement Phase can be skipped on selected ∑ (|ShortestDistance(r, p) – AvgDistance|: p
iterations, as by executing a series of iterations ∈ N(C), p ≠ r)
where only the Construction Phase is applied, where
and then applying the Improvement Phase AvgDistance = (∑(ShortestDistance(p, q): (p,
starting from the best outcome produced by the q) ∈ E(C)))/|E(C)|
Construction Phase during these iterations.
Procedure 2 is clearly more elaborate than
Alternatively, a simple version of the approach
Procedure 1, but may produce seed nodes that
might skip the Improvement Phase entirely, or
ultimately result in better clusters than otherwise
instead perform a single execution of the
obtained on subsequent iterations.
Construction Phase followed by a single
execution of the Improvement Phase, and then
3.4.2 Improvement Phase
stop. The stopping criterion of Step3 can be
The complete form of the Improvement
based on customary factors such as the
Phase is based on introducing special processes
predefined total computational time exceeds or
to exploit tree structures.
no more improved solution can be found. At an
extreme the method may terminate at the Fundamental definitions related to exploiting
conclusion of a single iteration. tree structures
A node q is called a successor (or descendant)
3.4.1 Choosing New Seed Nodes of p if p can be obtained by a predecessor trace
The issue of generating new seed nodes can starting from q, and q is called an immediate
successor (or child) of p if p = pred(q). Recall that the Connectivity Condition

In particular, let Child(p) = {q ∈ Adjacent(p): stipulates that each cluster within Ω is connected,
pred(q) = p}. Note that since the predecessor and note that this condition is satisfied upon the
array pred(p) automatically identifies a node in termination of the Construction Phase, and
specific cluster p belongs to (i.e., the cluster Ci hence at the beginning of the Improvement
for i = Cluster(p)), the set Child(p) also refers to Phase.
such a specific cluster. Hence automatically, Connectivity Relationship: Assume that Ω
Cluster(q) = Cluster(p) for all q ∈ Child(p). satisfies the Connectivity Condition. Then the
For a given cluster C: move c → D that produces the two new clusters
Let T(C: p) denote the subtree of T(C) that is C’ = C – c and D’ = D + c will yield a new Ω
rooted at a given node p; i.e., T(C: p) is the tree that satisfies the Connectivity Condition if and
consisting of p and all successors of p in T(C). only if V’(C, c) contains at least one re-rootable
Let c denote the node that is dropped from C tree, and every subtree T(C:p) that is not
by performing the move c → D (i.e., transferring re-rootable belongs to a re-rootable collection,
node c from its current cluster to a cluster D) in where p ∈ Child(c).
the Improvement Phase to create the new cluster Feasibility checking and re-rooting
C’, where by C’ = C – c. A valuable feature of the Connectivity
Associated with C’ and c, let S’(C: c) be the Relationship is that it can be checked without an
subgraph of T(C) which arises by deleting all excessive amount of effort. This is based on
nodes of T(C:c) from T(C) (hence also deleting executing a “re-rooting procedure” for
all edges of T(C) that meet these nodes). identifying re-rootable subtrees and re-rootable
Likewise associated with C’ and c, let V’(C:c) collections that will establish the indicated
be the subgraph of T(C) consisting of all connectivity.
subtrees T(C:p) such that p ∈ Child(c). To be precise, we identify a Re-Rooting
We observe that S’(C:c) is itself a subtree algorithm as follows. The re-rooting method
within T(C), and the subgraph of T(C) that ensures the connections of the subtrees in the
results by dropping c from T(C) (and all edges process of applying the Connectivity
of T(C) meeting node c). T(C) is precisely the relationship. The efficiency of this method
union of S’(C:c) and V’(C:c). depends on the structure of the graphs
A subtree T(C: p) in V’(C: c), where p ∈ encountered.
Child(c), will be called re-rootable if there exists It is important to observe that the following
an edge (c1, c2) ∈ E(C) such that c1 ∈ N(T(C:p)) algorithm makes use of the Distance function,
and c2 ∈ N(S’(C:c)). whose initial values are inherited from the
Finally, a collection of subtrees T(C:p) in Construction Method. In the following
V’(C:c) will be called a re-rootable collection if discussion, we assume that node c is removed
there exists a set of edges in E(C) that join these from the current cluster C.
subtrees to create a tree that includes a Re-rooting algorithm
re-rootable tree. 1. For a given subtree T(C:p) (rooted at a node p
∈ Child(c)), let c1 denote the first node found before it is performed. If the move is not feasible,
in the subtree that links by an edge to the then another move must be chosen instead.
subtree S’(C: c). (That is, c1 is found starting
at p and if p itself doesn’t link to S’(C:c) then 3.4.3 Feasibility Preserving Method for
we move on successors of p to look for such Re-Rooting
a link.) The Feasibility Preserving Method, can be
2. Given c1, choose c2 to be the specific node in employed when speed of execution dominates
S’(C:c) given by Distance(c2) = other concerns, and when the rejection of
Min(Distance(q): (c1, q) ∈ E and q ∈ N(S’(C: possibly feasible moves is not considered to be a
c)); i.e., (c1, q) joins c1 to S’(C:c)). great drawback. In fact, the Feasibility
3. If c1 differs from p, reverse the predecessor Preserving approach may be able to identify a
path from p to c1 by taking each edge (q, j) significant number of cases where feasible
on this path such that q = pred(j) and moves exist
re-setting pred(q) = j. Finally, set pred(c1) = This method shares the ability to operate
c2. without reference to the Distance function.
Change the distance measure of any node h Instead we make reference to an alternative
on the new predecessor path, going from c1 in predecessor PredA(p), and additionally make
reverse to p, as follows: Distance(q) = use of a function Safe(p), where Safe(p) =
Distance(p) + length(p, q), where p = pred(q). TRUE if it is safe (i.e., feasible) to choose node
Feasibility Checking consists simply of the p as the node c in the transfer-move c → D, and
following operation. First we identify the Safe(p) = FALSE otherwise (i.e., if we do not
indicated node c1 on each subtree T(C: p) know whether the move c → D is feasible, based
(before re-linking c1 to a node c2). Then we on the information processed by the method).
conclude that the move is feasible if we find a The ability to simply check whether Safe(p) =
qualifying node c1 on each of these subtrees. TRUE or FALSE when considering a
Otherwise, we reject the move. transfer-move greatly accelerates the
Implementation of the improvement phase Improvement Method.
It may be expected that in most cases the In the following procedure, the sets Child(p)
choice of a move c → D by the Improving and Adjacent(p) are treated as ordered sets (lists)
Method will automatically preserve the and their elements are always to be examined in
connectivity of C’. Hence we will not bother to the same sequence.
apply the Connectivity Relationship procedure Feasibility preserving re-rooting algorithm:
discussed above to check whether each move (Initialization) When the Improvement Method
being considered is feasible (as opposed to a is launched (at the conclusion of the
move that is finally chosen to be executed). Construction Method) perform the following
However, once a particular move c → D has steps.
been chosen for execution, then it must be For each i = 1,…k, execute the following
analyzed to make sure the move is feasible steps:
Initialize Cluster Ci Elseif j = c, then

(a) For each node p ∈ N(Ci) (hence i = continue with Next q (retain
Cluster(p)), set Safe(p) = FALSE and PredA(p) = 0)
PredA(p) = 0. Else set j = pred(j) and return to (2.0)
(b) Let r = root(Ci) and for each c ∈ N(Ci) Endif
(anticipating the potential later use of c as a End Trace of q
node that will take part in a move of the form Endif
c → D), carry out the following operations: Next q ∈ Adjacent(p) (until all are
Set ScanChild = Child(c) examined, unless jump out)
Set FailTest = FALSE (the following Test FailTest = TRUE (Here PredA(p) = 0).
Routine is assumed not to fail, unless End Update PredA(p)
otherwise determined to do so) Endif
If Child(c) is empty, set Safe(c) = TRUE, Next p ∈ ScanChild (until all are examined)
and skip the Test Routine, to examine next If NextTrace = TRUE return to (1.0)
c ∈ N(Ci). If FailTest = FALSE then set Safe(c) =
Special case: If c = r execute Special Setup TRUE and End Test Routine (jump out, no
below in place of the following need to continue trace)
Test Routine and Restore Routine. End Test Routine
Test Routine Restore Routine
(1.0) Set NextTrace = FALSE For each p ∈ Child(c) set pred(p) = c
For each p ∈ ScanChild End Restore Routine
If PredA(p) = 0 (automatically true if Trace Next c ∈ N(Ci)
= 1) then End Initialize Cluster Ci
Update PredA(p) Now we handle the special case for c = r:
For each q ∈ Adjacent(p) such that Special Setup (when c = r):
Cluster(q) = i Select the first p ∈ Child(c) and denote it
If q ≠ c then by p*.
Begin Trace of q: Set PredA(p*) = 0.
Set j = q If |Child(c)| = 1 (i.e., Child(c) = p*) then
(2.0) If j = r, then set Safe(c) = TRUE, and nothing
set PredA(p) = q, pred(p) = q. more needs to be done.
If FailTest = TRUE set NextTrace Else
= TRUE. Set Child(c) = Child(c) – {p*} (retaining
Remove p from ScanChild, the order of remaining elements of
set FailTest = FALSE and End Child(c), treated as an ordered list), set
Update PredA(p) (jump out of pred(r) = p* and then r = p*.
Update PredA(p) to get next p ∈ Execute Test Routine and Restore
Child(c)) Routine as indicated above (now c = the
old “true” r, not the current r). Tree Structure Update for Cluster C* = Ci*
Finally, set r = c, add p* to the front of For each p ∈ Child(c*) set pred(p) =
Child(c) and set pred(p*) = r. PredA(p)
Endif If c* = root(Ci*), then
End Special Setup Let p* be the first element of Child(c*)
Underlying rationale (hence PredA(p*) = 0 and now pred(p*)
If the predecessor trace from q adjacent to p = 0). Set root(Ci*) = p*
reaches pred(p) (for pred(p) = c) first before End Tree Structure Update for Cluster C* = Ci*
reaching the root r, then the test fails (this q does Let d* denote the node of D* chosen for c*
not re-link except by passing through pred(p)), to attach to, and let h* = Cluster(d*) (hence Ch*
but keep looking for nodes other than q adjacent = D*).
to p to see if any of them can trace back and Tree Structure Update for Cluster D* = Ch*
miss running into pred(p). Set pred(c*) = d*
If the predecessor trace reaches r first, we End Tree Structure Update for Cluster D* = Ch*
succeed in finding path that skirts pred(p), and The preceding changes of course imply
can connect node p to the path when this is changes in the composition of the children of the
needed (if pred(p) becomes a node c that is nodes named as predecessors.
moved to another cluster). The stored value Finally, execute the Feasibility Preserving
PredA(p) gives node q that can become the new Re-Rooting Algorithm, but restricted to the two
predecessor of p (by setting pred(p) = PredA(p)) clusters Ci for i = i* and i =h*. Then proceed
when the current node pred(p) is moved. with a new iteration of the Improvement
Thus the underlying rationale is to determine Algorithm.
whether it is possible to execute a trace that It is possible to identify a faster algorithm for
reaches root r without going through pred(p). re-rooting the clusters Ci* and Ch*, but the
Even if Safe(pred(p)) = FALSE, we must still execution of the Feasibility Preserving
update because we may later remove a node q Re-Rooting Algorithm to achieve this re-rooting
that is a child of pred(p) having PredA(q) = 0, only needs to be performed when a move is
and with this q removed we may be able to set actually selected, and does not consume
Safe(pred(p)) = TRUE. excessive time.
Selecting a move
A move c → D is permitted in this method 4. Computational Results
only if Safe(c) = TRUE. The B&C algorithm of Section 3.4 was
Updating the cluster structures for a selected implemented using the C# programming
move c* → D* language, and the computational environment
Let i* = Cluster(c*). for all testing experiments was a desktop with
Together with the process of removing node Windows XP professional operating system,
c* from cluster Ci* and adding it to cluster D*, CPU (Platinum IV) speed 3.2 GHz, and 1 GB of
carry out the following updates. RAM. ESRI’s ArcInfo product was used to
generate the Thiessen polygons for all sets of improvement procedure always starts with the
point data. The datasets were collected from the current best solution, and the procedure
real logistics applications in different countries. terminates if no improvement can be found. An
In order to consider travel distance and load effort to find the best improvement at each
balancing simultaneously, we introduce a iteration can be very time consuming, and
weighted objective function: consequently we adopted a threshold that
a1 * Dist + a2 * Dev determines the degree of improvement that is
considered admissible for selecting a move; i.e.,
where Dist is the value of sum of average travel
if the improvement in the objective function
distance of each cluster while Dev is the sum of
exceeds this threshold the move will be accepted
deviations of quantities to be balanced in each
cluster. Parameters a1 and a2 can be manually immediately instead of looking for further
adjusted to reflect business practice. For improvement. Based on preliminary experiments,
example, in some cases it may be more desirable we set this threshold to be 0.01. Furthermore, in
to minimize the average travel distance while in order to better evaluate the objective function,
other cases it may be preferable to build well we normalize the values of Dist and Dev to lie in
balanced clusters. The ability to combine the the range [0, 1]. The datasets in the following
effects of these two factors (travel distance and computational experiments are extracted from
balancing quantity) facilitates the real logistics applications.
implementation of the solver (by turning the We present the computational results to
problem into one having a single objective) and demonstrate the following facts:
provides flexibility for meeting various business
z The B&C parameters impact overall results
needs.
(balance vs. compactness)
Based upon the methodology of re-rooting,
z The B&C algorithm is able to create
involving the addition and deletion of nodes
balanced and connected clusters, and
from clusters, we implemented the following
z There is a significant difference between the
inter-cluster improvement steps:
z Node transfer: a node is removed from its
clusters produced by the B&C algorithm and
current assigned cluster and added to another conventional clustering algorithms such as
cluster if the objective function value can be the K-means method.
improved (reduced) Dataset 1 In this dataset, there are 328 stops
z Node exchange: two nodes exchange their with each stop having capacity ranging from 1 to
assigned clusters as a basis for improving 13. Figure 6 shows the locations of these stops.
the objective function The stops are not evenly distributed in the
The improvement procedure first performs a underlying area. Figure 7 and Figure 8 show the
node transfer operation followed by a node results for the two different parameter settings.
exchange operation, and the entire improvement The outcomes shown below were created by the
sequence is performed three times. The algorithm without any human intervention.
indicates that the clusters produced by the B&C

algorithm have very clean boundaries, and do
not suffer the defect commonly encountered in
this type of setting where some stop is enclosed
spatially within a cluster other than the one to
which it is assigned (i.e., the cluster that actually
contains the stop “interpenetrates” the second
cluster). By putting a heavier weight on the
balancing factor, all clusters are well balanced.
Figure 6 Stops for dataset 1
When the focus is on minimizing expected
travel time, the algorithm delivers the desired
result while keeping each cluster well bounded.
Table 1 Results for dataset 1

Average
Problem Cluster Computational
travel
type capacity Time
distance
a1 = 0.8 983 342
and a2 = 983 424 32 sec.
0.2 988 176
a1 = 0.2 985 340
Figure 7 Result for dataset 1 where a1 = 0.8 and a2 = and a2 = 984 432 34 sec.
0.2 0.8 985 173
Dataset 2 in this dataset we consider a real

service territory planning problem having a large
quantity of stops to be clustered. Figure 9
depicts the stops for this dataset.
Figure 8 Result for dataset 1 where a1 = 0.2 and a2 =

0.8
Table 1 summarizes the computational

results for these two parameter settings.
Figure 9 Stops for dataset 2
The table confirms that the different
parameter settings have a significant impact on This application contains 22052 stops, each
the solution results. In addition, the map display with a capacity of 1 unit. The objective is to
build clusters having a balanced number of stops. esthetically pleasing.

Based upon the business logic, the entire As mentioned earlier, we have not found a
territory should be divided into five (5) service clustering algorithm that is able to handle the
territories so that one territory will be serviced challenge of creating balanced and connect
every weekday. Furthermore, the service center clusters found in the logistics industry. In order
is located in the center of the area, and the user to demonstrate the significant difference
requires that each cluster should be structured to between the proposed algorithm and a widely
access this service center. used alternative clustering algorithm, we
We set 5 seed points around the service conduct benchmark tests comparing the
center to induce the clusters to start from the outcomes of our approach with those obtained
service center. Figure 10 shows the visual by the K-means algorithm. All datasets are
representations of the clusters for this dataset. selected from real applications, albeit
geographic locations are different. We have
restricted the form of the benchmarks to permit
the K-means algorithm to be applied. In
particular, while our algorithm is able to handle
heterogeneous stops having different capacities,
such a scenario cannot be handled by the
K-means algorithm. Therefore, to permit a
comparison we set all capacities to be 1. The
balancing goal then becomes that of balancing
Figure 10 Visual result for dataset 2 the number of stops in each cluster.
Computational results are listed in the following
The visual display for this dataset shows that table, which identifies the size of a problem in
all clusters start from the center of the terms of the number of stops to be clustered and
underlying territory, which satisfies the initial the ideal number of stops in each cluster. For the
business requirement. Several stops (displayed K-means and the B&C algorithm, we list the
in green) located on the north tip appear to be computation times and report the minimum and
separated from other stops of the same cluster. maximum number of stops for each problem to
Based upon the Thiessen polygon created for show the degree of balance achieved.
this dataset, these stops are in fact adjacent to The table confirms that the K-means
other stops in the same green cluster; hence, the algorithm is very fast, as expected. However, it
results do not violate the connectivity condition lacks the capacity to create well balanced
by reference to this polygon. Furthermore, since clusters and consequently performs poorly
the clusters have very clean boundaries, the user relative to this criterion, severely restricting its
of the system can adjust the result by moving value for applications in the logistics industry.
selected groups of stops between clusters to The B&C algorithm takes longer to achieve its
make the visual appearance of the outcome more results but produces clusters that are appreciably
superior in terms of the balance criterion. It is distributed in the area.

interesting to note that as the number of clusters A real world application
increases, the computation time usually Our B&C algorithm has been used in a real
decreases for a given dataset. This reflects the world application for a delivery service
fact that when the number of clusters increases, company as a system to plan daily delivery
the B&C algorithm requires fewer steps to territories. The results dramatically confirmed
improve the overall solutions (clusters). The the effectiveness of the algorithm, which made it
computation time also depends on the possible to reduce the number of vehicles
geographic characteristics of the area where the employed while increasing the number of
stops are located, and on how the stops are customers served per vehicle. In addition, a
Table 2 Computational comparisons
Problem Num. K-means Algorithm Our algorithm

Size clusters
(number (Avg. # Min. stops Max. Min. stops Max.
CPU
of stops) stops) in a stops in a in a stops in a CPU time
time
cluster cluster cluster cluster
1063 3 (354) 61 770 1 sec. 354 355 19 min. 23 sec.
1063 9 (118) 41 348 2 sec. 118 118 7 min. 2 sec.
1063 15 (71) 26 255 2 sec. 70 71 6 min.
1884 5 (377) 10 878 2 sec. 293 411 3 min. 2 sec.
1884 10 (188) 10 564 2 sec. 177 189 2 min. 50 sec.
1884 17 (111) 13 329 2 sec. 101 120 1 min. 23 sec.
3058 10 (306) 173 808 6 sec. 300 320 4 min. 20 sec.
3058 15 (204) 105 698 12 sec. 200 208 3 min. 52 sec.
3058 20 (153) 79 426 7 sec. 145 168 4 min.
5247 10 (528) 273 1226 18 sec. 510 545 25 min. 45 sec.
5247 20 (262) 135 848 10 sec. 253 270 10 min. 50 sec.
5247 30 (175) 51 676 10 sec. 143 192 6 min. 50 sec.
Table 3 Benefits of algorithm
Number of Vehicles Employed

Before System Deployed After System Deployed
40 20
Number of Customers Serviced per Vehicle
60 93
Number of Products Delivered per Vehicle
45 75
Average Vehicle Loading Rate
51% 81%
Time Spent on Planning Territories
days < 30 minutes
substantial increase resulted in the number of where particular service persons or vehicles
products delivered per vehicle, accompanied by cannot service particular stops in a territory (2)
a significant improvement in the average vehicle the introduction of adaptive memory
loading rates. Finally, the amount of time spent metaheuristics (principally tabu search) to guide
on planning the territories was reduced from the current local search process to yield better
days to less than half an hour. solutions; (3) the utilization of multi-core CPU
Table 3 displays the results before and after to speed up the algorithm; and (4) the creation of
the system was deployed. corresponding advanced data structures to
In summary, the algorithm succeeded in maintain algorithmic efficiency.
identifying a collection of territories that made it
possible for the vehicle routing system to create Acknowledgements
efficient delivery routes that achieved significant We would like to express our gratitude to
economic outcomes. two anonymous referees for their very useful
suggestions to improve our manuscript.
5. Conclusions
Our B&C algorithm for creating balanced References
and connected clusters provides an effective [1] Barreto, S., Ferreira, C., Paixao, J. & Santos,
means for exploiting Thiessen polygons, as B.S. (2007). Using clustering analysis in a
demonstrated by computational tests on datasets capacitated location-routing problem.
drawn from real world applications. The European Journal of Operational Research,
re-rooting component and utilization of 179: 968-977
spanning-tree data structure of our algorithm [2] Blakeley, F., Bozkaya, B., Cao, B. & Hall,
succeeds in retaining connectivity in an efficient W. (2003). Optimizing periodic maintenance
manner throughout the solution improvement operations for Schindler Elevator
steps. The computational outcomes further Corporation. Interfaces, 33: 67-79
disclose the algorithm’s robustness, which [3] Braysy, O. & Gendreau, M. (2005). Vehicle
enables a user to apply the algorithm without routing problems with time windows, part I:
extensive tuning and without having to change route construction and local search
parameter values to solve problems whose algorithms. Transportation Science, 39:
objectives belong to a common class. These 104-118
features afford significant advantages in [4] Braysy, O. & Gendreau, M. (2005). Vehicle
reducing the planning time and increasing the routing problems with time windows, part II:
quality of outcomes obtained in designing metaheuristics. Transportation Science, 39:
service territories. 119-139
Future research will focus on enhancements [5] Dondo, R. & Cerda, J. (2007). A
to handle additional problem considerations and cluster-based optimization approach for the
to introduce additional algorithmic features, multi-depot heterogeneous fleet vehicle
including (1) a capability to handle conditions routing problem with time windows.
European Journal of Operational Research, Statistics and Probability, 1: 281-297,

176: 1478-1507 Berkeley, University of California Press
[6] Estivill-Castro, V. & Lee, I. (2001). Fast [14] Strehl, A. & Ghosh, J. (2002).
spatial clustering with different metrics and Relationship-based clustering and
in the presence of obstacles. In: GIS’01, visualization for high-dimensional data
142-147, November 9-10, 2001 mining. INFORMS Journal on Computing,
[7] Fan, B. (2009). A hybrid spatial data 1-23
clustering method for site selection: the data [15] Zhang, B., Yin, W.J., Xie, M. & Dong, J.
driven approach of GIS mining. Experts (2007). Geo-spatial clustering with
Systems with Applications, 36: 3923-3936 non-spatial attributes and geographic
[8] Hruschka, H. & Natter, M. (1999). non-overlapping constraint: a penalized
Comparing performance of feedforward spatial distance measure. In: Proceeding of
neural nets and K-means for cluster-based PAKKD’07, 1072-1079, Springer-Verlag
market segmentation. European Journal of Berlin Heidelberg
Operational Research, 114: 346-355
[9] Huff, D.L. (1963). A probabilistic analysis of Buyang Cao has earned his B.S. and M.S.
shopping center trade areas. Land Economics, degrees in Operations Research at University of
39: 81-90 Shanghai for Science and Technology, and his
[10] Humair, S. & Willems, S.P. (2006). Ph.D. in Operations Research at University of
Optimizing strategic safety stock placement Federal Armed Forces Hamburg, Germany. He
in supply chains with clusters of is working as an Operations Research team
commonality. Operations Research, 54: leader at Esri, Inc., California, USA. Currently
725-742 he is also a guest professor at School of
[11] Karypis, G. & Kumar, V. (1998). A fast Software Engineering at Tongji University,
and high quality multilevel scheme for Shanghai, China. He has been involved and led
partitioning irregular graphs. SIAM Journal various projects related to solving logistics
of Scientific Computing, 20: 359-392 problems including Sears vehicle routing
[12] Kwon, Y.J, Kim, J.G., Seo, J., Lee, D.H. & problems, Schindler Elevator periodic routing
Kim, D.S. (2007). A Tabu search algorithm problems, a major Southern California Energy
using Voronoi diagram for the capacitated technician routing and scheduling problems, taxi
vehicle routing problem. In: Proceedings of dispatching problems for a luxury taxi company
5th International Conference on in Boston, etc. He published papers in various
Computational Science and Applications, international journals on logistics solutions, and
IEEE Computer Society, 480-485 he also reviewed scholar papers for several
[13] MacQueen, J.B. (1967). Some methods for international journals. His main interest is to
classification and analysis of multivariate apply GIS and optimization technologies to
observations. In: Proceedings of 5-th solve complicated decision problems from real
Berkeley Symposium on Mathematical world.
Fred Glover holds the title of Distinguished Theory Prize, an elected member of the National
Professor at the University of Colorado and is Academy of Engineering, and has received
Chief Technology Officer for OptTek Systems, honorary awards and fellowships from the
Inc. He has authored or co-authored more than American Association for the Advancement of
400 published articles and eight books in the Science (AAAS), the NATO Division of
fields of mathematical optimization, computer Scientific Affairs, the Miller Institute of Basic
science and artificial intelligence. He is the Research in Science and numerous other
recipient of the distinguished von Neumann organizations.

p24 CreatingBalancedAndConnectedCl

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

p24 CreatingBalancedAndConnectedCl

Uploaded by

Copyright:

Available Formats

J Syst Sci Syst Eng (Dec 2010) 19(4): 453-480 ISSN: 1004-3756 (Paper) 1861-9576 (Online)

DOI: 10.1007/s11518-010-5150-x CN11-2983/N

CREATING BALANCED AND CONNECTED CLUSTERS TO IMPROVE

Buyang CAO1 Fred GLOVER2

1. Introduction VRP/TW is an NP hard problem, and it has

© Systems Engineering Society of China & Springer-Verlag Berlin Heidelberg 2010

week or by a certain sets of drivers (vehicles). criteria can be expressed as bounds on

scheduling and dispatching for Schindler problem is to produce a minimum weight

Strehl & Ghosh (2002) developed an where:

data-mining problem found in retail-industry (such as square footage).

in a very high dimensional space. Their z α is an attractiveness parameter estimated

approach introduces a similarity relationship from empirical observations.

similarities are computed, the problem is from empirical observations.

the foregoing applications includes balanced clustering problems. Computational

logistics and service industry to plan and

geographically non-overlapped and balanced

predefined number of clusters to be created.

whose centroid is closest to the node; 3) Fan (2009), or

outcomes are to be expected because the

Figure 5 Result obtained based on real street

Figure 3 Stops to be clustered 3.1 Modified K-means Algorithm

where a negative value represents a deterioration. 3.3 Construction Phase Incorporating a

to augment the fundamental Construction Phase, be handled in two main ways.

successor (or child) of p if p = pred(q). Recall that the Connectivity Condition

Initialize Cluster Ci Elseif j = c, then

indicates that the clusters produced by the B&C

Table 1 Results for dataset 1

Dataset 2 in this dataset we consider a real

Figure 8 Result for dataset 1 where a1 = 0.2 and a2 =

Table 1 summarizes the computational

build clusters having a balanced number of stops. esthetically pleasing.

superior in terms of the balance criterion. It is distributed in the area.

Table 2 Computational comparisons

Problem Num. K-means Algorithm Our algorithm

Table 3 Benefits of algorithm

Number of Vehicles Employed

European Journal of Operational Research, Statistics and Probability, 1: 281-297,

You might also like