
int kdtree(int dim, int ndata, double *data, int kk,
           int *cluster_start, int *cluster_size,
           double **cluster_bdry, double **cluster_centroid,
           short *cluster_assign);

/************************************************************************
The Algorithm Skeleton:
jj = 1 // jj -- the number of clusters the dataset is partitioned into.
// At the beginning, jj=1, meaning only one cluster: the whole dataset.
while (jj < kk ) { // there are jj clusters at this moment
for (j=0; j<jj; j++) { // j loops through indices of all jj clusters
find the dimension of largest variance for the j-th cluster
call bipartition() to partition the j-th cluster into 2 clusters
}
jj = 2*jj
}

Array sizes:
data[ndata*dim] -- stores ndata data points, each of dim dimensions,
where data[i*dim], data[i*dim+1], ..., data[i*dim+dim-1]
form the i-th datum.
cluster_start[kk]-- stores the index of the starting datum of a cluster.
For example, cluster_start[k] stores the index of the
starting datum of the k-th cluster, when all data in
the same cluster are stored consecutively in data[].
cluster_size[kk] -- stores the number of the data items of each cluster.
For example, cluster_size[k] stores the number of the
data items of the k-th cluster.
cluster_bdry[kk][2*dim] -- For each k between 0 and kk-1,
cluster_bdry[k][2*j] and cluster_bdry[k][2*j+1]
store the min and max of all j-th dimension data
for the k-th cluster, for j=0, 1, ..., dim-1.
cluster_centroid[kk][dim] -- For each k between 0 and kk-1, cluster_centroid[k][0],
cluster_centroid[k][1], ..., cluster_centroid[k][dim-1]
store the centroid of the k-th cluster.
cluster_assign[ndata] -- passed through to bipartition(), which records in it
the cluster index each datum is assigned to.
**********************************************************************************/
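
The first step inside the loop of the skeleton above, finding the dimension of largest variance for one cluster, can be sketched as follows. This is only an illustration under stated assumptions: the helper name largest_variance_dim() is hypothetical and not part of the interface, the cluster is assumed to occupy the consecutive rows i0 .. im-1 of data[], and the population variance (dividing by n) is used.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original interface).
   Returns the index of the dimension with the largest variance among
   the data points i0 .. im-1, using the row-major layout data[i*dim + j].
   Returns 0 if the range is empty. */
int largest_variance_dim(int dim, int i0, int im, const double *data)
{
    int n = im - i0;
    int best_dim = 0;
    double best_var = -1.0;

    if (n <= 0)
        return 0;

    for (int j = 0; j < dim; j++) {
        double mean = 0.0, var = 0.0;

        /* First pass: mean of the j-th coordinate. */
        for (int i = i0; i < im; i++)
            mean += data[(size_t)i * dim + j];
        mean /= n;

        /* Second pass: population variance of the j-th coordinate. */
        for (int i = i0; i < im; i++) {
            double d = data[(size_t)i * dim + j] - mean;
            var += d * d;
        }
        var /= n;

        if (var > best_var) {
            best_var = var;
            best_dim = j;
        }
    }
    return best_dim;
}
```

The returned index would then be passed as chosen_dim when partitioning that cluster. The two-pass form is chosen here because it is less prone to cancellation error than the one-pass sum-of-squares formula.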
int bipartition(int dim, int i0, int im, double *data, int chosen_dim,
int cluster_start[2], int cluster_size[2],
double *cluster_bdry[2], double *cluster_centroid[2],
short *cluster_assign) ;
/************************************************************************
chosen_dim -- the dimension chosen as the one along which the input data subset
(from data index i0 to im-1) is to be partitioned.

Array sizes:
data[ndata*dim] -- same as in kdtree().
cluster_start[2]-- stores the index of the starting datum of a cluster.
For example, for k=0 or 1, cluster_start[k] stores the index
of the starting datum of the k-th cluster, when all data in
the same cluster are stored consecutively in data[].
cluster_size[2]-- stores the number of the data items of each cluster,
for example, cluster_size[k] stores the number of the
data items of the k-th cluster, for k=0 or 1.
cluster_bdry[2][2*dim] -- For each k = 0 or 1,
cluster_bdry[k][2*j], cluster_bdry[k][2*j+1]
store the min and max of all j-th dimension data
for the k-th cluster, for j=0, 1, ..., dim-1.
cluster_centroid[2][dim] -- For each k = 0 or 1, cluster_centroid[k][0],
cluster_centroid[k][1], ..., cluster_centroid[k][dim-1]
store the centroid of the k-th cluster.
cluster_assign[ndata] -- stores the index of the cluster each datum is
assigned to. For example, cluster_assign[i] stores
the index of the cluster the i-th datum belongs to.
**********************************************************************************/
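
The boundary and centroid layouts described above can be illustrated with a small sketch that fills both arrays for one cluster. The helper name cluster_stats() is hypothetical and not part of the interface. It assumes the cluster occupies the consecutive rows i0 .. im-1 of data[], that bdry points to 2*dim doubles, and that centroid points to dim doubles.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original interface).
   For the data points i0 .. im-1, stores the min and max of the j-th
   coordinate in bdry[2*j] and bdry[2*j+1], and the mean of the j-th
   coordinate in centroid[j]. Returns the number of points processed;
   the output arrays are left untouched if the range is empty. */
int cluster_stats(int dim, int i0, int im, const double *data,
                  double *bdry, double *centroid)
{
    int n = im - i0;

    if (n <= 0)
        return 0;

    /* Initialize min and max from the first point of the range. */
    for (int j = 0; j < dim; j++) {
        double v = data[(size_t)i0 * dim + j];
        bdry[2 * j] = v;
        bdry[2 * j + 1] = v;
        centroid[j] = 0.0;
    }

    /* Single pass: update min/max and accumulate coordinate sums. */
    for (int i = i0; i < im; i++) {
        for (int j = 0; j < dim; j++) {
            double v = data[(size_t)i * dim + j];
            if (v < bdry[2 * j])
                bdry[2 * j] = v;
            if (v > bdry[2 * j + 1])
                bdry[2 * j + 1] = v;
            centroid[j] += v;
        }
    }

    for (int j = 0; j < dim; j++)
        centroid[j] /= n;

    return n;
}
```

Once a subset has been split in two, a helper of this kind could be called once per half, for example with cluster_bdry[k] and cluster_centroid[k] for k = 0 and 1.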

int search_kdtree(int dim, int ndata, double *data, int kk,
                  int *cluster_start, int *cluster_size, double **cluster_bdry,
                  double *query_pt, double *result_pt);

/********************************************************************************
search_kdtree() returns the number of data points whose distances to query_pt
were computed.

Array sizes:
query_pt[dim] -- the query point, to which the closest data point from the
dataset is to be found.
result_pt[dim]-- the closest data point to the query point.

Algorithm Skeleton:
1. Find the “closest cluster” to the query_pt.
Note: As discussed in earlier classes, if the query_pt lies within the boundary of a
cluster, the distance between the query and the cluster should be zero;
but if the query lies outside the boundary of a cluster, you need to define
the distance from cluster_bdry to the query_pt.
2. From the “closest” cluster, find the closest datapoint to the query_pt. Denote
the distance from the closest datapoint to the query_pt as d_min.
3. Use d_min to eliminate clusters whose “distances” to the query_pt are larger
than d_min.
4. For the remaining clusters whose distances are less than or equal to d_min,
what needs to be done?
**********************************************************************************/
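
Step 1 of the skeleton above leaves the cluster-to-query distance open. One common choice, shown here only as a sketch and not necessarily the definition intended in class, is the squared Euclidean distance from query_pt to the axis-aligned box described by one row of cluster_bdry. It is zero whenever the query lies inside the box. The helper name box_dist2() is hypothetical and not part of the interface.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original interface).
   Returns the squared Euclidean distance from query_pt[dim] to the
   axis-aligned box given by bdry[2*dim], where bdry[2*j] is the min and
   bdry[2*j+1] is the max along dimension j. The result is 0 if the
   query point lies inside the box or on its boundary. */
double box_dist2(int dim, const double *bdry, const double *query_pt)
{
    double d2 = 0.0;

    for (int j = 0; j < dim; j++) {
        double lo = bdry[2 * j];
        double hi = bdry[2 * j + 1];
        double q = query_pt[j];
        double diff = 0.0;

        /* Only coordinates outside [lo, hi] contribute to the distance. */
        if (q < lo)
            diff = lo - q;
        else if (q > hi)
            diff = q - hi;

        d2 += diff * diff;
    }
    return d2;
}
```

With this definition, the cluster k that minimizes box_dist2(dim, cluster_bdry[k], query_pt) would be the "closest cluster". If d_min is kept as a plain (not squared) distance, compare box_dist2() against d_min*d_min when eliminating clusters in step 3, so that no square roots are needed.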
