
APPLICATION-DRIVEN PROCESSING AND RETRIEVAL

DISSERTATION

Presented in Partial Fulﬁllment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Michael Albert Gibas, B.S., M.S.

* * * * *

The Ohio State University

2008

Dissertation Committee:

Hakan Ferhatosmanoglu, Adviser

Hui Fang

Atanas Rountev

Approved by

Adviser

Graduate Program in

Computer Science and

Engineering

ABSTRACT

The proliferation of massive data sets across many domains and the need to gain meaningful insights from them highlight the need for advanced data retrieval techniques. Because I/O cost dominates the time required to answer a query,

sequentially scanning the data and evaluating each data object against query crite-

ria is not an eﬀective option for large data sets. An eﬀective solution should require

reading as small a subset of the data as possible and should be able to address general

query types. Access structures built over single attributes also may not be eﬀective

because 1) they may not yield performance that is comparable to that achievable by

an access structure that prunes results over multiple attributes simultaneously and 2)

they may not be appropriate for queries with results dependent on functions involv-

ing multiple attributes. Indexing a large number of dimensions is also not eﬀective,

because either too many subspaces must be explored or the index structure becomes

too sparse at high dimensionalities. The key is to ﬁnd solutions that allow for much of

the search space to be pruned while avoiding this 'curse of dimensionality'. This thesis pursues query performance enhancement using two primary means: 1) processing

the query eﬀectively based on the characteristics of the query itself and 2) physically

organizing access to data based on query patterns and data characteristics. Query

performance enhancements are described in the context of several novel applications


including 1) Optimization Queries, which presents an I/O-optimal technique to an-

swer queries when the objective is to maximize or minimize some function over the

data attributes, 2) High-Dimensional Index Selection, which oﬀers a cost-based ap-

proach to recommend a set of low dimensional indexes to eﬀectively address a set

of queries, and 3) Multi-Attribute Bitmap Indexes, which describes extensions to a

traditionally single-attribute query processing and access structure framework that

enable improved query performance.


ACKNOWLEDGMENTS

First of all, I would like to thank my wife, Moni Wood, for her love and support.

She has made many sacriﬁces and expended much eﬀort to help make this dream a

reality. I would not have been able to complete this dissertation without her.

I also would not have been able to accomplish this without the help of my family.

I thank my mother Marilyn, my sister Susan and my grandmother Margaret for their

support. I thank my late father Murray for providing an example to which I have

always aspired. Moni’s family was also understanding and supportive of the travails

of Ph.D. research.

I want to express my appreciation to my adviser, Hakan Ferhatosmanoglu, for

his guidance and encouragement over the years. He has been generous with his

time, eﬀort, and advice during my graduate studies. I am thankful for the technical

discussions we have had and the opportunities he has given me to work on interesting

projects.

Many people in the department have helped me to get to this point. I speciﬁcally

want to thank my dissertation committee members, Dr. Atanas Rountev and Dr.

Hui Fang, for their eﬀorts and valuable comments on my research. I wish to thank

Dr. Bruce Weide for his help and support during my studies, Dr. Rountev for my

early exposure to research and paper writing, and Dr. Jeﬀ Daniels for giving me the

opportunity to work on an interdisciplinary geo-physics project.


Many friends and members of the Database Research Group provided valuable

guidance, discussions, and critiques over the past several years. I want to especially

thank Guadalupe Canahuate for her help and friendship during this time. Also thanks

to her family for providing enough Dominican coﬀee to get through all the paper

deadlines.

My Ph.D. research was supported by NSF grant IIS-0546713.


VITA

May 10, 1969 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Omaha, Nebraska

1993 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Electrical Engineering,

Ohio State University, Columbus Ohio

2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Computer Science and Engineering,

Ohio State University, Columbus Ohio

1989-1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Electrical Engineering Co-op,

British Petroleum,

Various Locations

1994-2002 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Project Manager,

Acquisition Logistics Engineering,

Worthington, Ohio

2003-present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Assistant,

Oﬃce Of Research,

Department of Computer Science and

Engineering,

Department of Geology,

Ohio State University,

Columbus, Ohio

PUBLICATIONS

Research Publications

Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Online Index Recommendations for High-Dimensional Databases Using Query Workloads. IEEE Transactions on Knowledge and Data Engineering (TKDE), February 2008, Vol. 20, No. 2, pp. 246-260


Michael Gibas, Hakan Ferhatosmanoglu, High-Dimensional Indexing, Encyclopedia of

Geographical Information Science, (E-GIS)

Michael Gibas, Ning Zheng, and Hakan Ferhatosmanoglu, A General Framework for

Modeling and Processing Optimization Queries, 33rd International Conference on

Very Large Data Bases (VLDB 07), Vienna, Austria, September, 2007, pp. 1069-

1080

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu, Update Conscious

Bitmap Indexes. International Conference on Scientiﬁc and Statistical Database

Management (SSDBM), Banff, Canada, July 2007

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu, Indexing Incomplete

Databases, International Conference on Extending Database Technology (EDBT),

Munich, Germany, March 2006, pp. 884-901

Michael Gibas, Developing Indices for High Dimensional Databases Using Query Access Patterns, Master's Thesis, June 2004

Atanas Rountev, Scott Kagan, and Michael Gibas, Static and Dynamic Analysis of

Call Chains in Java, ACM SIGSOFT International Symposium on Software Testing

and Analysis (ISSTA’04), pages 1-11, July 2004

Fatih Altiparmak, Michael Gibas, Hakan Ferhatosmanoglu, Relationship Preserving

Feature Selection for Unlabelled Time-Series, Ohio State Technical Report, OSU-

CISRC-10/07-TR74, 14 pp.

Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Indexes for Databases

with Missing Data, Ohio State Technical Report, OSU-CISRC-4/06-TR45, 15 pp.

Guadalupe Canahuate, Tan Apaydin, Michael Gibas, Hakan Ferhatosmanoglu, Fast

Semantic-based Similarity Searches in Text Repositories, Conference on Using Metadata to Manage Unstructured Text, LexisNexis, Dayton, OH, October 2005

Atanas Rountev, Scott Kagan, and Michael Gibas, Evaluating the Imprecision of

Static Analysis, ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Soft-

ware Tools and Engineering (PASTE’04), pages 14-16, June 2004

FIELDS OF STUDY


Major Field: Computer Science and Engineering


TABLE OF CONTENTS

Page

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Chapters:

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Overall Technical Motivation . . . . . . . . . . . . . . . . . . . . . 1

1.2 Novel Database Management Applications . . . . . . . . . . . . . . 2

2. Modeling and Processing Optimization Queries . . . . . . . . . . . . . . 6

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Query Model Overview . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Background Deﬁnitions . . . . . . . . . . . . . . . . . . . . 10

2.3.2 Common Query Types Cast into Optimization Query Model 13

2.3.3 Optimization Query Applications . . . . . . . . . . . . . . . 15

2.3.4 Approach Overview . . . . . . . . . . . . . . . . . . . . . . 16

2.4 Query Processing Framework . . . . . . . . . . . . . . . . . . . . . 19

2.4.1 Query Processing over Hierarchical Access Structures . . . . 21

2.4.2 Query Processing over Non-hierarchical Access Structures . 26

2.4.3 Non-Covered Access Structures . . . . . . . . . . . . . . . . 28

2.5 I/O Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 33

2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


3. Online Index Recommendations for High-Dimensional Databases using

Query Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.2.1 High-Dimensional Indexing . . . . . . . . . . . . . . . . . . 54

3.2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2.3 Index Selection . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2.4 Automatic Index Selection . . . . . . . . . . . . . . . . . . . 57

3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 57

3.3.2 Approach Overview . . . . . . . . . . . . . . . . . . . . . . 59

3.3.3 Proposed Solution for Index Selection . . . . . . . . . . . . 61

3.3.4 Proposed Solution for Online Index Selection . . . . . . . . 71

3.3.5 System Enhancements . . . . . . . . . . . . . . . . . . . . . 76

3.4 Empirical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 78

3.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 80

3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4. Multi-Attribute Bitmap Indexes . . . . . . . . . . . . . . . . . . . . . . . 98

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries 110

4.3.2 Query Execution with Multi-Attribute Bitmaps . . . . . . . 115

4.3.3 Index Updates . . . . . . . . . . . . . . . . . . . . . . . . . 120

4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 121

4.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 123

4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136


LIST OF FIGURES

Figure Page

2.1 Sample Optimization Objective Function and Constraints F(θ, A): (x − 1)² + (y − 2)², o(min), g₁(α, A): y − 6 ≤ 0, g₂(α, A): −y + 3 ≤ 0, h₁(β, A): x − 5 = 0, k = 1 . . . . . . . . . . . . 12

2.2 Optimization Query System Functional Block Diagram . . . . . . . . 20

2.3 Example of Hierarchical Query Processing . . . . . . . . . . . . . . . 25

2.4 Cases of MBRs with respect to R and optimal objective function contour 30

2.5 Cases of partitions with respect to R and optimal objective function

contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.6 Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,

100k pts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.7 CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,

100k pts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.8 Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts 38

2.9 Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree,

8-D Color Histogram, 100k pts . . . . . . . . . . . . . . . . . . . . . . 40

2.10 Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform

Random, 50k Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.11 Page Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram,

100k Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44


3.1 Index Selection Flowchart . . . . . . . . . . . . . . . . . . . . . . . . 61

3.2 Dynamic Index Analysis Framework . . . . . . . . . . . . . . . . . . . 73

3.3 Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin,

Naive, and the Proposed Index Selection Technique using Relaxed Con-

straints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

3.4 Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, synthetic Query Workload . . . . . . . . . . . . . . . . . . . 85

3.5 Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, clinical Query Workload . . . . . . . . . . . . . . . . . . . . 86

3.6 Baseline Comparative Cost of Adaptive versus Static Indexing, stock

Dataset, hr Query Workload . . . . . . . . . . . . . . . . . . . . . . . 87

3.7 Comparative Cost of Online Indexing as Support Changes, random

Dataset, synthetic Query Workload . . . . . . . . . . . . . . . . . . . 89

3.8 Comparative Cost of Online Indexing as Support Changes, stock Dataset,

hr Query Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

3.9 Comparative Cost of Online Indexing as Major Change Threshold

Changes, random Dataset, synthetic Query Workload . . . . . . . . . 92

3.10 Comparative Cost of Online Indexing as Major Change Threshold

Changes, stock Dataset, hr Query Workload . . . . . . . . . . . . . . 93

3.11 Comparative Cost of Online Indexing as Multi-Dimensional Histogram

Size Changes, random Dataset, synthetic Query Workload . . . . . . 94

4.1 A WAH bit vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

4.2 Query Execution Example over WAH compressed bit vectors. . . . . 103

4.3 Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps,

2M entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.4 Query Execution Times vs Number of Attributes queried, Point Queries,

2M entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106


4.5 Simple multiple attribute bitmap example showing the combination of

two single attribute bitmaps. . . . . . . . . . . . . . . . . . . . . . . . 108

4.6 Query Execution Time for each query type. . . . . . . . . . . . . . . 125

4.7 Number of bitmaps involved in answering each query type. . . . . . . 126

4.8 Total size of the bitmaps involved in answering each query type. . . . 127

4.9 Query Execution Times, Point Queries, Traditional versus Multi-Attribute

Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

4.10 Query Execution Times as Query Ranges Vary, Landsat Data . . . . 129

4.11 Single and Multi-Attribute Bitmap Index Size Comparison . . . . . . 130


CHAPTER 1

Introduction

If Edison had a needle to ﬁnd in a haystack, he would proceed at once

with the diligence of the bee to examine straw after straw until he found

the object of his search... I was a sorry witness of such doings, knowing

that a little theory and calculation would have saved him ninety per cent

of his labor. - Nikola Tesla, 1931

In order to make better use of the massive quantities of data being generated by

and for modern applications, it is necessary to ﬁnd information that is in some respect

interesting within the data. Two primary challenges of this task are specifying the

meaning of interesting within the context of an application and searching the data

set to ﬁnd the interesting data eﬃciently. In the case of our proverbial haystack, we

can think of each piece of straw as a data object and needles as data objects that

are interesting. While we could guarantee that we ﬁnd the most interesting needle by

examining each and every piece of straw, we would be much better served by limiting

our search to a small region within the haystack.

1.1 Overall Technical Motivation

Analogous to our physical haystack example, the query response time during data

exploration tasks using a database system is aﬀected both by the physical organization


of the data and by the way in which the data is accessed. As such, query performance

can be enhanced by traversing the data more eﬀectively during query resolution or

by organizing access to the data so that it is more suitable for the query. Depending

on the application, queries can vary widely from one query to the next. Therefore, it is

highly desirable to process generic queries at run-time in a way that is eﬀective for

each particular query. Physically reorganizing data or data access structures is time-consuming and therefore should be performed for a set of queries rather than by

trying to optimize an organization for a single query.

This thesis targets query performance enhancement through two sources: 1) run-time query processing such that the search paths and intermediate operations during

query resolution are appropriate for a given generic query, and 2) data organization such

that the access structures available during query resolution are appropriate for the

overall query patterns and data distribution and allow for signiﬁcant data space prun-

ing during query resolution. Query performance enhancement is explored using three

introduced applications, motivated and described in the next section.

1.2 Novel Database Management Applications

Optimization Queries. Exploration of image, scientiﬁc, and business data usu-

ally requires iterative analyses that involve sequences of queries with varying parame-

ters. Besides traditional queries, similarity-based analyses and complex mathematical

and statistical operators are used to query such data [41]. Many of these data ex-

ploration queries can be cast as optimization queries where a linear or non-linear

function over object attributes is minimized or maximized (our needle). Addition-

ally, the object attributes queried, function parametric values, and data constraints


can vary on a per-query basis. Because scientiﬁc data exploration and business deci-

sion support systems rely heavily on function optimization tasks, and because such

tasks encompass such disparate query types with varying parameters, these applica-

tion domains can beneﬁt from a general model-based optimization framework. Such

a framework should meet the following requirements: (i) have the expressive power

to support solution of each query type, (ii) take advantage of query constraints to

prune search space, and (iii) maintain eﬃcient performance when the scoring criteria

or optimization function changes.

A general optimization model and query processing framework are proposed for

optimization queries, an important and signiﬁcant subset of data exploration tasks.

The model covers those queries where some function is being minimized or maximized

over object attributes, possibly within a set of constraints. The query processing

framework can be used in those cases where the function being optimized and the

set of constraints is convex. For access structures where the underlying partitions

are convex, the query processing framework is used to access the most promising

partitions and data objects and is proven to be I/O-optimal.
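The model just outlined, an objective to minimize or maximize, optional constraints, and an answer set size k, can be illustrated with a minimal Python sketch. The class and method names here are illustrative assumptions, not from the dissertation, and the `answer` method is precisely the sequential-scan baseline the indexed framework is designed to avoid:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Point = Sequence[float]

@dataclass
class OptimizationQuery:
    # Objective F with its parameters theta already bound in.
    objective: Callable[[Point], float]
    minimize: bool = True                # optimization objective o
    # Inequality constraints g_i(x) <= 0 and equality constraints h_j(x) = 0.
    inequality: List[Callable[[Point], float]] = field(default_factory=list)
    equality: List[Callable[[Point], float]] = field(default_factory=list)
    k: int = 1                           # answer set size

    def feasible(self, x: Point, tol: float = 1e-9) -> bool:
        return (all(g(x) <= tol for g in self.inequality) and
                all(abs(h(x)) <= tol for h in self.equality))

    def answer(self, data: List[Point]) -> List[Point]:
        # Sequential-scan baseline: evaluate every object, keep the
        # feasible ones, and sort by objective value.
        candidates = [x for x in data if self.feasible(x)]
        candidates.sort(key=self.objective, reverse=not self.minimize)
        return candidates[:self.k]
```

A weighted linear top-k query, for example, is just this model with a weighted sum as the objective and maximization as the direction; an NN query is the same model with a distance function and minimization.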

High-Dimensional Index Selection. With respect to the physical organization

of the data, access structures or indexes that cover query attributes can be used

to prune much of the data space and greatly reduce the required number of page

accesses to answer the query. However, due to the oft-cited curse of dimensionality,

indexes built over many dimensions actually perform worse than sequential scan. This

problem can be addressed by using sets of lower dimensional indexes that eﬀectively

answer a large portion of the incoming queries. This approach can work well if the

characteristics of future queries are well known, but in general this is not the case,


and an index set built based on historical queries may be completely ineﬀective for

new queries.

In the case of physical organization, query performance is aﬀected by creating a

number of low dimension indexes using the historical and incoming query workloads,

and data statistics. Historical query patterns are data mined to ﬁnd those attribute

sets that are frequently queried together. Potential indexes are evaluated to derive

an initial index set recommendation that is eﬀective in reducing query cost for a large

portion of queries such that the set of indexes is within some indexing constraint.

A set of indexes is obtained that in eﬀect reduces the dimensionality of the query

search space, and is eﬀective in answering queries. A control feedback mechanism

is incorporated so that the performance of newly arriving queries can be monitored,

and if the current index set is no longer eﬀective for the current query patterns, the

index set can be changed. This flexible framework can be tuned to maximize indexing accuracy or indexing speed, which allows it to be used for the initial static index recommendation and also for online dynamic index recommendation.
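The first step of this pipeline, mining the workload for attribute sets that are frequently queried together, can be sketched as a plain support count. This is an illustrative simplification, not the dissertation's full cost-based evaluation, and all names are hypothetical:

```python
from collections import Counter
from itertools import combinations

def frequent_attribute_sets(workload, min_support, max_size=3):
    """Mine a query workload for attribute combinations that are
    frequently queried together.

    workload: one set of attribute names per historical query.
    Returns {frozenset(attributes): count} for every combination of up
    to max_size attributes whose support meets min_support; these are
    candidate low-dimensional indexes for further cost evaluation."""
    support = Counter()
    for attrs in workload:
        for size in range(1, min(len(attrs), max_size) + 1):
            for combo in combinations(sorted(attrs), size):
                support[frozenset(combo)] += 1
    return {s: n for s, n in support.items() if n >= min_support}
```

In the full framework, each surviving candidate would then be scored against the workload and data statistics, and the feedback loop would rerun the mining step when query patterns drift.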

Multi-Attribute Bitmap Indexes. The nature of the database management

system used and the way in which data is physically accessed also affects query execution times. Bitmap indexes have been shown to be effective as a data management system for many domains, such as data warehousing and scientific applications. This owes to the fact that they are a column-based storage structure, so only the data needed to address a query is ever retrieved from disk, and that their base operations are performed using hardware-supported fast bitwise operations. Static bitmap indexes have traditionally been built over each attribute individually. The query model for

selection queries is simple and involves ORing together bitcolumns that match query


criteria within an attribute and ANDing inter-attribute intermediate results. As a re-

sult, the bitmap index structure is limited to those domains that it is naturally suited

for without change. Moreover, query resolution does not take advantage of potential

multi-attribute pruning and possible reductions in the number of binary operations.
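The OR-within-attribute, AND-across-attributes model just described can be sketched with Python integers standing in for uncompressed bit vectors. This is an illustrative toy, not the dissertation's implementation; production bitmap indexes such as the WAH-compressed ones examined in Chapter 4 operate on compressed words:

```python
def evaluate_selection(index, criteria):
    """Resolve a selection query over single-attribute bitmap indexes.

    index:    {attribute: {value: bit_vector}} where each bit_vector is
              a Python int and bit i set means row i takes that value.
    criteria: {attribute: [requested values]}.

    Traditional query model: OR the bit vectors of the matching values
    within each attribute, then AND the per-attribute results."""
    per_attribute = []
    for attr, values in criteria.items():
        ored = 0
        for v in values:
            ored |= index[attr].get(v, 0)
        per_attribute.append(ored)
    if not per_attribute:
        return 0
    result = per_attribute[0]
    for bv in per_attribute[1:]:
        result &= bv
    return result
```

A multi-attribute bitmap would replace the final AND chain with a single lookup over a pre-combined attribute pair, cutting both the number of binary operations and the total bitmap size read.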

Multi-attribute bitmap indexes are introduced to extend the functionality and

applicability of traditional bitmap indexes. For many queries, query execution times

are improved by reducing the size of the bitmaps used in query execution and by reducing the

number of binary operations required to resolve the query without excessive overhead

for queries that do not beneﬁt from multi-attribute pruning. A query processing

model where multi-attribute bitmaps are available for query resolution is presented.

Methods to determine attribute combination candidates are provided.


CHAPTER 2

Modeling and Processing Optimization Queries

The most exciting phrase to hear in science, the one that heralds new

discoveries, is not ‘Eureka!’ (I found it!) but ‘That’s funny ...’ - Isaac

Asimov

One challenge associated with data management is determining a cost-eﬀective

way to process a given query. Given the time expense of I/O, minimizing the I/O

required to answer a query is a primary strategy in query optimization. Scientiﬁc

discovery and business intelligence activities are frequently exploring for data objects

that are in some way special or unexpected. Such data exploration can be performed

by iteratively ﬁnding those data objects that minimize or maximize some function over

the data attributes. While the answer to a given query can be found by computing

the function for all data objects and ﬁnding those objects that yield the most ex-

treme values, this approach may not be feasible for large data sets. The optimization

query framework presented in this chapter provides a method to generically handle

an important broad class of queries, while ensuring that, given an access structure,

the query is I/O-optimally processed over that structure.


2.1 Introduction

Motivation and Goal Optimization queries form an important part of real-

world database system utilization, especially for information retrieval and data anal-

ysis. Exploration of image, scientiﬁc, and business data usually requires iterative

analyses that involve sequences of queries with varying parameters. Besides traditional queries, similarity-based analyses and complex mathematical and statistical

operators are used to query such data [41]. Many of these data exploration queries

can be cast as optimization queries where a linear or non-linear function over object

attributes is minimized or maximized. Additionally, the object attributes queried,

function parametric values, and data constraints can vary on a per-query basis.

There is a rich body of literature on processing specific types of queries, such

as the query processing algorithms for Nearest Neighbors (NN) [48] and its variants

[37], top-k queries [21, 51], etc. These queries can all be considered to be a type of

optimization query with diﬀerent objective functions. However, they either only deal

with speciﬁc function forms or require strict properties for the query functions.

Because scientiﬁc data exploration and business decision support systems rely

heavily on function optimization tasks, and because such tasks encompass such dis-

parate query types with varying parameters, these application domains can beneﬁt

from a general model-based optimization framework. Such a framework should meet

the following requirements: (i) have the expressive power to support existing query

types and the power to formulate new query types, (ii) take advantage of query con-

straints to proactively prune search space, and (iii) maintain eﬃcient performance

when the scoring criteria or optimization function changes.


Despite an extensive list of techniques designed speciﬁcally for diﬀerent types

of optimization queries, a uniﬁed query technique for a general optimization model

has not been addressed in the literature. Furthermore, application of specialized

techniques requires that the user know of, possess, and apply the appropriate tools

for the speciﬁc query type. We propose a generic optimization model to deﬁne a wide

variety of query types that includes simple aggregates, range and similarity queries,

and complex user-deﬁned functional analysis. Since the queries we are considering

do not have a speciﬁc objective function but only have a “model” for the objective

(and constraints) with parameters that vary for diﬀerent users/queries/iterations, we

refer to them as model-based optimization queries. In conjunction with a technique to

model optimization query types, we also propose a query processing technique that

optimally accesses data and space partitioning structures to answer the query. By

applying this single optimization query model with I/O-optimal query processing,

we can provide the user with a ﬂexible and powerful method to deﬁne and execute

arbitrary optimization queries eﬃciently. Example optimization query applications

are provided in Section 2.3.3.

Contributions The primary contributions of this work are listed as follows.

• We propose a general framework to execute convex model-based optimization

queries (using linear and non-linear functions) and for which additional con-

straints over the attributes can be speciﬁed without compromising query accu-

racy or performance.

• We deﬁne query processing algorithms for convex model-based optimization

queries utilizing data/space partitioning index structures without changing the


index structure properties and the ability to eﬃciently address the query types

for which the structures were designed.

• We prove the I/O optimality for these types of queries for data and space

partitioning techniques when the objective function is convex and the feasible

region deﬁned by the constraints is convex (these queries are deﬁned as CP

queries in this paper).

• We introduce a generic model and query processing framework that optimally

addresses existing optimization query types studied in the research literature as

well as new types of queries with important applications.

2.2 Related Work

There are few published techniques that cover some instances of model-based

optimization queries. One is the “Onion technique” [14], which deals with an un-

constrained and linear query model. An example linear model query selects the best

school by ranking the schools according to some linear function of the attributes of

each school. The solution followed in the Onion technique is based on constructing

convex hulls over the data set. It is infeasible for high dimensional data because the

computation of convex hulls is exponential with respect to the number of dimensions

and is extremely slow. It is also not applicable to constrained linear queries because

convex hulls need to be recomputed for each query with a diﬀerent set of constraints.

The computation of convex hulls over the whole data set is expensive and in the case

of high-dimensional data, infeasible in design time.

The Prefer technique [32] is used for unconstrained and linear top-k query types.

The access structure is a sorted list of records for an arbitrary linear function. Query


processing is performed for some new linear preference function by scanning the sorted

list until a record is reached such that no record below it could match the new query.

This avoids the costly build time associated with the Onion technique, but does not

provide a guarantee on minimum I/O, and still does not incorporate constraints.

Boolean + Ranking [60] oﬀers a technique to query a database to optimize boolean

constrained ranking functions. However, it is limited to boolean rather than general

convex constraints, addresses only single-dimension access structures (i.e., it cannot optimize over multiple dimensions simultaneously), and therefore largely uses heuristics

to determine a cost eﬀective search strategy.

While the Onion and Prefer techniques organize data to eﬃciently answer a speciﬁc

type of query (LP), our technique organizes data retrieval to answer more general

queries. In contrast with the other techniques listed, our proposed solution is proven

to be I/O-optimal for both hierarchical and space partitioned access structures with

convex partition boundaries, covers a broader spectrum of optimization queries, and

requires no additional storage space.

2.3 Query Model Overview

2.3.1 Background Deﬁnitions

Let Re(a₁, a₂, . . . , aₙ) be a relation with attributes a₁, a₂, . . . , aₙ. Without loss of generality, assume all attributes have the same domain. Denote by A a subset of attributes of Re. Let gᵢ(α, A), 1 ≤ i ≤ u, hⱼ(β, A), 1 ≤ j ≤ v, and F(θ, A) be certain functions over attributes in A, where u and v are positive integers, θ = (θ₁, θ₂, . . . , θₘ) is a vector of parameters for function F, and α and β are vector parameters for the constraint functions.


Definition 1 (Model-Based Optimization Query) Given a relation Re(a_1, a_2, ..., a_n), a model-based optimization query Q is given by (i) an objective function F(θ, A) with an optimization objective o (min or max), (ii) a set of constraints (possibly empty) specified by inequality constraints g_i(α, A) ≤ 0, 1 ≤ i ≤ u (if not empty) and equality constraints h_j(β, A) = 0, 1 ≤ j ≤ v (if not empty), and (iii) user/query-adjustable objective parameters θ, constraint parameters α and β, and an answer set size integer k.

Figure 2.1 shows a sample objective function with both inequality and equality constraints. The objective function finds the nearest neighbor to the point a = (1, 2). The constraints limit the feasible region to the segment of the line x = 5 where 3 ≤ y ≤ 6. The points b and d represent the best and worst points within the constraints for the objective function, respectively. Point c could be the actual point in the database that meets the constraints and minimizes the objective function.
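As a concrete check of this example, the continuous optimum (point b) can be computed with an off-the-shelf solver; a minimal sketch using SciPy's `scipy.optimize.minimize`, with the objective and constraints taken directly from Figure 2.1 (the starting point is an arbitrary choice of ours):

```python
from scipy.optimize import minimize

# Objective of Figure 2.1: squared distance to the query point (1, 2).
objective = lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2

# SciPy's "ineq" convention is fun(x) >= 0, so each g <= 0 becomes -g >= 0.
constraints = [
    {"type": "eq",   "fun": lambda p: p[0] - 5},   # h_1: x - 5 = 0
    {"type": "ineq", "fun": lambda p: 6 - p[1]},   # g_1: y - 6 <= 0
    {"type": "ineq", "fun": lambda p: p[1] - 3},   # g_2: -y + 3 <= 0
]

res = minimize(objective, x0=[4.0, 4.0], constraints=constraints)
# res.x is the continuous optimum b = (5, 3); res.fun is its objective value 17.
```

The solver confirms that the best feasible point in continuous space is b = (5, 3): with x pinned to 5, the unconstrained minimizer y = 2 is clamped up to the boundary y = 3.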

The answer to a model-based optimization query is a set of tuples with maximum cardinality k satisfying all the constraints such that there exists no other tuple that also satisfies all constraints and has a smaller (if minimization) or larger (if maximization) objective function value F. This definition is very general, and almost any type of query can be considered a special case of a model-based optimization query. For instance, NN queries over an attribute set A can be considered model-based optimization queries with F(θ, A) as the distance function (e.g., Euclidean) and minimization as the optimization objective. Similarly, top-k queries, weighted NN queries, and linear/non-linear optimization queries can all be considered specific model-based optimization queries. Without loss of generality, we will consider minimization as the optimization objective throughout the rest of this chapter.


Figure 2.1: Sample Optimization Objective Function and Constraints. F(θ, A): (x − 1)^2 + (y − 2)^2, o(min), g_1(α, A): y − 6 ≤ 0, g_2(α, A): −y + 3 ≤ 0, h_1(β, A): x − 5 = 0, k = 1

Definition 2 (Convex Optimization (CP) Query) A model-based optimization query Q is a Convex Optimization (CP) query if (i) F(θ, A) is convex, (ii) g_i(α, A), 1 ≤ i ≤ u (if not empty) are convex, and (iii) h_j(β, A), 1 ≤ j ≤ v (if not empty) are linear or affine.

Notice that the definition of a CP query makes no assumptions about the specific form of the objective function. The only assumptions are that the queries can be formulated as convex functions and that the feasible regions defined by the constraints of the queries are convex (e.g., polyhedral regions). Therefore, users can ask any form of query with any coefficients as long as these assumptions hold. Conditions (i), (ii), and (iii) form a well-known type of problem in the Operations Research literature: Convex Optimization problems. A twice-differentiable function is convex if and only if its Hessian matrix is positive semidefinite. More intuitively, if one travels in a straight line from inside a convex region to outside the region, it is not possible to re-enter the region.
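The Hessian criterion can be checked numerically for a given function; a small illustrative sketch in Python (the finite-difference helper and the test point are our own, not part of the thesis framework), using the quadratic of Figure 2.1 as the test function:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Central-difference estimate of the Hessian of f at point x."""
    n = len(x)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

f = lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2    # the quadratic of Figure 2.1
H = numerical_hessian(f, np.array([0.0, 0.0]))
# Positive semidefinite Hessian (all eigenvalues >= 0, up to numerical noise)
# at every point implies convexity; here H is constant, so one point suffices.
is_convex_here = bool(np.all(np.linalg.eigvalsh(H) >= -1e-6))
```

For this quadratic the Hessian is the constant matrix diag(2, 2), so the eigenvalue test passes everywhere.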

2.3.2 Common Query Types Cast into Optimization Query

Model

Although the functions F, g_i, and h_j may appear restricted, the set of CP problems is a superset of all least-squares problems, linear programming (LP) problems, and convex quadratic programming (QP) problems [27]. Therefore, CP queries cover a wide variety of linear and non-linear queries, including NN queries, top-k queries, linear optimization queries, and angular similarity queries. The formulations of some common query types as CP optimization queries are listed and discussed below.

Euclidean Weighted Nearest Neighbor Queries - The Weighted Nearest Neighbor query asking for the NN of a point (a^0_1, a^0_2, ..., a^0_n) can be considered an optimization query asking for a point (a_1, a_2, ..., a_n) such that an objective function is minimized. For Euclidean WNN the objective function is

WNN(a_1, a_2, ..., a_n) = w_1(a_1 − a^0_1)^2 + w_2(a_2 − a^0_2)^2 + ... + w_n(a_n − a^0_n)^2,

where w_1, w_2, ..., w_n > 0 are weights that can be different for each query. Traditional Nearest Neighbor queries are the subset of weighted nearest neighbor queries in which all weights equal 1.
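A brute-force evaluation of this objective is a useful reference against which an index-based WNN implementation can be validated; the points, query, and weights below are illustrative only:

```python
def wnn_score(p, q, w):
    """Weighted squared Euclidean distance between data point p and query q."""
    return sum(wi * (pi - qi) ** 2 for wi, pi, qi in zip(w, p, q))

points = [(1.0, 5.0), (4.0, 4.0), (2.0, 1.0)]     # illustrative data set
query, weights = (2.0, 2.0), (1.0, 3.0)           # per-query weights w_i > 0
best = min(points, key=lambda p: wnn_score(p, query, weights))
# With all weights equal to 1 this reduces to the traditional NN query.
```

Because the second attribute is weighted 3x, the point (2, 1), which matches the query exactly on the first attribute, wins here.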

Linear Optimization Queries - The objective function of a Linear Optimization query is

L(a_1, a_2, ..., a_n) = c_1 a_1 + c_2 a_2 + ... + c_n a_n,

where (c_1, c_2, ..., c_n) ∈ R^n are coefficients that can be different for different queries, and the constraints form a polyhedron. Since linear functions are convex and polyhedra are convex, linear optimization queries are also special cases of CP queries with a parametric function form.

Top-k Queries - The objective functions of top-k queries are score (or ranking) functions over the set of attributes. The common score functions are sum and average, but the score can be any arbitrary function over the attributes. Clearly, sum and average are special cases of linear optimization queries. If the ranking function in question is convex, the top-k query is a CP query.

Range Queries - A range query asks for all data points (a_1, a_2, ..., a_n) such that l_i ≤ a_i ≤ u_i for some (or all) dimensions i, where [l_i, u_i] is the range for dimension i. We construct an objective function f(a_1, a_2, ..., a_n) = C, where C is any constant, and constraints

g_i = l_i − a_i ≤ 0, g_{n+i} = a_i − u_i ≤ 0, i = 1, 2, ..., n.

Any point that satisfies the constraints receives the objective value C. Points that do not satisfy the constraints are pruned by those constraints. The points that remain all have the objective value C and, because they all have the same minimum (maximum) objective value, together they form the solution set of the range query. Therefore, range queries can be formed as special cases of CP queries with constant objective functions, applying the range limits as constraints. Also note that our model handles not only the ‘traditional’ hyper-rectangular range queries described above, but also ‘irregular’ range queries such as l ≤ a_1 + a_2 ≤ u and l ≤ 3a_1 − 2a_2 ≤ u.
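The constant-objective formulation can be exercised directly: a point belongs to the answer set iff g(point) ≤ 0 for every constraint g. A small sketch (with illustrative data) covering both a rectangular and an 'irregular' linear range:

```python
def in_range(point, constraints):
    """A point answers the range query iff g(point) <= 0 for every constraint g."""
    return all(g(point) <= 0 for g in constraints)

# Rectangular range 1 <= x <= 3, 2 <= y <= 5, written as g-constraints.
rect = [lambda p: 1 - p[0], lambda p: p[0] - 3,
        lambda p: 2 - p[1], lambda p: p[1] - 5]
# 'Irregular' linear range 0 <= x + y <= 4.
diag = [lambda p: -(p[0] + p[1]), lambda p: p[0] + p[1] - 4]

data = [(2.0, 3.0), (0.0, 0.0), (2.5, 4.5)]        # illustrative data
rect_hits = [p for p in data if in_range(p, rect)]
diag_hits = [p for p in data if in_range(p, diag)]
```

Every surviving point implicitly receives the same constant objective value, so the result set is exactly the feasible subset, as the formulation requires.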


2.3.3 Optimization Query Applications

Any problem that can be rewritten as a convex optimization problem can be

cast to our model, and solved using our framework. In cases where the optimiza-

tion problem or objective function parameters are not known in advance, this generic

framework can be used to solve the problem by eﬃciently accessing available access

structures. We provide a short list of potential optimization query applications that

could be relevant during scientiﬁc data exploration or business decision support where

the problem being optimized is continuously evolving.

Weighted Constrained Nearest Neighbors - Weighted nearest neighbor queries give the ability to assign different weights to different attribute distances in nearest neighbor queries. Weighted Constrained Nearest Neighbor (WCNN) queries find the object with the closest weighted distance that lies within some constrained space. This type of query is applicable in situations where attribute distances vary in importance and some constraints on the result set are known. A potential application is clinical trial patient matching, where an administrator is looking for the best candidate patients to fill a clinical trial. Some hard exclusion criteria are known, and some attributes are more important than others with respect to estimating participation probability and suitability.

kNN with Adaptive Scoring - In some situations we may have an objective func-

tion that changes based on the characteristics of a population set we are trying to

ﬁll. In such a case, we will change the nearest neighbor scoring weights dependent on

the current population set. As an example, again consider the patient clinical trial

matching application where we want to maintain a certain statistical distribution over

the trial participants. In a case where we want half of the participants to be male and


half female, we can adjust weights of the objective optimization function to increase

the likelihood that future trial candidates will match the currently underrepresented

gender.
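One simple way such adaptive scoring could be realized is to scale the weight on a gender-match term by how far the current cohort is from the target ratio; the attribute names and the linear boost rule below are invented for illustration:

```python
def adaptive_weights(current_cohort, target_male_ratio=0.5, base=1.0, boost=3.0):
    """Raise the weight on a gender-match attribute when one gender is
    underrepresented in the cohort assembled so far (illustrative rule)."""
    if not current_cohort:
        return {"gender_match": base}
    male_ratio = (sum(1 for p in current_cohort if p["gender"] == "M")
                  / len(current_cohort))
    off_balance = abs(male_ratio - target_male_ratio)   # 0 when balanced
    return {"gender_match": base + boost * off_balance}

cohort = [{"gender": "M"}, {"gender": "M"}, {"gender": "F"}]
w = adaptive_weights(cohort)   # males overrepresented -> stronger gender weight
```

The returned weight would then be fed into the weighted objective function of the next query, so each query in the series optimizes against the current cohort.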

Queries over Changing Attributes - The attributes involved in optimization

queries can vary based on the iteration of the query. For example, a series of opti-

mization queries may search a stock database for maximum stock gains over diﬀerent

time intervals. The attributes involved in each query will be diﬀerent.

Weights, constraints, functional attributes, and optimization functions themselves

can all change on a per-query basis. A database system that can eﬀectively handle

the potential variations in optimization queries will beneﬁt data exploration tasks.

In the examples listed above, each query or each member of a set of queries can

be rewritten as an optimization query in our model. This demonstrates the power

and ﬂexibility that the user has to deﬁne data exploration queries and the examples

represent a small subset of the query types that are possible.

2.3.4 Approach Overview

We propose the use of Convex Optimization (CP) in order to traverse access structures to find optimal data objects. The goal in CP is to optimize (minimize or maximize) some convex function over data attributes. The solution can optionally be subject to some convex criteria over the data attributes. Additionally, the data attributes may be subject to lower and upper bounds.

All of the discussed applications can be stated as an objective function with an objective of maximization or minimization. We can solve a continuous CP problem over a convex partition of space by optimizing the function of interest within the constraints of that space. Most access structures are built over convex partitions.

Particularly common partitions are Minimum Bounding Rectangles (MBRs). Rect-

angular partitions can be represented by lower and upper bounds over data attributes

and can be directly addressed by the CP problem bounds. Other convex partitions

can be addressed with data attribute constraints (such as x +y < 5). If the problem

under analysis has constraints in addition to the constraints imposed by the partition

boundaries, they can also be added to the CP problem as constraints.

Given the function, partition constraints, and problem constraints, we can use

CP to ﬁnd the optimal answer for the function within the intersection of the problem

constraints and partition constraints. If no solution exists (problem constraints and

partition constraints do not intersect), the partition is found to be infeasible for the

problem. The partition and its children can be eliminated from further consideration.

Given a function, problem constraints, and a set of partitions, the partition that yields the best CP result with respect to the function under analysis is the one that contains some space that optimizes the problem under analysis better than the other partitions.

These partition functional values can be used to order partitions according to how

promising they are with respect to optimizing the function within problem constraints.
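For a rectangular MBR the partition constraints reduce to per-variable bounds, so a generic solver can compute the per-partition value directly; an illustrative sketch with SciPy (the objective and the two MBRs are made up):

```python
from scipy.optimize import minimize

def partition_lower_bound(objective, box):
    """Best objective value achievable anywhere inside a rectangular MBR,
    given as a list of (lo, hi) bounds per dimension."""
    x0 = [(lo + hi) / 2 for lo, hi in box]     # start at the MBR center
    return minimize(objective, x0, bounds=box).fun

# Illustrative objective (squared distance to (1, 2)) and two made-up MBRs.
objective = lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2
mbrs = {"A": [(4, 6), (0, 1)], "B": [(0, 2), (1, 3)]}
bounds = {name: partition_lower_bound(objective, box)
          for name, box in mbrs.items()}
order = sorted(bounds, key=bounds.get)   # visit order: most promising first
```

Partition B contains the continuous optimum (value 0), while the best A can offer is its corner (4, 1) with value 10, so B would be explored first.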

With our technique we endeavor to minimize the I/O operations and access struc-

ture search computations required to perform arbitrary model-based optimization

queries. The access structure partitions to search and evaluate during query process-

ing are determined by solving CP problems that incorporate the objective function,

model constraints, and partition constraints. The solution of these CP problems allows us to prune partitions and search paths that cannot contain an optimal answer during query processing.


The CP literature discusses the efficiency of solving CP problems. In [42], it is

stated that a CP problem can be very eﬃciently solved with polynomial-time algo-

rithms. The readers can refer to [43] for a detailed review on interior-point polynomial

algorithms for CP problems. Current implementations of CP algorithms can solve

problems with hundreds of variables and constraints in very short times on a personal

computer [38]. CP problems can be so eﬃciently solved that Boyd and Vandenberghe

stated “if you formulate a practical problem as a convex optimization problem, then

you have solved the original problem.” ([10] page 8). We are in eﬀect replacing the

bound computations from existing access structure distance optimization algorithms

with a CP problem that covers the objective function and constraints. Although

such computations can be more complex, we found very little time diﬀerence in the

functions we examined.

We focus on query processing techniques by presenting a technique for general CP

objective functions, which can be applied to existing space or data partitioning-based

indices without losing the capabilities of those indexing techniques to answer other

types of queries. The basic idea of our technique is to divide the query processing into

two phases: First, solve the CP problem without considering the database, (i.e. in

continuous space). This “relaxed” optimization problem can be solved eﬃciently in

the main memory using many possible algorithms in the convex optimization litera-

ture. It can be solved in polynomial time in the number of dimensions in the objective

functions and the number of constraints ([43]). Second, search through the index by

pruning the impossible partitions and access the most promising pages. Details for

our technique for diﬀerent index types are provided in the following section.


2.4 Query Processing Framework

By solving CP problems associated with the problem constraints, partition con-

straints, and objective function, we ﬁnd the best possible objective function value

achievable by a point in the partition. Thus, there exists a lower bound for each

partition in the database that can be used to prune the data space before retrieving

pages from disk. The bounds can be deﬁned for any type of convex partition (such

as minimum bounding regions) used for indexing the data.

Figure 2.2 is a functional block diagram of the optimization query processing sys-

tem that shows interactions between components. Query processing proceeds ﬁrst by

the user providing an objective function and constraints to the system. A lightweight

query compiler validates the user input, converts the function and constraints into

appropriate format for the query processor, and executes the appropriate query pro-

cessing algorithm using the speciﬁed access structure. The query processor invokes

a CP solver to ﬁnd bounds for partitions that could potentially contain answers to

the query. By solving CP subproblems using index partition boundaries as additional

constraints, we can ensure that we access pages in the order of how promising they

are. The user gets back those data object(s) in the database that answer the query

within the constraints.

The generic query processing framework for a CP query is described below. This framework can be applied to indexing structures whose partitions are convex, because the partition boundaries themselves are used to form the new CP problems.

Figure 2.2: Optimization Query System Functional Block Diagram

A popular taxonomy of access structures divides them into hierarchical and non-hierarchical structures. Hierarchical structures are usually partitioned based on data objects, while non-hierarchical structures are usually based on partitioning space. These two families have important differences that affect how queries are processed over them. Therefore we provide algorithms for hierarchical structures in Subsection 2.4.1 and for non-hierarchical structures in Subsection 2.4.2.

The overall idea is to traverse the access structure and solve the problem by visiting partitions and search paths in order of how promising they are. We terminate when we come to a partition that could not contain an optimal answer to the query. The implementations described address a minimization objective function and prune based on comparison to minimum values. Maximization problems can be solved by using maximums (i.e., “worst” or upper bounds) in place of minimums.


The number of results returned by the query processing depends on an input k. If the user desires only the single best result, k = 1. It can, however, be an arbitrary value in order to return the k best data objects. For range queries, if all objects that meet the range constraints are desired, the value of k should be set to n, the number of data points. Only the data points within the constrained region will be returned, and only the points within the lowest-level partitions that intersect the constrained region will be examined during query processing. Therefore, the query processing will be at least as I/O-efficient as standard range query processing over a given access structure.

Now, we describe the implementations of this framework based on hierarchical

and non-hierarchical indexing techniques. In Section 2.5 we show that both of these

algorithms are in fact optimal, i.e., no unnecessary MBRs (or partitions) given an

access structure are accessed during the query processing.

2.4.1 Query Processing over Hierarchical Access Structures

Algorithm 1 shows the query processing over hierarchical access structures. This

algorithm is applicable to any hierarchical access structure where the partitions at

each tree level are convex. A partial list of structures covered by this algorithm

includes the R-tree, R*-tree, R+-tree, BSP-tree, Quad-tree, Oct-tree, k-d-B-tree,

B-tree, B+-tree, LSD-tree, Buddy-tree, P-tree, SKD-tree, GBD-tree, Extended k-D-

Tree, X-tree, SS-tree, and SR-tree [25, 9].


The algorithm is analogous to existing optimal NN processing [31], except that

partitions are traversed in order of promise with respect to an arbitrary convex func-

tion, rather than distance. Additionally, our algorithm also prunes out any partitions

and search paths that do not fall within problem constraints.

We solve a new CP problem for each partition that we traverse using the objective

function, original problem constraints, and the constraints imposed by the partition

boundaries. The algorithm takes as input the convex optimization function of interest

OF, the set of convex constraints C, and a desired result set size k. In lines 1-3, we

initialize the structures that we maintain throughout a query. We keep a vector of

the k best points found so far as well as a vector of the objective function values for

these points. We keep a worklist of promising partitions along with the minimum

objective value for the partition. We initialize the worklist to contain the tree root

partition.

We then process entries from the worklist until the worklist is empty. In line 5, we remove the first promising partition. If this partition is a leaf partition, then we access the points in the partition and compute their objective function values. If a point is not within the original problem constraints, then it will not have a feasible solution and will be discarded from further consideration. Points with objective function values lower than the objective function value of the kth best point so far will cause the best point and best point objective function value vectors to be updated accordingly.

If the partition under analysis is not a leaf, we ﬁnd its child partitions. For each

of these child partitions, we solve the CP problem corresponding to the objective

function, original problem constraints, and partition constraints (line 11). This yields


CPQueryHierarchical(Objective Function OF, Constraints C, k)
notes:
b[i] - the ith best point found so far
F(b[i]) - the objective function value of the ith best point
L - worklist of (partition, minObjVal) pairs
1: b[1]...b[k] ← null
2: F(b[1])...F(b[k]) ← +∞
3: L ← (root)
4: while L ≠ ∅
5:   remove first partition P from L
6:   if P is a leaf
7:     access and compute functions for points in P within constraints C
8:     update vectors b and F if better points are discovered
9:   else
10:    for each child P_child of P
11:      minObjVal = CPSolve(OF, P_child, C)
12:      if minObjVal < F(b[k])
13:        insert P_child into L according to minObjVal
14:    end for
15:  end if
16:  prune partitions from L with minObjVal > F(b[k])
17: end while
18: return vector b

Algorithm 1: I/O Optimal Query Processing for Hierarchical Index Structures.

the best possible objective function value achievable within the intersection of the original problem constraints and partition constraints. If the minimum objective function value is less than the kth best objective function value (i.e., it is possible for a point within the intersection of the problem constraints and partition constraints to beat the kth best point), then the partition is inserted into the worklist according to its minimum objective value. If there is no feasible solution for the partition within the problem constraints, the partition is dropped from consideration.

After a partition is processed, we may have updated our kth best objective function value. We prune any partitions in the worklist that cannot beat this value.
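The best-first traversal described above can be sketched compactly for k = 1; this illustrative Python version uses a heap as the worklist, a dict-based node layout of our own, and a generic bound(node) callback standing in for the CPSolve call:

```python
import heapq

def cp_query_hierarchical(root, bound, leaf_points, objective, feasible):
    """Best-first traversal: visit partitions in order of lower bound (k = 1)."""
    best_val, best_pt = float("inf"), None
    heap = [(bound(root), id(root), root)]        # (minObjVal, tiebreak, node)
    while heap:
        lb, _, node = heapq.heappop(heap)
        if lb >= best_val:                        # nothing left can beat the best
            break
        if node.get("children"):                  # internal partition
            for child in node["children"]:
                child_lb = bound(child)           # CPSolve stand-in
                if child_lb < best_val:           # prune on insert
                    heapq.heappush(heap, (child_lb, id(child), child))
        else:                                     # leaf: score its points
            for p in leaf_points(node):
                if feasible(p) and objective(p) < best_val:
                    best_val, best_pt = objective(p), p
    return best_pt, best_val

# Tiny illustrative tree; the trivial bound (always 0) is a valid lower bound.
leaf1, leaf2 = {"points": [(0.0,), (3.0,)]}, {"points": [(1.5,)]}
root = {"children": [leaf1, leaf2]}
pt, val = cp_query_hierarchical(
    root,
    bound=lambda n: 0.0,
    leaf_points=lambda n: n["points"],
    objective=lambda p: (p[0] - 1) ** 2,
    feasible=lambda p: True,
)
```

With a real CP-derived bound in place of the trivial one, the early-exit and prune-on-insert tests do the pruning work that the trivial bound forgoes.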

Query Processing Example. As an example of query processing, consider Figure 2.3 with an optimization objective of minimizing the distance to the query point within the constrained region (i.e., a constrained 1-NN query). We initialize our worklist with the root and start by processing it. Since it is not a leaf, we examine its child

partitions A and B. We solve for the minimum objective function value for A and do

not obtain a feasible solution because A and the constrained region do not intersect.

We do not process A further. We ﬁnd the minimum objective function for partition B

and get the distance from the query point to the point where the constrained region

and the partition B intersect, point c. We insert partition B into our worklist and

since it is the only partition in the worklist, we process it next. Since B is not a

leaf, we examine its child partitions, B.1, B.2, and B.3. We find minimum objective function values for each and obtain the distances from the query point to points d, e, and f, respectively. Since point e is closest, we insert the partitions into the worklist in the order B.2, B.1, B.3.

We process B.2 and since it is a leaf, we compute objective function values for

the points in B.2. Point B.2.a is the closer of the two. This point is designated

as the best point thus far, and the best point objective function is updated. After

processing B.2, we know we can do no worse than the distance from the query point

to B.2.a. Since partition B.3 can not do better than this, we prune it from the


worklist. We examine points in B.1. Point B.1.b is not a feasible solution (it is not

in the constrained region) so it is dropped. Point B.1.a gets a value which is better

than the current best point. We update the best point to be B.1.a. Since the worklist

is now empty, we have completed the query and return the best point.

The optimal point for this optimization query is B.1.a. We examine only points in partitions that could contain points as good as the best solution.

Figure 2.3: Example of Hierarchical Query Processing


2.4.2 Query Processing over Non-hierarchical Access Struc-

tures

In this section, we show that the general framework for CP queries can also be

implemented on top of non-hierarchical structures. Algorithm 2 can be applied to

non-hierarchical access structures where the partitions are convex. A partial list of

structures where the algorithm can be applied is Linear Hashing, Grid-File, EXCELL,

VA-File, VA+-File, and Pyramid Technique [25, 9, 53, 23].

We use a two-stage approach, detailed in Algorithm 2, to determine which data

objects optimize the query within the constraints. This technique is similar to the

technique identiﬁed in [53] for ﬁnding NN for a non-hierarchical structure, but we use

the objective function rather than a distance and also consider original problem con-

straints to ﬁnd lower bounds. We take the objective function, OF, original problem

constraints C, and result set size k as inputs. We initialize the structures to be used

during query processing (lines 1-3). In stage 1, we process each populated partition in the structure by solving the CP problem that corresponds to the original objective function and constraints combined with the additional constraints imposed by the

partition boundaries (line 5). Those partitions that do not intersect the constraints

will be bypassed. For those that do have a feasible solution, the lower bound of the

CP problem is recorded. At the end of stage 1, we place unique unpruned partitions

in increasing sorted order based on their lower bound. In stage 2, we process the

unpruned partitions by accessing objects contained by the ﬁrst partition and com-

puting actual objective values. We maintain a sorted list of the top-k objects we have

found so far, and continue accessing points contained by subsequent partitions until

a partition’s lower bound is greater than the kth best value. By accessing partitions


in order of how good they could be, according to the CP problem, we access only

those points in cell representations that could contain points as good as the actual

solution.

CPQueryNonHierarchical(Objective Function OF, Constraints C, k)
notes:
b[i] - the ith best point found so far
F(b[i]) - the objective function value of the ith best point
L - worklist of (partition, minObjVal) pairs
Initialize:
1: b[1]...b[k] ← null
2: F(b[1])...F(b[k]) ← +∞
3: L ← ∅
Stage 1:
4: for each populated partition P
5:   minObjVal = CPSolve(OF, P, C)
6:   if minObjVal < +∞
7:     insert (P, minObjVal) into L according to minObjVal
Stage 2:
8: for i = 1 to |L|
9:   if minObjVal of partition L[i] > F(b[k])
10:    break
11:  Access objects that partition L[i] contains
12:  for each object o
13:    if OF(o) < F(b[k])
14:      insert o into b
15:      update F(b)

Algorithm 2: I/O Optimal Query Processing over Non-hierarchical Structures.
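The two-stage scan can be sketched similarly for k = 1; in this illustrative version the cells are rectangles, the objective is squared distance to a query point, and a closest-corner computation stands in for the general CPSolve call:

```python
def cell_lower_bound(query, box):
    """Min squared distance from query to a rectangular cell (CPSolve stand-in)."""
    clamped = [min(max(q, lo), hi) for q, (lo, hi) in zip(query, box)]
    return sum((q - c) ** 2 for q, c in zip(query, clamped))

def cp_query_nonhierarchical(query, cells):
    """cells: list of (box, points). Stage 1 sorts by lower bound; stage 2 scans."""
    objective = lambda p: sum((q - x) ** 2 for q, x in zip(query, p))
    ranked = sorted(cells, key=lambda cell: cell_lower_bound(query, cell[0]))
    best_val, best_pt = float("inf"), None
    for box, points in ranked:                     # stage 2, in bound order
        if cell_lower_bound(query, box) > best_val:
            break                                  # no later cell can do better
        for p in points:                           # access the cell's objects
            if objective(p) < best_val:
                best_val, best_pt = objective(p), p
    return best_pt

cells = [([(0, 2), (0, 2)], [(1.0, 1.0)]),         # illustrative populated cells
         ([(5, 7), (5, 7)], [(5.0, 5.0)])]
nearest = cp_query_nonhierarchical((1.2, 1.2), cells)
```

Here the second cell's lower bound already exceeds the best value found in the first cell, so its objects are never accessed, mirroring the break at line 10.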


2.4.3 Non-Covered Access Structures

A review of access structure surveys yielded few structures that are not covered by the algorithms. These include structures that do not guarantee spatial locality (e.g., space-filling curves, which provide only approximate ordering); structures built over values computed for a specific function (such as distance in M-trees; since attribute information is lost, they cannot be applied to general functions over the original attributes); and structures that contain non-convex regions (BANG-File, hB-tree, BV-tree) [25, 9]. Note that M-trees would still work to answer queries for the functions they are built over, and structures with non-convex leaves could still be optimally traversed down to the level of non-convexity.

2.5 I/O Optimality

In this section, we will show that the proposed query processing technique achieves optimal I/O for each implementation in Section 2.4. We denote the objective function contour that passes through the actual optimal data point in the database as the optimal contour. This contour also passes through all other points in continuous space that yield the same objective function value as the optimal point. The kth optimal contour would pass through all points in continuous space that yield the same objective function value as the kth best point in the database. In the following arguments, we use the optimal contour, but it could be replaced with the kth optimal contour.

Following the traditional deﬁnition in [5], we deﬁne the optimality as follows.

Deﬁnition 3 (Optimality) An algorithm for optimization queries is optimal iﬀ it

retrieves only the pages that intersect the intersection of the constrained region R and

the optimal contour.


For hierarchical tree based query processing, the proof is given in the following.

Lemma 1 The proposed query processing algorithm is I/O optimal for CP queries

under convex hierarchical structures.

Proof. It is easy to see that any partition intersecting the intersection of optimal

contour and feasible region R will be accessed since they are not pruned in any phase

of the algorithm. We need to prove only these partitions are accessed, i.e., other

partitions will be pruned without accessing the pages.

Figure 2.4 shows the different possible position relations between a partition, R, and the optimal contour in 2 dimensions. Here a is an optimal data point in the database. We will use minObjVal(Partition, R) to denote the minimum objective function value computed for the partition and constraints R.

A: intersects neither R nor the optimal contour.

B: intersects R but not the optimal contour.

C: intersects the optimal contour but not R.

D: intersects the optimal contour and R, but not the intersection of optimal con-

tour and R.

E: intersects the intersection of the optimal contour and R.


Figure 2.4: Cases of MBRs with respect to R and optimal objective function contour

A and C will be pruned since they are infeasible regions: when the CP problems for these partitions are solved, they are eliminated from consideration because they do not intersect the constrained region, and minObjVal(A, R) = minObjVal(C, R) = +∞. B is pruned because minObjVal(B, R) > F(a), and E will be accessed earlier than B in the algorithm because minObjVal(E, R) < minObjVal(B, R). We shall show that a page in case D is pruned by the algorithm. Let CP_0 be the data partition that contains an optimal data point, CP_1 be the partition that contains CP_0, ..., and CP_k be the root partition that contains CP_0, CP_1, ..., CP_{k-1}. Because the objective function is convex, we have

F(a) ≥ minObjVal(CP_0, R) ≥ minObjVal(CP_1, R) ≥ ... ≥ minObjVal(CP_k, R)

and

minObjVal(D, R) > F(a) ≥ minObjVal(CP_0, R) ≥ minObjVal(CP_1, R) ≥ ... ≥ minObjVal(CP_k, R).

During the search process of our algorithm, CP_k is replaced by CP_{k-1}, CP_{k-1} by CP_{k-2}, and so on, until CP_0 is accessed. If D were accessed at some point during the search, then D would have to be first in the MBR list at some time. This can occur only after CP_0 has been accessed, because minObjVal(D, R) is larger than the constrained function value of every partition containing the optimal data point. If CP_0 is accessed earlier than D, however, the algorithm prunes all partitions whose constrained function value is larger than F(a), including D. This contradicts the assumption that D is accessed.

The I/O optimality relates to accessing data from leaf partitions. However, the algorithm is also optimal in terms of access structure traversal; that is, we never retrieve the objects within an MBR (leaf or non-leaf) unless that MBR could contain an optimal answer. The proof applies to any partition, whether it contains data or other partitions.

Similarly, we can prove the I/O optimality of the proposed algorithm over non-hierarchical structures.

Lemma 2 The proposed query processing algorithm is I/O optimal for CP queries

over non-hierarchical convex structures.

Proof. Figure 2.5 shows diﬀerent possible cases of the partition position. Partitions

A, B, C, D, and E are the same as deﬁned for Figure 2.4.


Figure 2.5: Cases of partitions with respect to R and optimal objective function

contour

Without loss of generality, assume that only these ﬁve partitions contain some

data points, and partition E contains an optimal data point, a, for a query F(x, y).

Then an optimal algorithm retrieves only partition E.

Partitions C and A will not be retrieved because they are infeasible and our

algorithm will never retrieve infeasible partitions. Partition B will not be retrieved

because partition E will be retrieved before partition B (because minObjVal(E, R) < minObjVal(B, R)), and the best data point found in E will prune partition B. Thus,

we only need to show that partition D will not be retrieved.


Because partition D does not intersect the intersection of the optimal contour and R but E does, minObjVal(D, R) > minObjVal(E, R). Therefore, partition E must be retrieved before partition D, since the algorithm always retrieves the most promising partition first, and the data point a is found at that time. Suppose partition D is also retrieved later. This means partition D cannot be pruned by data point a, i.e., F(a) ≥ minObjVal(D, R), which also means the virtual solution corresponding to minObjVal(D, R) is contained by the optimal contour (because of the convexity of function F). Therefore, the virtual solution corresponding to minObjVal(D, R) is in D ∩ R ∩ optimal contour. Hence, we can find at least one virtual point x* (without considering the database) in partition D such that x* is both in R and on the optimal contour. This contradicts the assumption that partition D does not intersect the intersection of R and the optimal contour.
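Both proofs rest on the same best-first traversal: partitions wait in a priority queue ordered by their constrained minimum objective value, infeasible partitions (bound +∞) are dropped, and the search stops once the best remaining bound cannot beat the best feasible point found. The following is a minimal sketch of that traversal; the partition layout, the separable weighted-distance bound, and all names are illustrative, not the dissertation's implementation:

```python
import heapq
import itertools

def min_weighted_dist2(query, weights, lo, hi):
    # Minimum weighted squared distance from `query` to the box [lo, hi].
    # For a separable convex objective the box minimum is per-dimension:
    # clamp each query coordinate into the box interval.
    return sum(w * max(l - q, 0.0, q - h) ** 2
               for q, w, l, h in zip(query, weights, lo, hi))

class Partition:
    def __init__(self, lo, hi, children=(), points=()):
        self.lo, self.hi = lo, hi
        self.children = list(children)
        self.points = list(points)

def optimal_search(root, bound_fn, value_fn):
    # Best-first (I/O-optimal) traversal: always expand the partition with
    # the smallest lower bound; stop when the best bound left on the heap
    # cannot beat the best data point found so far.
    tie = itertools.count()                 # tie-breaker for equal bounds
    heap = [(bound_fn(root), next(tie), root)]
    best_val, best_pt, leaves_read = float('inf'), None, 0
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_val:               # everything remaining is pruned
            break
        if node.points:                     # leaf partition: read its page
            leaves_read += 1
            for p in node.points:
                v = value_fn(p)
                if v < best_val:
                    best_val, best_pt = v, p
        for child in node.children:
            b = bound_fn(child)
            if b < float('inf'):            # infeasible partitions: pruned
                heapq.heappush(heap, (b, next(tie), child))
    return best_pt, leaves_read

# Example: query at the origin; the far leaf is pruned without being read
leaf_near = Partition((0, 0), (1, 1), points=[(0.2, 0.2), (0.9, 0.9)])
leaf_far = Partition((5, 5), (6, 6), points=[(5.0, 5.0)])
root = Partition((0, 0), (6, 6), children=[leaf_near, leaf_far])
bound = lambda n: min_weighted_dist2((0.0, 0.0), (1.0, 1.0), n.lo, n.hi)
value = lambda p: p[0] ** 2 + p[1] ** 2
best, reads = optimal_search(root, bound, value)
assert best == (0.2, 0.2) and reads == 1
```

The stopping condition mirrors the proofs: once the smallest bound on the queue is at least F(a) for the best point a found so far, no remaining partition can contain a better answer.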

2.6 Performance Evaluation

The goals of the experiments are to show i) that the proposed framework is general and can be used to process a variety of queries that optimize some objective, ii) that the generality of the framework does not impose a significant CPU burden over those queries that already have an optimal solution, and iii) that non-relevant data subspace can be effectively pruned during query processing.

We performed experiments for a variety of optimization query types including

kNN, Weighted kNN, Weighted Constrained kNN, and Constrained Weighted Linear

Optimization. Where possible, we compare CPU times against a non-general optimization technique. We index and perform queries over four real data sets. One of these is Color Histogram data, a 64-dimensional color image histogram dataset of 100,000


data points. The vectors represent color histograms computed from a commercial

CD-ROM. The second is Satellite Image Texture (Landsat) data, which consists of

100,000 60-dimensional vectors representing texture feature vectors of Landsat images

[40]. These datasets are widely used for performance evaluation of index structures

and similarity search algorithms [39, 26]. We used a clinical trials patient data set

and a stock price data set to perform real world optimization queries.

Weighted kNN Queries Figure 2.6 shows the performance of our query processing over R*-trees for the first 8 dimensions of the histogram data for both kNN and k weighted NN (k-WNN) queries. The index is built over the first 8 dimensions

of the histogram data and the queries are generated to ﬁnd the kNN or k-WNN to a

random point in the data space over the same 8 dimensions. We execute 100 queries

for each trial, and randomly assign weights between 0 and 1 for the k-WNN queries.

We vary the value of k between 1 and 100. We assume the index structure is in

memory and we measure the number of additional page accesses required to access

points that could be kNN according to the query processing. Because the weights

are 1 for the kNN queries, our algorithm matches the I/O-optimal access oﬀered by

traditional R-tree nearest neighbor branch and bound searching.

The ﬁgure shows that we achieve nearly the same I/O performance for weighted

kNN queries using our query processing algorithm over the same index that is ap-

propriate for traditional non-weighted kNN queries. For unconstrained, unweighted

nearest neighbor queries, the optimal contour is a hyper-sphere around the query point

with a radius equal to the distance of the nearest point in the data set. If instead, the

queries are weighted nearest neighbor queries, the optimal contour is a hyper-ellipse.

Figure 2.6: Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram, 100k pts

Intuitively, as the weights of a weighted nearest neighbor query become more skewed, the hyper-volume enclosed by the optimal contour increases, and the opportunity to

intersect with access structure partitions increases. Results for the weighted queries

track very closely to the non-weighted queries. This indicates that these weighted

functions do not cause the optimal contour to cross signiﬁcantly more access struc-

ture partitions than their non-weighted counterparts. For a weighted nearest neighbor

query, we could generate an index based on attribute values transformed based on

the weights and yield an optimal contour that was hyper-spherical. We could then

traverse the access structure using traditional nearest neighbor techniques. However,

this would require generating an index for every potential set of query weights, which

is not practical for applications where weights are not typically known prior to the query. For similar applications where query weights can vary, and these weights are

not known in advance, we would be better served to process the query over a single

reasonable access structure in an I/O-optimal way than by building speciﬁc access

structures for a speciﬁc set of weights.
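The transform mentioned above is simple to state: scaling each attribute by the square root of its weight turns a weighted Euclidean distance into an unweighted one, so the hyper-ellipse contour becomes a hyper-sphere in the scaled space. A quick check of this identity (illustrative code, not tied to any particular index):

```python
import math
import random

def weighted_dist2(p, q, w):
    # weighted squared Euclidean distance
    return sum(wi * (pi - qi) ** 2 for pi, qi, wi in zip(p, q, w))

def scale(p, w):
    # transform attribute values by sqrt(weight); the weighted ellipse
    # contour becomes a sphere in the scaled space
    return tuple(pi * math.sqrt(wi) for pi, wi in zip(p, w))

def dist2(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

rng = random.Random(7)
w = [rng.random() for _ in range(8)]
p = [rng.random() for _ in range(8)]
q = [rng.random() for _ in range(8)]
assert abs(weighted_dist2(p, q, w) - dist2(scale(p, w), scale(q, w))) < 1e-12
```

This is exactly why a per-weight index would work, and also why it is impractical: the scaling, and hence the index, changes with every new weight vector.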

Figure 2.7 shows the average cpu times required per query to traverse the access

structure using convex problem solutions for the weighted kNN queries compared to

the standard nearest neighbor traversal used for unweighted kNN queries.

Figure 2.7: CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,

100k pts


The ﬁgure shows that the generic CP-driven solution takes slightly longer to per-

form than a specialized solution for nearest neighbor queries alone. Part of this

diﬀerence is due to the slightly more complicated objective function (weighted ver-

sus unweighted Euclidean distance), part is due to the additional partitions that

the optimal contour crosses, while the remaining part is due to incorporating data

set constraints (unused in this example) into computing minimum partition function

values.

Hierarchical data partitioning access structures are not eﬀective for high-dimensional

data sets. As data set dimensionality increases, the partitions overlap with each other

more. As partitions overlap more, the probability that a point’s objective function

value can prune other partitions decreases. Therefore, the same characteristics that

make high-dimensional R-tree type structures ineﬀective for traditional queries also

make them ineﬀective for optimization queries. For this reason, we perform similar

experiments for high-dimensional data using VA-ﬁles.

Figure 2.8 shows page access results for kNN and k-WNN queries over a 60-dimensional, 5-bit-per-attribute VA-file built over the Landsat dataset. The experiments assume

the VA-ﬁle is in memory and measure the number of objects that need to be accessed

to answer the query.

Similar to the R*-tree case, the results show that the performance of the algorithm

is not signiﬁcantly aﬀected by re-weighting the objective function. Sequential scan

would result in examining 100,000 objects. If the dataspace were more uniformly

populated, we would expect to see object accesses much closer to k. However, much

of the data in this data set is clustered, and some vector representations are heavily populated. If a best answer comes from one of these vector representations, we need to look at each object assigned the same vector representation.

Figure 2.8: Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts

It should be noted that while the framework results in optimal access of a given structure, it cannot overcome the inherent weaknesses of that structure. For example, at high dimensionality, hierarchical data structures are ineffective, and the framework cannot overcome this. A consequence of VA-File access structures is that, without some other data reorganization, a computation must occur for every approximated object. Just as traditional VA-file nearest neighbor algorithms perform a distance computation for each vector, the VA-file processing model requires that a CP problem be solved for each vector for general optimization queries. Times for this experiment reflect this fact. Each query takes longer to perform, but the ratio of the general optimization query times to the traditional kNN CPU times is similar to the R*-tree case, about 1.006 (i.e., very little computational burden is added to allow query generality).

Constrained Weighted Linear Optimization Queries We also explore Con-

strained Weighted-Linear Optimization queries. Figure 2.9 shows the number of page

accesses required to answer randomly weighted LP minimization queries for the his-

togram data set using the 8-D R*-Tree access structure. We vary k, and we vary the constrained region, fixed at the origin as an 8-D hypercube with the indicated size. We generated 100 queries for each data point in the form of a summation of

weighted attributes over the 8 dimensions. Weights are uniformly randomly generated

and are between the values of -10 and 10.

The graph shows interesting results. The number of page accesses is not directly

related to the constraint size. The rate of increase in page accesses to accommodate

diﬀerent values of k does vary with the constraint size. When weights can be negative,

the corner corresponding to the minimum constraint value for the positively weighted

attributes, and the maximum constrained value for the negatively weighted attributes

will yield the optimal result within the constrained region (e.g. corner [0,1,0] would

be an optimal minimization answer for a constrained unit cube and a linear function

with a weight vector [+,-,+]). For this data set, many points are clustered around

the origin, leading to denser packing of access structure partitions near the origin.

Because of the sparser density of partitions around the point that optimizes the query, we access fewer partitions when this particular problem is less constrained. However, the radius of the k-th optimal contour increases more in these less dense regions than in more clustered regions, leading to a faster increase in the number of page accesses required when k increases.

Figure 2.9: Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree, 8-D Color Histogram, 100k pts
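The corner rule described above (lower bound for positively weighted attributes, upper bound for negatively weighted ones) can be written down directly; the function name here is illustrative:

```python
def linear_box_minimizer(weights, lo, hi):
    # For a linear objective sum(w_i * x_i) over the box [lo, hi], the
    # minimizing corner takes the lower bound where the weight is
    # non-negative and the upper bound where it is negative.
    return tuple(l if w >= 0 else h for w, l, h in zip(weights, lo, hi))

# Unit cube with weight signs [+, -, +] -> optimal corner [0, 1, 0]
assert linear_box_minimizer((1.0, -2.0, 3.0), (0, 0, 0), (1, 1, 1)) == (0, 1, 0)
```

This per-corner evaluation is what makes linear objectives over convex box partitions cheap to bound.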

Note that the Onion technique is not feasible for this dataset dimensionality, and neither it nor the other competing techniques are appropriate for convex constrained areas. For comparative purposes, the sequential scan of this data set is

834 pages.

We should also note what happens when there are fewer than k optimal answers in the data set. This means that there are fewer than k objects in our constrained region. For our query processing, we only need to examine those points that are in partitions that intersect the constrained region. A technique that does not evaluate constraints prior to or during query processing would need to examine every point.

Queries with Real Constraints The previous example showed results with synthetically generated constraints. We also wanted to explore constrained optimization problems for real applications. We performed Weighted Constrained kNN experiments to emulate a patient matching data exploration task described in Section 2.3.3. To test this, we performed a number of queries over a set of 17000 real patient records. We established constraints on age and gender and queried to find the 10 weighted nearest neighbors in the data set according to 7 measured blood analytes. Using the VA-file structure with 5 bits assigned per attribute and 100 randomly selected, randomly weighted queries, we achieved an average of 11.5 vector accesses to find the k = 10 results. This means that on average the number of vectors within the intersection of the constrained space and the 10th-optimal contour is 11.5. This corresponds to a vector access ratio of 0.0007.

Arbitrary Functions over Diﬀerent Access Structures In order to show that

the framework can be used to ﬁnd optimal data points for arbitrary convex functions,

we generated a set of 100 random functions. Unlike nearest neighbor and linear

optimization type queries, the continuous optimal solution is not intuitively obvious

by examining the function. Functions contain both quadratic and linear terms and were generated over 5 dimensions in the form Σ_{i=1}^{5} (random() ∗ x_i^2 − random() ∗ x_i), where x_i is the data attribute value for the i-th dimension. A sample function, where coefficients are rounded to two decimal places, is to minimize 0.91x_1^2 + 0.49x_2^2 + 0.4x_3^2 + 0.55x_4^2 + 0.76x_5^2 − 0.09x_1 − 0.49x_2 − 0.28x_3 − 0.91x_4 − 0.51x_5.
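Because the random functions are separable, each term a_i x_i^2 − b_i x_i can be minimized independently at x_i = b_i / (2 a_i). A small sketch of how such functions might be generated and sanity-checked (coefficient ranges follow the text; the helper names and seed are ours):

```python
import random

def random_convex_function(d=5, seed=None):
    # sum_i (a_i * x_i^2 - b_i * x_i) with a_i, b_i drawn from [0, 1);
    # positive quadratic coefficients keep the function convex
    rng = random.Random(seed)
    a = [rng.random() for _ in range(d)]
    b = [rng.random() for _ in range(d)]
    def f(x):
        return sum(ai * xi * xi - bi * xi for ai, bi, xi in zip(a, b, x))
    return f, a, b

f, a, b = random_convex_function(d=5, seed=42)
# Each separable term a_i x^2 - b_i x is minimized at x = b_i / (2 a_i),
# so perturbing any coordinate away from that point cannot decrease f.
x_star = [bi / (2 * ai) for ai, bi in zip(a, b)]
for i in range(5):
    bumped = list(x_star)
    bumped[i] += 1e-6
    assert f(bumped) >= f(x_star)
```

The same separability means the minimum of such a function over a box partition can be found per dimension, which keeps the bound computation cheap during index traversal.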


We explored the results of our technique over 5 index structures: 3 hierarchical structures (the R-tree, the R*-tree, and the VAM-split bulk-loaded R*-tree) and 2 non-hierarchical structures (the grid file and the VA-file). For each of these index types, we modified code to include an optimization

function that uses CP to traverse the index structure using the algorithm appropriate

for the structure. We added a new set of 50000 uniform random data points within

the 5-dimension hypercube and built each of the index structures over this data set.

We ran the set of random functions over each structure and measured the number

of objects in the partitions that we retrieve. For the hierarchical structures these are

leaf partitions and for the non-hierarchical structures they are grid cells. For each

structure we try to keep the number of partitions relatively close to the others. The number of partitions for the hierarchical structures is between 12000 and 13000, for the grid file 16000, and for the VA-File 32000.

Figure 2.10 shows results over the 100 functions. It shows the minimum, maxi-

mum, and average object access ratio to guarantee discovery of the optimal answer

over the set of functions. Each index yielded the same correct optimal data point for

each function, and these optimal answers were scattered throughout the data space.

Each structure performed well for these optimization queries; the average ratio of the data points accessed is below 0.0025 for all the access structures. Even the worst

performing structure over the worst case function still prunes over 99% of the search

space.

Results between hierarchical access structures are as expected. The R*-tree looks

to avoid some of the partition overlaps produced by the R-tree insertion heuristics

and it shows better performance. The VAM-split R*-tree bulk loads the data into the index and can avoid more partition overlaps than an index built using dynamic insertion. As expected, this yields better performance than the dynamically constructed R*-tree.

Figure 2.10: Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform Random, 50k Points

The VA-File has greater resolution than the grid-ﬁle in this particular test, and

therefore achieves better access ratios.

Incorporating Problem Constraints during Query Processing The pro-

posed framework processes queries in conjunction with any set of convex problem

constraints. We compare the proposed framework to alternative methods for han-

dling constraints in optimization query processing. For a constrained NN type query

we could process the query according to traditional NN techniques and, each time we find a result, check it against the constraints until we find a point that meets them. Alternatively, we could perform a query on the constraints themselves,

and then perform distance computations on the result set and select the best one.

Figure 2.11 shows the number of pages accessed using our technique compared to

these two alternatives as the constrained area varies. Constrained kNN queries were

performed over the 8-D R*-tree built for the Color Histogram data set.

Figure 2.11: Page Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram,

100k Points

In the ﬁgure, the R*-tree line shows the number of page accesses required when

we use traditional NN techniques, and then check if the result meets the constraints.

As the problem becomes less constrained, it becomes more likely that a discovered NN falls within the constraints, and we can terminate the search. The Selectivity line is a


lower bound on the number of pages that would be read to obtain the points within

the constraints. Clearly, as the problem becomes more constrained, fewer pages need to be read to find potential answers. The CP line shows the results for our

framework, which processes original problem constraints as part of the search process

and does not further process any partitions that do not intersect constraints.

We demonstrate better results with respect to page accesses except in cases where

the constraints are very small (and we can prune much of the space by performing a

query over the constraints ﬁrst) or when the NN problem is not constrained (where

our framework reduces to the traditional unconstrained NN problem). When there

are problem constraints, we prune search paths that cannot lead to feasible solutions.

Our search drills down to the leaf partition that either contains the optimal answer

within the problem constraints, or yields the best potential answer within constraints.

2.7 Conclusions

In this chapter we present a general framework to model optimization queries.

We present a query processing framework to answer optimization queries in which

the optimization function is convex, query constraints are convex, and the underlying

access structure is made up of convex partitions. We provide speciﬁc implementations

of the processing framework for hierarchical data partitioning and non-hierarchical

space partitioning access structures. We prove the I/O optimality of these implemen-

tations. We experimentally show that the framework can be used to answer a number

of diﬀerent types of optimization queries including nearest neighbor queries, linear

function optimization, and non-linear function optimization queries.


The ability to handle general optimization queries within a single framework comes

at the price of increasing the computational complexity when traversing the access

structure. We show a slight increase in CPU times in order to perform convex pro-

gramming based traversal of access structures in comparison with equivalent non-

weighted and unconstrained versions of the same problem.

Since constraints are handled during the processing of partitions in our framework,

and partitions and search paths that do not intersect with problem constraints are

pruned during the access structure traversal, we do not need to examine points that

do not intersect with the constraints. Furthermore, we only access points in partitions

that could potentially have better answers than the actual optimal database answers.

This results in a signiﬁcant reduction in required accesses compared to alternative

methods in the cases where the inclusion of constraints makes these alternatives less

eﬀective.

The proposed framework oﬀers I/O-optimal access of whatever access structure is

used for the query. This means that given an access structure we will only read points

from partitions that have minimum objective function values lower than the k-th actual best answer. Because the framework is built to best utilize the access structure,

it captures the beneﬁts as well as the ﬂaws of the underlying structure. Essentially,

it demonstrates how well the particular access structure isolates the optimal contour

within the problem constraints to access structure partitions. If the optimal contour

and problem constraints can be isolated to a few leaf partitions, the access structure will yield good results. However, if the optimal contour crosses many partitions, the

performance will not be as good. As such, the framework can be used to measure

page access performance associated with using diﬀerent indexes and index types to


answer certain classes of optimization queries, in order to determine which struc-

tures can most eﬀectively answer the optimization query type. Database researchers

and administrators can use this technique as a benchmarking tool to evaluate the

performance of a wide range of index structures available in the literature.

To use this framework, one does not need to know the objective function, weights,

or constraints in advance. The system does not need to compute a queried function

for all data objects in order to ﬁnd the optimal answers. The technique provides

optimization of arbitrary convex functions, and does not incur a signiﬁcant penalty

in order to provide this generality. This makes the framework appropriate for ap-

plications and domains where a number of diﬀerent functions are being optimized

or when optimization is being performed over diﬀerent constrained regions and the

exact query parameters are not known in advance.


CHAPTER 3

Online Index Recommendations for High-Dimensional

Databases using Query Workloads

When solving problems, dig at the roots instead of just hacking at the

leaves. - Anthony J. D’Angelo.

In order to eﬀectively prune search space during query resolution, access struc-

tures that are appropriate for the query must be available to the query processor

at query run-time. Appropriate access structures should match well with query se-

lection criteria and allow for a signiﬁcant reduction in the data search space. This

chapter describes a framework to determine a meaningful set of appropriate access

structures for a set of queries considering query cost savings estimated from both

query attributes and data characteristics. Since query patterns can change over time,

rendering statically determined access structures ineﬀective, a method for monitoring

performance and recommending access structure changes is provided.

3.1 Introduction

An increasing number of database applications, such as business data warehouses

and scientiﬁc data repositories, deal with high dimensional data sets. As the number

of dimensions/attributes and overall size of data sets increase, it becomes essential to

eﬃciently retrieve speciﬁc queried data from the database in order to eﬀectively utilize


the database. Indexing support is needed to eﬀectively prune out signiﬁcant portions

of the data set that are not relevant for the queries. Multi-dimensional indexing,

dimensionality reduction, and RDBMS index selection tools all could be applied to the

problem. However, for high-dimensional data sets, each of these potential solutions

has inherent problems.

To illustrate these problems, consider a uniformly distributed data set of 1,000,000

data objects with several hundred attributes. Range queries are consistently executed

over five of the attributes. The query selectivity over each attribute is 0.1, so the overall query selectivity is 1/10^5 (i.e., the answer set contains about 10 results).

An ideal solution would allow us to read from disk only those pages that contain

matching answers to the query. We could build a multi-dimensional index over the

data set so that we can directly answer any query by only using the index. However,

performance of multi-dimensional index structures is subject to Bellman’s curse of

dimensionality [4] and degrades rapidly as the number of dimensions increases. For

the given example, such an index would perform much worse than sequential scan.

Another possibility would be to build an index over each single dimension. The

eﬀectiveness of this approach is limited to the amount of search space that can be

pruned by a single dimension (in the example the search space would only be pruned

to 100,000 objects).
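The arithmetic behind this running example is worth making explicit (assuming independent, uniformly distributed attributes):

```python
n = 1_000_000        # data objects
sel = 0.1            # selectivity of each single-attribute range predicate
attrs = 5            # attributes queried together

# Combined selectivity of the conjunctive query over independent attributes
combined = sel ** attrs
assert abs(combined - 1e-05) < 1e-15
assert round(n * combined) == 10        # about 10 matching objects

# A single-attribute index prunes only to n * sel candidates
assert round(n * sel) == 100_000
```

The gap between 10 true answers and 100,000 single-attribute candidates is what motivates indexing small attribute subsets jointly.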

For data-partitioning indexes such as the R-tree family, each data point is placed in a partition that contains it and that could overlap with other partitions.

To answer a query, all potentially matching search paths must be explored. As the

dimensionality of the index increases, the overlaps between partitions increase and


at high enough dimensions the entire data space needs to be explored. For space-

partitioning structures, where partitions do not overlap and data points are associated

with cells which contain them (e.g. grid files), the problem is the exponential explosion of the number of cells. A 100-dimensional index with only a single split per attribute results in 2^100 cells. The index structure can become intractably large just to identify the cells. This phenomenon is thoroughly described in [54].

Another possible solution would be to use some dimensionality reduction tech-

nique, index the reduced dimension data space, and transform the query in the

same way that the data was transformed. However, the dimensionality reduction

approaches are mostly based on data statistics, and perform poorly especially when

the data is not highly correlated. They also introduce a signiﬁcant overhead in the

processing of queries.

Another possible solution is to apply feature selection to keep the most important

attributes of the data according to some criteria and index the reduced dimension-

ality space. However, traditional feature selection techniques are based on selecting

attributes that yield the best classiﬁcation capabilities. Therefore, they also select

attributes based on data statistics to support classiﬁcation accuracy rather than focus-

ing on query performance and workload in a database domain. As well, the selected

features may oﬀer little or no data pruning capability given query attributes.

Commercial RDBMS’s have developed index recommendation systems to identify

indexes that will work well for a given workload. These tools are optimized for the

domains for which these systems are primarily employed and the indexes that the

systems provide. They are targeted towards lower dimension transactional databases

and do not produce results optimized for single high-dimensional tables.


Our approach is based on the observation that in many high dimensional database

applications, only a small subset of the overall data dimensions are popular for a

majority of queries and recurring patterns of dimensions queried occur. For example,

Large Hadron Collider (LHC) experiments are expected to generate data with up to

500 attributes at the rate of 20 to 40 per second [45]. However, the search criterion

is expected to consist of 10 to 30 parameters. Another example is High Energy

Physics (HEP) experiments [49] where sub-atomic particles are accelerated to nearly

the speed of light, forcing their collision. Each such collision generates on the order

of 1-10MBs of raw data, which corresponds to 300TBs of data per year consisting

of 100-500 million objects. The queries are predominantly range queries and involve

mostly around 5 dimensions out of a total of 200.

We address the high dimensional database indexing problem by selecting a set of

lower dimensional indexes based on joint consideration of query patterns and data

statistics. This approach is also analogous to dimensionality reduction or feature

selection with the novelty that the reduction is speciﬁcally designed for reducing query

response times, rather than maintaining data energy as is the case for traditional

approaches. Our reduction considers both data and access patterns, and results in

multiple and potentially overlapping sets of dimensions, rather than a single set. The

new set of low dimensional indexes is designed to address a large portion of expected

queries and allow eﬀective pruning of the data space to answer those queries.

Query pattern evolution over time presents another challenging problem. Re-

searchers have proposed workload based index recommendation techniques. Their

long term eﬀectiveness is dependent on the stability of the query workload. However,

query access patterns may change over time, becoming completely dissimilar from the patterns on which the index set was originally determined. There are many common

reasons why query patterns change. Pattern change could be the result of periodic

time variation (e.g. diﬀerent database uses at diﬀerent times of the month or day), a

change in the focus of user knowledge discovery (e.g. a researcher discovery spawns

new query patterns), a change in the popularity of a search attribute (e.g. current

events cause an increase in queries for certain search attributes), or simply random

variation of query attributes. When the current query patterns are substantially dif-

ferent from the query patterns used to recommend the database indexes, the system

performance will degrade drastically since incoming queries do not beneﬁt from the

existing indexes. To make this approach practical in the presence of query pattern

change, the index set should evolve with the query patterns. For this reason, we

introduce a dynamic mechanism to detect when the access patterns have changed

enough that either the introduction of a new index, the replacement of an existing

index, or the construction of an entirely new index set is beneﬁcial.

Because of the need to proactively monitor query patterns and query performance

quickly, the index selection technique we have developed uses an abstract represen-

tation of the query workload and the data set that can be adjusted to yield faster

analysis. We generate this abstract representation of the query workload by mining

patterns in the workload. For each query, the workload representation records the attribute sets that occur frequently over the entire query set and have non-empty intersections with that query's attributes. To estimate the query

cost, the data set is represented by a multi-dimensional histogram where each unique

value represents an approximation of data and contains a count of the number of


records that match that approximation. For each possible index for each query, the

estimated cost of using that index for the query is computed.
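One way such a histogram-based cost estimate could be realized is sketched below; the quantization width, the normalization of attributes to [0, 1), and the function names are our assumptions, not the dissertation's implementation:

```python
from collections import Counter

def build_histogram(data, bits=2):
    # Approximate each record by quantizing every attribute (assumed
    # normalized to [0, 1)) into 2**bits buckets; count records per
    # unique approximation.
    buckets = 1 << bits
    def approx(rec):
        return tuple(min(int(v * buckets), buckets - 1) for v in rec)
    return Counter(approx(r) for r in data)

def estimated_cost(hist, index_dims, query_ranges, bits=2):
    # Estimated number of records read when an index over `index_dims`
    # answers the query: sum the counts of histogram cells whose indexed
    # coordinates could intersect the query ranges.
    buckets = 1 << bits
    cost = 0
    for cell, count in hist.items():
        intersects = True
        for d in index_dims:
            lo, hi = query_ranges[d]
            # cell coordinate c covers the interval [c/buckets, (c+1)/buckets)
            if cell[d] + 1 <= lo * buckets or cell[d] >= hi * buckets:
                intersects = False
                break
        if intersects:
            cost += count
    return cost
```

For instance, with three 2-D records and an index on attribute 0, a query restricted to the lower half of attribute 0 is estimated to read only the records whose cells fall in that half; adding attribute 1 to the index tightens the estimate further.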

Initial index selection occurs by traversing the query workload representation and

determining which frequently occurring attribute set results in the greatest beneﬁt

over the entire query set. This process is iterated until some indexing constraint is met

or no further improvement is achieved by adding additional indexes. Analysis speed

and granularity is aﬀected by tuning the resolution of the abstract representations.

The number of potential indexes considered is aﬀected by adjusting data mining

support level. The size of the multi-dimensional histogram aﬀects the accuracy of the

cost estimates associated with using an index for a query.
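The iteration just described is a greedy loop. A condensed sketch follows; the benefit function is supplied by the cost model above, and all names here are illustrative:

```python
def greedy_index_selection(candidates, queries, benefit, max_indexes):
    # Repeatedly add the candidate attribute set with the greatest total
    # estimated benefit over the workload, until the indexing constraint
    # is hit or no candidate yields further improvement.
    chosen = []
    while len(chosen) < max_indexes:
        best, best_gain = None, 0.0
        for cand in candidates:
            if cand in chosen:
                continue
            gain = sum(benefit(q, chosen + [cand]) - benefit(q, chosen)
                       for q in queries)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            break                      # no further improvement
        chosen.append(best)
    return chosen

# Toy benefit: best attribute overlap between the query and any chosen index
def overlap_benefit(query, indexes):
    return max((len(set(i) & set(query)) for i in indexes), default=0)

picked = greedy_index_selection(
    candidates=[("a", "b"), ("b", "c"), ("d",)],
    queries=[("a", "b"), ("b", "c")],
    benefit=overlap_benefit,
    max_indexes=2)
assert picked == [("a", "b"), ("b", "c")]
```

With the toy benefit, the two complementary attribute sets are picked and the useless one is never added, mirroring the stop condition in the text.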

In order to facilitate online index selection, we propose a control feedback system

with two loops, a ﬁne grain control loop and a coarse control loop. As new queries

arrive, we monitor the ratio of potential performance to actual performance of the

system in terms of cost and based on the parameters set for the control feedback

loops, we make major or minor changes to the recommended index set.

The contributions of this work can be summarized as follows:

1. Introduction of a ﬂexible index selection technique designed for high-dimensional

data sets that uses an abstract representation of the data set and query work-

load. The resolution of the abstract representation can be tuned to achieve

either a high ratio of index-covered queries for static index selection or fast

index selection to facilitate online index selection.

2. Introduction of a technique using control feedback to monitor when online

query access patterns change and to recommend index set changes for high-

dimensional data sets.


3. Presentation of a novel data quantization technique optimized for query work-

loads.

4. Experimental analysis showing the eﬀects of varying abstract representation

parameters on static and online index selection performance and showing the

eﬀects of varying control feedback parameters on change detection response.

The rest of the chapter is organized as follows. Section 3.2 presents the related

work in this area. Section 3.3 explains our proposed index selection and control

feedback framework. Section 3.4 presents the empirical analysis. We conclude in

Section 3.5.

3.2 Related Work

High-dimensional indexing, feature selection, and DBMS index selection tools are

possible alternatives for addressing the problem of answering queries over a subspace

of high-dimensional data sets. As described below, each of these methods provides

less than ideal solutions for the problem of fast high-dimensional data access. Our

work diﬀers from the related index selection work in that we provide an index selection

framework that can be tuned for speed or accuracy. Our technique is optimized

to take advantage of multi-dimensional pruning oﬀered by multi-dimensional index

structures. It takes into consideration both data and query characteristics and can

be applied to perform real-time index recommendations for evolving query patterns.

3.2.1 High-Dimensional Indexing

A number of techniques have been introduced to address the high-dimensional

indexing problem, such as the X-tree [6], and the GC-tree [28]. While these index


structures have been shown to increase the range of eﬀective dimensionality, they still

suﬀer performance degradation at higher index dimensionality.

3.2.2 Feature Selection

Feature selection techniques [7, 36, 29] are a subset of dimensionality reduction

targeted at ﬁnding a set of untransformed attributes that best represent the overall

data set. These techniques are also focused on maximizing data energy or classiﬁca-

tion accuracy rather than query response. As a result, selected features may have no

overlap with queried attributes.

3.2.3 Index Selection

The index selection problem has been identiﬁed as a variation of the Knapsack

Problem and several papers proposed designs for index recommendations [34, 55, 3,

24, 17, 13] based on optimization rules. These earlier designs could not take advantage

of modern database systems’ query optimizer. Currently, almost every commercial

Relational Database Management System (RDBMS) provides users with an index

recommendation tool based on a query workload and using the query optimizer to

obtain cost estimates. A query workload is a set of SQL data manipulation state-

ments. The query workload should be a good representative of the types of queries

an application supports.

Microsoft SQL Server’s AutoAdmin tool [15, 16, 1] selects a set of indexes for use

with a speciﬁc data set given a query workload. In the AutoAdmin algorithm, an

iterative process is utilized to ﬁnd an optimal conﬁguration. First, single dimension

candidate indexes are chosen. Then a candidate index selection step evaluates the

queries in a given query workload and eliminates from consideration those candidate


indexes which would provide no useful beneﬁt. Remaining candidate indexes are eval-

uated in terms of estimated performance improvement and index cost. The process

is iterated for increasingly wider multi-column indexes until a maximum index width

threshold is reached or an iteration yields no improvement in performance over the

last iteration.

Costs are estimated using the query optimizer which is limited to considering those

physical designs oﬀered by the DBMS. In the case of SQL Server, single and multi-level

B+-trees are evaluated. These index structures cannot achieve the same level of result

pruning that can be oﬀered by an index technique that indexes multiple dimensions

simultaneously (such as R-tree or grid ﬁle). As a result the indexes suggested by the

tool often do not capture the query performance that could be achieved for multi-

dimensional queries.

MAESTRO (METU Automated indEx Selection Tool) [19] was developed on top of Oracle's DBMS to assist the database administrator in designing a complete set of primary and secondary indexes. It considers the index maintenance costs based on the valid SQL statements and their usage statistics, automatically derived using the SQL Trace Facility during a regular database session. The SQL statements are classiﬁed by

their execution plan and their weights are accumulated. The cost function computed

by the query optimizer is used to calculate the beneﬁt of using the index.

For DB2, IBM has developed the DB2Adviser [52] which recommends indexes

with a method similar to AutoAdmin with the diﬀerence that only one call to the

query optimizer is needed since the enumeration algorithm is inside the optimizer

itself.


These commercial index selection tools are coupled to physical design options

provided by their respective query optimizers and therefore, do not reﬂect the pruning

that could be achieved by indexing multiple dimensions together.

3.2.4 Automatic Index Selection

The idea of having a database that can tune itself by automatically creating new

indexes as the queries arrive has been proposed [35, 18]. In [35] a cost model is used

to identify beneﬁcial indexes and decide when to create or drop an index at runtime.

[18] proposes an agent-based database architecture to deal with automatic index

creation. Microsoft Research has proposed a physical design alerter [11] to identify

when a modiﬁcation to the physical design could result in improved performance.

3.3 Approach

3.3.1 Problem Statement

In this section we deﬁne the problem of index selection for a multi-dimensional

space using a query workload.

A query workload W consists of a set of queries that select objects within a

speciﬁed subspace in the data domain. More formally, we deﬁne a workload as follows:

Deﬁnition 4 A workload W is a tuple W = (D, DS, Q), where D is the domain,

DS ⊆ D is a ﬁnite subset (the data set), and Q (the query set) is a set of subsets of

DS.

In our case, the domain is ℝ^d, where d is the dimensionality of the dataset, the instance DS is a set of n tuples t = {(a_1, a_2, ..., a_d) | a_i ∈ ℝ}, and the query set Q is a set of range queries.


Finding the answers to a query, when no index is present, reduces to scanning all

the points in the dataset and testing whether the query conditions are met. In this

scenario we can deﬁne the cost of answering the query as the time it takes to scan the

dataset, i.e. the time to retrieve the data pages from disk. The assumption is that

the time spent doing I/O dominates the time required to perform the simple bound

comparisons. In the case that an index is present, the cost of answering the query

can be lower. The index can identify a smaller set of potential matching objects, and

only those data pages containing these objects need to be retrieved from disk. The

degree to which an index prunes the potential answer set for a query determines its

eﬀectiveness for the query.

Our problem can be deﬁned as ﬁnding a set of indexes I, given a multi-dimensional

dataset DS, a query workload W, an optional indexing constraint C, an optional

analysis time constraint t_a, that provides the best estimated cost over W. In the

context of this problem an index is considered to be the set of attributes that can

be used to prune subspace simultaneously with respect to each attribute. Therefore,

attribute order has no impact on the amount of pruning possible.

The overall goal of this work is to develop a ﬂexible index selection framework

that can be tuned to achieve eﬀective static and online index selection for high-dimensional data under diﬀerent analysis constraints.

For static index selection, when no constraints are speciﬁed, we aim to recommend the set of indexes that yields the lowest estimated cost for every query in the workload that can beneﬁt from an index. When a constraint is speciﬁed, either as a maximum number of indexes or a time constraint,

we want to recommend a set of indexes within the constraint, from which the queries


can beneﬁt the most. When there is a time-constraint, we need to automatically

adjust the analysis parameters to increase the speed of analysis.

For online index selection, we are striving to develop a system that can recommend

an evolving set of indexes for incoming queries over time, such that the beneﬁt of

index set changes outweighs the cost of making those changes. Therefore, we want an

online index selection system that diﬀerentiates between low-cost index set changes

and higher cost index set changes and also can make decisions about index set changes

based on diﬀerent cost-beneﬁt thresholds.

3.3.2 Approach Overview

In order to measure the beneﬁt of using a potential index over a set of queries, we

need to be able to estimate the cost of executing the queries, with and without the

index. Typically, a cost model is embedded into the query optimizer to decide on the

query plan, whether the query should be answered using a sequential scan or using

an existing index. Instead of using the query optimizer to estimate query cost, we

conservatively estimate the number of matches associated with using a given index

by using a multi-dimensional histogram abstract representation of the dataset. The

histogram captures data correlations between only those attributes that could be rep-

resented in a selected index. The cost associated with an index is calculated based on

the number of estimated matches derived from the histogram and the dimensionality

of the index. Increasing the size of the multi-dimensional histogram enhances the

accuracy of the estimate at the cost of abstract representation size.

While maintaining the original query information for later use to determine esti-

mated query cost, we apply one abstraction to the query workload to convert each


query into the set of attributes referenced in the query. We perform frequent itemset

mining over this abstraction and only consider those sets of attributes that meet a

certain support to be potential indexes. By varying the support, we aﬀect the speed

of index selection and the ratio of queries that are covered by potential indexes. We

further prune the analysis space using association rule mining by eliminating those

subsets above a certain conﬁdence threshold. Lowering the conﬁdence threshold im-

proves analysis time by eliminating some lower dimensional indexes from considera-

tion but can result in recommending indexes that cover a strict superset of the queried

attributes.

Our technique diﬀers from existing tools in the method we use to determine the

potential set of indexes to evaluate and in the quantization-based technique we use

to estimate query costs. All of the commercial index wizards work in design time.

The Database Administrator (DBA) has to decide when to run this wizard and over

which workload. The assumption is that the workload is going to remain static over

time and in case it does change, the DBA would collect the new workload and run the

wizard again. The ﬂexibility aﬀorded by the abstract representation we use allows

it to be used for infrequent, index selection considering a broader analysis space or

frequent, online index selection.

In the following two sections we present our proposed solution for index selection

which is used for static index selection and as a building block for the online index

selection.


Figure 3.1: Index Selection Flowchart

3.3.3 Proposed Solution for Index Selection

The goal of the index selection is to minimize the cost of the queries in the workload

given certain constraints. Given a query workload, a dataset, the indexing constraints,

and several analysis parameters, our framework produces a set of suggested indexes

as an output. Figure 3.1 shows a ﬂow diagram of the index selection framework.

Table 3.1 provides a list of the notations used in the descriptions.

We identify three major components in the index selection framework: the ini-

tialization of the abstract representations, the query cost computation, and the index

selection loop. In the following subsections we describe these components and the

dataﬂow between them.

Initialize Abstract Representations

The initialization step uses a query workload and the dataset to produce a set of

Potential Indexes P, a Query Set Q, and a Multi-dimensional Histogram H, according

to the support, conﬁdence, and histogram size speciﬁed by the user. The description

of the outputs and how they are generated is given below.

• Potential Index Set P


Symbol           Description
P                Potential Set of Indexes, the set of attribute sets under consideration as suggested indexes
Q                Query Set, a representation of the query workload; for each query it consists of the attribute sets in P that intersect with the query attributes, the query ranges, and the estimated query costs
H                Multi-Dimensional Histogram, used as a workload-optimized abstract representation of the data set
S                Suggested Indexes, the set of attribute sets currently selected as recommended indexes
i                attribute set currently under analysis
support          the minimum ratio of occurrence of an attribute set in the query workload for it to be included in P
confidence       the maximum ratio of a set's occurrence to the occurrence of a subset before the subset is pruned from P
histogram size   the number of bits used to represent a quantized data object

Table 3.1: Index Selection Notation List

The potential index set P is a collection of attribute sets that could be beneﬁcial

as an index for the queries in the input query workload. This set is computed using

traditional data mining techniques. Considering the attributes involved in each query

from the input query workload to be a single transaction, P consists of the sets of

attributes that occur together in a query at a ratio greater than the input support.

Formally, support of a set of attributes A is deﬁned as:

S_A = ( Σ_{i=1}^{n} [A ⊆ Q_i] ) / n

where [A ⊆ Q_i] is 1 if A ⊆ Q_i and 0 otherwise, Q_i is the set of attributes in the i-th query, and n is the number of queries.

For instance, if the input support is 10%, and attributes 1 and 2 are queried

together in greater than 10 percent of the queries, then a representation of the set


of attributes {1,2} will be included as a potential index. Note that any subset of an attribute set that meets the support requirement necessarily meets the support as well, so all subsets of frequent attribute sets are also included as potential indexes (in the example above, both the sets {1} and {2} will be included). As

the input support is decreased, the number of potential indexes increases. Note that

our particular system is built independently from a query optimizer, but the sets of

attributes appearing in the predicates from a query optimizer log could just as easily

be substituted for the query workload in this step.
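The frequent-attribute-set mining described above can be sketched as follows. This is a brute-force enumeration rather than a tuned apriori implementation, and the example workload is illustrative; it only relies on the support definition given above (queries reduced to sets of queried attributes).

```python
from itertools import combinations

def frequent_attribute_sets(queries, min_support):
    """Enumerate attribute sets whose support in the workload meets
    min_support. Brute-force with an early stop: if no set of a given
    size is frequent, no larger set can be (apriori property)."""
    n = len(queries)
    support = lambda attrs: sum(1 for q in queries if attrs <= q) / n
    universe = sorted(set().union(*queries))
    frequent = {}
    for size in range(1, len(universe) + 1):
        found = False
        for combo in combinations(universe, size):
            s = support(frozenset(combo))
            if s >= min_support:
                frequent[frozenset(combo)] = s
                found = True
        if not found:
            break
    return frequent

# Example workload: each query reduced to its set of queried attributes.
queries = [frozenset(q) for q in [{1, 2}, {1, 2}, {1, 2, 3}, {2}, {3}]]
P = frequent_attribute_sets(queries, min_support=0.4)
# Attributes 1 and 2 are queried together in 3 of 5 queries,
# so {1, 2} (and its subsets {1} and {2}) are potential indexes.
```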

If a set occurs nearly as often as one of its subsets, an index built over the subset

will likely not provide much beneﬁt over the query workload if an index is built over

the attributes in the set. Such an index will only be more eﬀective in pruning data

space for those queries that involve only the subset’s attributes. In order to enhance

analysis speed with limited eﬀect on accuracy, the input conﬁdence is used to prune

analysis space. Conﬁdence is the ratio of a set’s occurrence to the occurrence of a

subset.

While data mining the frequent attribute sets in the query workload in determin-

ing P, we also maintain the association rules for disjoint subsets and compute the

conﬁdence of these association rules. The conﬁdence of an association rule is deﬁned

as the ratio that the antecedent (Left Hand Side of the rule) and consequent (Right

Hand Side of the rule) appear together in a query given that the antecedent appears

in the query. Formally, conﬁdence of an association rule {set of attributes A}→{set

of attributes B}, where A and B are disjoint, is deﬁned as:


C_{A→B} = ( Σ_{i=1}^{n} [(A ∪ B) ⊆ Q_i] ) / ( Σ_{i=1}^{n} [A ⊆ Q_i] )

where [·] is 1 if the enclosed condition holds and 0 otherwise, Q_i is the set of attributes in the i-th query, and n is the number of queries.

In our example, if every time attribute 1 appears, attribute 2 also appears then

the conﬁdence of {1}→{2} = 1.0. If attribute 2 appears without attribute 1 as many

times as it appears with attribute 1, then the conﬁdence {2}→{1} = 0.5. If we have

set the conﬁdence input to 0.6, then we will prune the attribute set {1} from P, but

we will keep attribute set {2}.
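The confidence-based pruning of the example can be sketched directly; the function name and workload are illustrative, and only subset/superset pairs that both appear in P are checked, mirroring the association rules maintained during mining.

```python
def prune_by_confidence(P, queries, max_confidence):
    """Drop a frequent attribute set A from P when some frequent strict
    superset occurs in queries nearly as often as A does (confidence of
    the corresponding rule exceeds max_confidence): an index over the
    superset then serves almost all of A's queries."""
    count = lambda attrs: sum(1 for q in queries if attrs <= q)
    kept = set(P)
    for a in P:
        for sup in P:
            if a < sup and count(a) > 0:          # a is a strict subset
                confidence = count(sup) / count(a)
                if confidence > max_confidence:
                    kept.discard(a)
                    break
    return kept

queries = [frozenset(q) for q in [{1, 2}, {1, 2}, {1, 2}, {2}]]
P = {frozenset({1}), frozenset({2}), frozenset({1, 2})}
# conf({1}->{2}) = 3/3 = 1.0 > 0.8, so {1} is pruned;
# conf({2}->{1}) = 3/4 = 0.75 <= 0.8, so {2} is kept.
kept = prune_by_confidence(P, queries, max_confidence=0.8)
```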

We can also set the conﬁdence level based on attribute set cardinality. Since

the cost of including extra attributes that are not useful for pruning increases with

increased indexed dimensionality, we want to be more conservative with respect to

pruning attribute subsets. The conﬁdence input can therefore take a diﬀerent value for each set cardinality.

While the apriori algorithm was appropriate for the relatively low attribute query

sets in our domain, a more eﬃcient algorithm such as the FP-Tree [30] could be applied

if the attribute sets associated with queries are too large for the apriori technique to

be eﬃcient. While it is desirable to avoid examining a high-dimensional index set

as a potential index, another possible solution in the case where a large number of

attributes are frequent together would be to partition a large closed frequent itemset

into disjoint subsets for further examination. Techniques such as CLOSET [44] could

be used to arrive at the initial closed frequent itemsets.

• Query Set Q


The query set Q, the abstract representation of the query workload, is initialized by associating with each query the potential indexes that could be beneﬁcial for it. These are the indexes in the potential index set P that share at least

one common attribute with the query. At the end of this step, each query has an

identiﬁed set of possible indexes for that query.

• Multi-Dimensional Histogram H

An abstract representation of the data set is created in order to estimate the query

cost associated with using each query’s possible indexes to answer that query. This

representation is in the form of a multi-dimensional histogram H. A single bucket

represents a unique bit representation across all the attributes represented in the

histogram. The input histogram size dictates the number of bits used to represent

each unique bucket in the histogram. These bits are designated to represent only

the single attributes that met the input support in the input query workload. If a

single attribute does not meet the support, then it cannot be part of an attribute

set appearing in P. There is no reason to sacriﬁce data representation resolution for

attributes that will not be evaluated. The number of bits that each of the represented

attributes gets is proportional to the log of that attribute’s support. This gives more

resolution to those attributes that occur more frequently in the query workload.

Data for an attribute that has been assigned b bits is divided into 2^b buckets.

In order to handle data sets with uneven data distribution, we deﬁne the ranges of

each bucket so that each bucket contains roughly the same number of points. The

histogram is built by converting each record in the data set to its representation

in bucket numbers. As we process data rows, we only aggregate the count of rows

with each unique bucket representation because we are just interested in estimating


query cost. Note that the multi-dimensional histogram is based on a scalar quantizer

designed on data and access patterns, as opposed to just data in the traditional case.

A higher accuracy in representation is achieved by using more bits to quantize the

attributes that are more frequently queried.

For illustration, Table 3.2 shows a simple multi-dimensional histogram example.

This histogram covers 3 attributes and uses 1 bit to quantize attributes 2 and 3, and

2 bits to quantize attribute 1, assuming it is queried more frequently than the other

attributes. In this example, for attributes 2 and 3 values from 1-5 quantize to 0, and

values from 6-10 quantize to 1. For attribute 1, values 1 and 2 quantize to 00, 3 and

4 quantize to 01, 5-7 quantize to 10, and 8 and 9 quantize to 11. The .’s in the ’value’

column denote attribute boundaries (i.e. attribute 1 has 2 bits assigned to it).

Note that we do not maintain any entries in the histogram for bit representations that have no occurrences. Thus we cannot have more histogram entries than records, and we do not suﬀer from an exponentially increasing number of potential multi-dimensional histogram buckets for high-dimensional histograms.

Sample Dataset                     Histogram
A1  A2  A3   Encoding              Value    Count
 2   5   5     0000                00.0.0     2
 4   8   3     0110                00.0.1     1
 1   4   3     0000                01.0.0     1
 6   7   1     1010                01.1.0     1
 3   2   2     0100                01.1.1     1
 2   2   6     0001                10.1.0     2
 5   6   5     1010                11.0.0     1
 8   1   4     1100                11.0.1     1
 3   8   7     0111
 9   3   8     1101

Table 3.2: Histogram Example
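The histogram construction behind Table 3.2 can be sketched as below. The bucket boundaries are taken from the example (attribute 1 gets 2 bits, so 4 buckets; attributes 2 and 3 get 1 bit each); the function name and the list-of-upper-bounds representation of the edges are illustrative choices, not part of the thesis framework.

```python
from bisect import bisect_left
from collections import Counter

def build_histogram(rows, bucket_edges):
    """Quantize each attribute to its bucket number (bucket_edges[i] lists
    each bucket's inclusive upper bound for attribute i, as produced by an
    equi-depth design) and count rows per unique bucket combination."""
    hist = Counter()
    for row in rows:
        key = tuple(bisect_left(edges, v)
                    for v, edges in zip(row, bucket_edges))
        hist[key] += 1
    return hist

# The data and bucket boundaries of Table 3.2.
rows = [(2, 5, 5), (4, 8, 3), (1, 4, 3), (6, 7, 1), (3, 2, 2),
        (2, 2, 6), (5, 6, 5), (8, 1, 4), (3, 8, 7), (9, 3, 8)]
edges = [[2, 4, 7, 9], [5, 10], [5, 10]]
H = build_histogram(rows, edges)
# Rows (2,5,5) and (1,4,3) both quantize to bucket (0,0,0) -- the
# histogram entry 00.0.0 with count 2 in Table 3.2.
```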


Query Cost Calculation

Once generated, the abstract representations of the query set Q and the multi-

dimensional histogram H are used to estimate the cost of answering each query using

all possible indexes for the query. For a given query-index pair, we aggregate the

number of matches we ﬁnd in the multi-dimensional histogram looking only at the

attributes in the query that also occur in the index (bits associated with other at-

tributes are considered to be don’t cares in the query matching logic). To estimate

the query cost, we then apply a cost function based on the number of matches we

obtain using the index and the dimensionality of the index. At the end of this step,

our abstract query set representation has estimated costs for each index that could

improve the query cost. For each query in the query set representation, we also keep

a current cost ﬁeld, which we initialize to the cost of performing the query using

sequential scan. At this point, we also initialize an empty set of suggested indexes S.
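The match aggregation with don't-care attributes can be sketched as follows. The representation of a query as a mapping from attribute position to admissible bucket numbers is an illustrative assumption; the thesis describes the equivalent bit-level matching.

```python
def estimated_matches(hist, query_buckets, index_attrs):
    """Count histogram entries matching the query on the attributes shared
    by the query and the index; all other attributes are don't-cares.
    query_buckets maps attribute position -> set of admissible buckets."""
    relevant = {a: b for a, b in query_buckets.items() if a in index_attrs}
    return sum(count for key, count in hist.items()
               if all(key[a] in buckets for a, buckets in relevant.items()))

# A toy histogram keyed by bucket tuples over three attributes.
hist = {(0, 0, 0): 2, (1, 1, 0): 1, (2, 1, 0): 2, (3, 0, 1): 1}
# Query restricts attribute 0 to bucket 0 and attribute 1 to bucket 1.
query = {0: {0}, 1: {1}}
m_full = estimated_matches(hist, query, index_attrs={0, 1})   # both pruned
m_partial = estimated_matches(hist, query, index_attrs={0})   # attr 1 is a don't-care
# The two-attribute index prunes every entry; the single-attribute index
# leaves 2 candidate records, illustrating multi-dimensional pruning.
```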

• Cost Function

A cost function is used to estimate the cost associated with using a certain index

for a query. The cost function can be varied to accurately reﬂect a cost model for the

database system. For example, one could apply a cost function that amortized the

cost of loading an index over a certain number of queries or use a function tailored

to the type of index that is used. Many cost functions have been proposed over the

years. For an R-Tree, which is the index type used for this work, the expected number

of data page accesses is estimated in [22] by:

A_{nn,mm,FBF} = ( (1/C_eff)^{1/d} + 1 )^d

where d is the dimensionality of the dataset and C_eff is the number of data objects per disk page. However, this formula assumes that the number of points N approaches inﬁnity and does not consider the eﬀects of high dimensionality or correlations.

A more recently proposed cost model is given in [8], where the expected number of page accesses is determined as:

A_{r,em,ui}(r) = ( 2r · (N/C_eff)^{1/d} + 1 − 1/C_eff )^d

where r is the radius of the range query, d is the dataset dimensionality, N is the number of data objects, and C_eff is the capacity of a data page.

While these published cost estimates can be eﬀective to estimate the number of

page accesses associated with using a multi-dimensional index structure under certain

conditions, they have certain characteristics that make them less than ideal for the

given situation. Each of the cost estimate formulas requires a range radius. Therefore

the formulas break down when assessing the cost of a query that is an exact match

query in one or more of the query dimensions. These cost estimates also assume that

data distribution is independent between attributes, and that the data is uniformly

distributed throughout the data space.

In order to overcome these limitations, we apply a cost estimate that is based on

the actual matches that occur in the multi-dimensional histogram over the attributes

that form a potential index. The cost model for R-trees we use in this work is given

by

(d

(d/2)

∗ m)

where d is the dimensionality of the index and m is the number of matches returned for

query matching attributes in the multi-dimensional histogram. Using actual matches


eliminates the need for a range radius. It also ties the cost estimate to the actual

data characteristics (i.e. incorporates both data correlation between attributes and

data distribution, while the published models will produce results that are dependent

only on the range radius for a given index structure). The cost estimate provided

is conservative in that it will provide a result that is at least as great as the actual

number of matches in the database.
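A minimal sketch of this cost model; the function name and the example values are illustrative, showing only the dimensionality penalty described above.

```python
def index_cost(d, m):
    """Cost estimate for answering a query with a d-dimensional index
    returning m candidate matches from the histogram. The d**(d/2) term
    penalizes higher index dimensionality (hyperspace overlap and
    traversal overhead)."""
    return (d ** (d / 2)) * m

# With equal pruning power (same m), a lower-dimensional index is cheaper:
index_cost(2, 10)   # 2.0  * 10 = 20.0
index_cost(4, 10)   # 16.0 * 10 = 160.0
```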

By evaluating the number of matches over the set of attributes that match the

query, the multi-dimensional subspace pruning that can be achieved using diﬀerent

index possibilities is taken into account. There is additional cost associated with

higher dimensionality indexes due to the greater number of overlaps of the hyperspaces

within the index structure, and additional cost of traversing the higher dimension

structure. A penalty is imposed on a potential index by the dimensionality term.

Given equal ability to prune the space, a lower dimensional index will translate into

a lower cost.

The cost function could be more complicated in order to more accurately model

query costs. It could model query cost with greater accuracy, for example by crediting

complete attribute coverage for covering queries. It could also reﬂect the appropriate

index structures used in the database system, such as B+-trees. We used this partic-

ular cost model because the index type was appropriate for our data and query sets,

and we assumed that we would retrieve data from disk for all query matches.

Index Selection Loop

After initializing the index selection data structures and updating estimated query

costs for each potentially useful index for a query, we use a greedy algorithm that takes

into account the indexes already selected to iteratively select indexes that would be


appropriate for the given query workload and data set. For each index in the potential

index set P, we traverse the queries in query set Q that could be improved by that

index and accumulate the improvement associated with using that index for that

query. The improvement for a given query-index pair is the diﬀerence between the

cost for using the index and the query’s current cost. If the index does not provide

any positive beneﬁt for the query, no improvement is accumulated. The potential

index i that yields the highest improvement over the query set Q is considered to

be the best index. Index i is removed from potential index set P and is added to

suggested index set S. For the queries that beneﬁt from i, the current query cost is

replaced by the improved cost.

After each i is selected, a check is made to determine if the index selection loop

should continue. The input indexing constraint provides one of the loop stop criteria.

The indexing constraint could be any constraint such as the number of indexes, total

index size, or total number of dimensions indexed. If no potential index yields further

improvement, or the indexing constraints have been met, then the loop exits. The

set of suggested indexes S contains the results of the index selection algorithm.
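The greedy loop can be sketched as below. It omits the abstract-representation pruning between iterations, assumes the indexing constraint is a maximum number of indexes, and uses an illustrative cost table keyed by query and index names.

```python
def greedy_index_selection(costs, seq_scan_cost, max_indexes):
    """costs[q][i]: estimated cost of answering query q with potential
    index i. Repeatedly pick the index with the greatest total improvement
    over the queries' current costs until no index yields further
    improvement or the indexing constraint is met."""
    current = {q: seq_scan_cost for q in costs}
    candidates = {i for q in costs for i in costs[q]}
    suggested = []
    while len(suggested) < max_indexes:
        best, best_gain = None, 0.0
        for i in candidates:
            gain = sum(max(current[q] - c[i], 0.0)
                       for q, c in costs.items() if i in c)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:          # no index improves any query further
            break
        suggested.append(best)
        candidates.discard(best)
        for q, c in costs.items():
            if best in c:
                current[q] = min(current[q], c[best])
    return suggested

# Two queries: index 'AB' helps q1 a lot, 'C' helps q2 a little.
costs = {'q1': {'AB': 10.0, 'A': 40.0}, 'q2': {'C': 70.0, 'AB': 90.0}}
chosen = greedy_index_selection(costs, seq_scan_cost=100.0, max_indexes=2)
# -> ['AB', 'C']: 'AB' is picked first (largest total improvement),
# then 'A' offers no further gain for q1, so 'C' is picked for q2.
```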

At the end of a loop iteration, when possible, we prune the complexity of the

abstract representations in order to make the analysis more eﬃcient. This includes

actions such as eliminating potential indexes that do not provide better cost estimates

than the current cost for any query and pruning from consideration those queries

whose best index is already a member of the set of suggested indexes. The overall

speed of this algorithm is coupled with the number of potential indexes analyzed,

so the analysis time can be reduced by increasing the support or decreasing the

conﬁdence.


Diﬀerent strategies can be used in selecting a best index. The strategy provided

assumes an indexing constraint based on the number of indexes and therefore uses

the total beneﬁt derived from the index as the measure of index ’goodness’. If the

indexing constraint is based on total index size, then beneﬁt per index size unit

may be a more appropriate measure. However, this may result in recommending a

lower-dimensioned index and later in the algorithm a higher-dimensioned index that

always performs better. The recommendation set can be pruned in order to avoid

recommending an index that is non-useful in the context of the complete solution.

3.3.4 Proposed Solution for Online Index Selection

The online index selection is motivated by the fact that query patterns can change

over time. By monitoring the query workload and detecting when there is a change on

the query pattern that generated the existing set of indexes, we are able to maintain

good performance as query patterns evolve. In our approach, we use control feedback

to monitor the performance of the current set of indexes for incoming queries and

determine when adjustments should be made to the index set. In a typical control

feedback system, the output of a system is monitored and based on some function

involving the input and output, the input to the system is readjusted through a

control feedback loop. Our situation is analogous but more complex than the typical

electrical circuit control feedback system in several ways:

1. Our system input is a set of indexes and a set of incoming queries rather than

a simple input, such as an electrical signal.

2. The system output must be some parameter that we can measure and use to

make decisions about changing the input. Query performance is the obvious


parameter to monitor. However, because lower query performance could be

related to other aspects rather than the index set, our decision making control

function must necessarily be more complex than a basic control system.

3. We do not have a predictable function to relate system input and output because

of the non-determinism associated with new incoming queries. For example, we

may have a set of attributes that appears in queries frequently enough that our

system indicates that it is beneﬁcial to create an index over those attributes,

but there is no guarantee that those attributes will ever be queried again.

Control feedback systems can fail to be eﬀective with respect to response time.

The control system can be too slow to respond to changes, or it can respond too

quickly. If the system is too slow, then it fails to cause the output to change based

on input changes in a timely manner. If it responds too quickly, then the output

overshoots the target and oscillates around the desired output before reaching it.

Both situations are undesirable and should be designed out of the system.

Figure 3.2 represents our implementation of dynamic index selection. Our system

input is a set of indexes and a set of incoming queries. Our system simulates and

estimates costs for the execution of incoming queries. System output is the ratio of

potential system performance to actual system performance in terms of database page

accesses to answer the most recent queries. We implement two control feedback loops.

One is for ﬁne grain control and is used to recommend minor, inexpensive changes

to the index set. The other loop is for coarse control and is used to avoid very

poor system performance by recommending major index set changes. Each control

feedback loop has decision logic associated with it.


Figure 3.2: Dynamic Index Analysis Framework

System Input

The system input is made up of new incoming queries and the current set of indexes

I, which is initialized to be the suggested indexes S from the output of the initial

index selection algorithm. For clarity, a notation list for the online index selection is

included as Table 3.3.

System

The system simulates query execution over a number of incoming queries. The

abstract representation of the last w queries is stored as W, where w is an adjustable

window size parameter. W is used to estimate performance of a hypothetical set of

indexes I_new against the current index set I. This representation is similar to the one

kept for query set Q in the static index selection. In this case, when a new query

q arrives, we determine which of the current indexes in I most eﬃciently answers


Symbol   Description
I        Current set of attribute sets used as indexes
I_new    Hypothetical set of attribute sets used as indexes
w        Window size
W        Abstract representation of the last w queries
q        The current query under analysis
P        Current Potential Indexes, the set of attribute sets in consideration to be
         indexes before q arrives
P_new    New Potential Indexes, the set of attribute sets in consideration to be
         indexes after q arrives
i_q      The attribute set estimated to be the best index for query q

Table 3.3: Notation List for Online Index Selection

this query and replace the oldest query in W with the abstract representation of q.

We also incrementally compute the attribute sets that meet the input support and

conﬁdence over the last w queries. This information is used in the control feedback

loop decision logic. The system also keeps track of the current potential indexes P,

and the current multi-dimensional histogram H.
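The window maintenance described above can be sketched as follows. The class and method names are hypothetical, and for simplicity the support check counts only whole attribute sets per query rather than mining all frequent subsets:

```python
from collections import deque, Counter

class QueryWindow:
    """Sliding window W of the last w query abstractions (hypothetical sketch)."""

    def __init__(self, w, support):
        self.w = w                      # window size parameter
        self.support = support          # minimum fraction of queries in W
        self.window = deque(maxlen=w)   # the deque evicts the oldest query
        self.counts = Counter()         # attribute-set frequencies over W

    def add(self, attr_set):
        """Replace the oldest query in W with the abstraction of new query q."""
        if len(self.window) == self.w:
            oldest = self.window[0]
            self.counts[oldest] -= 1    # incrementally retire the oldest query
        self.window.append(attr_set)
        self.counts[attr_set] += 1

    def frequent_sets(self):
        """Attribute sets meeting the support threshold over the last w queries."""
        cutoff = self.support * len(self.window)
        return {s for s, c in self.counts.items() if c >= cutoff}
```

Because both the eviction and the insertion touch a single counter each, the frequent attribute sets are maintained incrementally per query, as the text requires.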

System Output

In order to monitor the performance of the system, we compare the query per-

formance using the current set of indexes I to the performance using a hypothetical

set of indexes I_new. The query performance using I is the summation of the costs of

queries using the best index from I for the given query. Consider the possible new

indexes P_new to be the set of attribute sets that currently meet the input support
and confidence over the last w queries. The hypothetical cost is calculated differently
based on the comparison of P and P_new, and the identified best index i_q from P or
P_new for the new incoming query:


1. P = P_new and i_q is in I. In this case we bypass the control loops since we could
do no better for the system by changing possible indexes.

2. P = P_new and i_q is not in I. We recompute a new set of suggested indexes I_new
over the last w queries. The hypothetical cost is the cost over the last w queries
using I_new.

3. P ≠ P_new and i_q is in I. In this case we bypass the control loops since we could
do no better for the system by changing possible indexes.

4. P ≠ P_new and i_q is not in I. We traverse the last w queries and determine those
queries that could benefit from using a new index from P_new. We compute the
hypothetical cost of these queries to be the real number of matches from the
database. Hypothetical cost for other queries is the same as the real cost.

The ratio of the hypothetical cost, which indicates potential performance, to the

actual performance is used in the control loop decision logic.
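A minimal sketch of this decision logic follows, assuming the baseline threshold values used later in the experiments (0.95 minor, 0.9 major) as defaults; the function and parameter names are hypothetical:

```python
def control_decision(P, P_new, i_q, I, hyp_cost, actual_cost,
                     minor_threshold=0.95, major_threshold=0.9):
    """Route a new query through the two control loops (hypothetical sketch).

    Returns the recommended action according to the four cases above.
    """
    ratio = hyp_cost / actual_cost  # potential / actual performance
    if i_q in I:
        # Cases 1 and 3: the best index already exists; bypass both loops.
        return "no_change"
    if P == P_new:
        # Case 2: fine grain loop -- minor, inexpensive index set changes.
        return "minor_change" if ratio < minor_threshold else "no_change"
    # Case 4: coarse loop -- rerun static index selection over the window.
    return "major_change" if ratio < major_threshold else "no_change"
```

Raising either threshold makes the corresponding loop fire more often, matching the sensitivity behavior described for the fine grain and coarse loops.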

Fine Grain Control Loop

The fine grain control loop is used to recommend low cost, minor changes to the index
set. This loop is entered in case 2 as described above when the ratio of hypothetical
performance to actual performance is below some input minor change threshold.

Then the indexes are changed to I_new, and appropriate changes are made to update

the system data structures. Increasing the input minor change threshold causes the

frequency of minor changes to also increase.


Coarse Control Loop

The coarse control loop is used to recommend more costly changes to the index set,
but with greater impact on future performance. This loop is entered in case 4 as

described above when the ratio of hypothetical performance to actual performance is

below some input major change threshold. Then the static index selection is performed

over the last w queries, abstract representations are recomputed, and a new set of

suggested indexes I_new is generated. Appropriate changes are made to update the

system data structures to the new situation. Increasing the input major change

threshold increases the frequency of major changes.

3.3.5 System Enhancements

In the following subsections, we present two system enhancements that provide

further robustness and scalability to our framework.

Self Stabilization

Control feedback systems can be either too slow or too responsive in reacting

to input changes. In our application, a system that is slow to respond results in

recommending useful indexes long after they ﬁrst could have a positive eﬀect. It

could also fail to recommend potentially useful indexes if thresholds are set so that

the system is insensitive to change. A system that is too responsive can result in

system instability, where the system continuously adds and drops indexes.

The system performance and rate of index change can be monitored and used

in order to tune the control feedback system itself. If the actual number of query

results is much lower than the estimated cost using the recommended index set over

a window of queries, this indicates a non-responsive system. When this condition is


detected, the system can be made more responsive by cutting the window size. This

increases the probability that P_new will be different from P, and we can re-perform the
analysis applying a reduced support. This gives us a more complete P with respect

to answering a greater portion of queries.

If the frequency of index change is too high with little or no improvement in query

performance, an oversensitive system or unstable query pattern is indicated. We can

reduce the sensitivity by increasing the window size or increasing the support level

during recomputation of new recommended indexes.
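These two tuning rules can be sketched as follows; the specific trigger values (a 2x cost gap, 5 index changes, 5% improvement) are illustrative assumptions rather than values from the experiments:

```python
def retune(window_size, support, actual_results, estimated_cost,
           index_changes, cost_improvement):
    """Adjust control sensitivity from observed behavior (hypothetical sketch)."""
    # Non-responsive: actual result counts far below the estimated cost with
    # the recommended index set -> shrink the window and lower support so that
    # P_new diverges from P sooner and covers more queries.
    if actual_results < 0.5 * estimated_cost:
        window_size = max(window_size // 2, 10)
        support *= 0.5
    # Over-sensitive: frequent index churn with little gain -> widen the
    # window and raise support during recomputation.
    elif index_changes > 5 and cost_improvement < 0.05:
        window_size *= 2
        support *= 2
    return window_size, support
```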

Partitioning the Algorithm

The system can also be applied to environments where the database is parti-

tioned horizontally or vertically at diﬀerent locations. A solution at one extreme

is to maintain a centralized index recommendation system. This would maintain

the abstract representation and collect global query patterns over the entire system.

Recommended indexes would be determined based on global trends. This approach

would allow for the creation of indexes that maximize pruning over the global set of

dimensions. However, it would not optimize query performance at the site level. A

single set of indexes would be recommended for the entire system.

At the other extreme would be to perform the index recommendation at each

site. In this case, a diﬀerent abstract representation would be built based on the

data at each speciﬁc site given requests to the data at that site. Indexes would

be recommended based on the query traﬃc to the data at that site. This allows

tight control of the indexes to reﬂect the data at each site. However, index-based

pruning based on attributes and data at multiple locations would not be possible.


This approach also requires traﬃc to each site that contains potentially matching

data.

A hybrid approach can provide the beneﬁts of index recommendation based on

global data and query patterns while optimizing the index set at diﬀerent requesting

locations. A global set of indexes can be centrally recommended based on a global

abstract representation of the data set with low support thresholds (the centralized

location will store a larger set of globally good indexes). The query patterns emanat-

ing from each site can be mined in order to ﬁnd which of those global indexes would

be appropriate to maintain at the local site, eliminating some traﬃc.

3.4 Empirical Analysis

3.4.1 Experimental Setup

Data Sets

Several data sets were used during the performance of experiments. The variation

in data sets is intended to show the applicability of our algorithm to a wide range of

data sets and to measure the eﬀect that data correlation has on results. Data sets

used include:

• random - a set of 100,000 records consisting of 100 dimensions of uniformly

distributed integers between 0 and 999. The data is not correlated.

• stocks - a set of 6500 records consisting of 360 dimensions of daily stock market

prices. This data is extremely correlated.


• mlb - a set of 33619 records of major league pitching statistics from between the

years of 1900 and 2004 consisting of 29 dimensions of data. Some dimensions

are correlated with each other, while others are not at all correlated.

Analysis Parameters

The eﬀect of varying several analysis input parameters including support, multi-

dimensional histogram size, and online indexing control feedback decision thresholds

was analyzed. Unless otherwise speciﬁed, the confidence parameter for the experi-

ments is 1.0.

Query Workloads

It is desirable to explore the general behavior of database interactions without

inadvertently capturing the coupling associated with using a speciﬁc query history

on the same database. Therefore, query workload ﬁles were generated by merging

synthetic query histories and query histories from real-world applications with dif-

ferent data sets. Histories and data were merged by taking a random record from

a data set and the numerical identiﬁer of the attributes involved in the synthetic or

historical query in order to generate a point query. So, if a historical query involved

the 3rd and 5th attribute, and the n

th

record was randomly selected from the data

set, a SELECT type query is generated from the data set where the 3rd attribute

is equal to the value of the 3rd attribute of the n

th

record and the 5th attribute is

equal to the value of 5th attribute of the n

th

record. This gives a query workload that

reﬂects the attribute correlations within queries, and has a variable query selectivity.

Unless otherwise stated, each query in experiments is a point query with respect to


the attributes covered by the query. Attributes that are not covered by the query can

be any value and still match the query.
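The merging procedure above can be sketched as follows, with a hypothetical function name:

```python
import random

def make_point_query(history_attrs, dataset):
    """Merge a historical query's attribute set with a random record to
    generate a point query (sketch of the procedure described above)."""
    record = random.choice(dataset)  # the n-th record, chosen at random
    # The query constrains exactly the attributes the historical query used,
    # taking that record's values; uncovered attributes match any value.
    return {a: record[a] for a in history_attrs}
```

For example, a historical query over attributes 3 and 5 yields a point query `{3: v3, 5: v5}` drawn from a single record, so attribute correlations inside the data are preserved in the workload.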

These query histories form the basis for generating the query workloads used in

our experiments:

1. synthetic - 500 randomly generated queries. The distribution of the queries over

the ﬁrst 200 queries is 20% involve attributes {1,2,3,4} together, 20% {5,6,7},

20% {8,9}, and the remaining queries involve between 1 and 5 attributes that

could be any attribute. Over the last 300 queries, the distribution shifts to

20% covering attributes {11,12,13,14}, 20% {15,16,17}, 20% {18,19}, and the

remaining 40% are between 1 to 5 attributes that could be any attribute.

2. clinical - 659 queries executed from a clinical application. The query distribution

ﬁle has 64 distinct attributes.

3. hr - 35,860 queries executed from a human resources application. The query

distribution ﬁle has 54 attributes. Due to the size of this query set, some initial

portion of the queries are used for some experiments.

3.4.2 Experimental Results

Index Selection with Relaxed Constraints

Figure 3.3 shows how accurately the proposed static index selection technique can

perform with relaxed analysis constraints. For this scenario, a low support value,

0.01, is used. There are no constraints placed on the number of selected indexes,
and 256 bits are used for the multi-dimension histogram. Figure 3.3 shows the cost

of performing a sequential scan over 100 queries using the indicated data sets, the

estimated cost of using recommended indexes, and the true number of answers for


Figure 3.3: Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin,

Naive, and the Proposed Index Selection Technique using Relaxed Constraints

the set of queries. For comparative purposes, costs for indexes proposed by AutoAd-

min and a naive algorithm are also provided. The naive algorithm uses indexes for

the ﬁrst n most frequent itemsets in the query patterns, where n is the number of

indexes suggested by the proposed algorithm. Note that the cost for sequential scan

is discounted to account for the fact that sequential access is signiﬁcantly cheaper

than random access. A factor of 10 is applied as the ratio of random access cost

to sequential access cost. While the actual ratio of a given environment is variable,

regardless of any reasonable factor used, the graph will show the cost of using our

indexes much closer to the ideal number of page accesses than the cost of sequential

scan.

Table 3.4 compares the proposed index selection algorithm with relaxed con-

straints against SQL Server’s index selection tool, AutoAdmin [15], using the data

sets and query workloads indicated.


Data Set/Workload   Tool        Analysis Time(s)   % Queries Improved   Number of Indexes
stock/clinical      AutoAdmin   450                100                  23
stock/clinical      Proposed    110                100                  18
stock/hr            AutoAdmin   338                100                  20
stock/hr            Proposed    160                100                  16
mlb/clinical        AutoAdmin   15                 0                    0
mlb/clinical        Proposed    522                87                   16

Table 3.4: Comparison of proposed index selection algorithm with AutoAdmin in
terms of analysis time and % queries improved

For the stock dataset using both the clinical and hr workloads, both algorithms

suggest indexes which will improve all of the queries. Since the selectivity of these

queries is low (the queries return a low number of matches using any index that

contains a queried attribute), the amount of the query improvement will be very

similar using either recommended index set. The proposed algorithm executes in

less time and generates indexes which are more representative of the query patterns

in the workload and allow for greater subspace pruning. For example, the most

common query in the clinical workload is to query over both attributes 11 and 44

together. SQL Server’s index recommendations are all single-dimension indexes for

all attributes that appear in the workload. However, our ﬁrst index recommendation

is a 2-dimension index built over attributes 11 and 44.

For the mlb dataset, SQL Server quickly recommended no indexes. Our index

selection takes longer in this instance, but finds indexes that improve 87% of the

queries. These are all the queries that have selectivities low enough that an index

can be beneﬁcial. It should be noted that the indexes selected by AutoAdmin are

appropriate based on the cost models for the underlying index structure used in


SQL Server. Since the underlying index type for SQL Server is B+-trees, which do

not index multiple dimensions concurrently, the single dimension indexes that are

recommended are the least costly possible indexes to use for the query set. For this

particular experiment, a single dimensional index is not able to prune the subspace

enough to make the application of that index worthwhile compared to sequential scan.

Eﬀect of Support and Conﬁdence

Table 3.5 presents results on analysis complexity and expected query improvement

as support and confidence are varied. The results are shown for the stock data

set over the clinical query workload. Results show the total number of query-index

pairs analyzed over the query set in the index selection loop and the total estimated

improvement in query performance in terms of data objects accessed over the query

set as the index selection parameters vary. As confidence decreases, we maintain

fewer potential indexes in P and need to analyze fewer attribute sets per index. This

decreases analysis time but shows very little eﬀect on overall performance.

Using a conﬁdence level of 0% is equivalent to using the maximal itemsets of at-

tributes that meet support criteria as recommended indexes. For this example, the

strategy yields nearly identical estimated cost although only 34-44% of the query/in-

dex pairs need to be evaluated.

Baseline Online Indexing Results

A number of experiments were performed to demonstrate the eﬀectiveness and

characteristics of the adaptive system. In each of the experiments that show the

performance of the adaptive system over a sequence of queries, an initial set of indexes

is recommended based on the ﬁrst 100 queries given the stated analysis parameters.


              Query/Index Pairs        Total Improvement
Support       (Confidence 100/50/0)    (Confidence 100/50/0)
2             229   229   90           57710   57710   57603
4             198   198   85           55018   55018   55001
6             185   185   73           46924   46924   46907
8             178   178   66           42533   42533   42516
10            170   170   58           37527   37527   37510

Table 3.5: Comparison of Analysis Complexity and Query Performance as support
and confidence vary, stock dataset, clinical workload

New queries are evaluated, and depending on the parameters of the adaptive system,

changes are made to the index set. These index set changes are considered to take place

before the next query. The estimated cost of the last 100 queries given the indexes

available at the time of the query are accumulated and presented. This demonstrates

the evolving suitability of the current index set to incoming queries.

Figures 3.4 through 3.6 show a baseline comparison between using the query

pattern change detection and modifying the indexes and making no change to the

initial index set. A number of data set and query history combinations are examined.

The baseline parameters used are 5% support, 128 bits for each multi-dimension

histogram entry, a window size of 100, an indexing constraint of 10 indexes, a major

change threshold of 0.9 and a minor change threshold of 0.95.

Figure 3.4 shows the comparison for the random data set using the synthetic query

workload. At query 200, this workload drastically changes, and this is evident in the

query cost for the static system. The performance gets much worse and does not get

better for the static system. For the online system, the performance degrades a little


Figure 3.4: Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, synthetic Query Workload

when the query patterns are changing and then improves again once the indexes have

changed to match the new query patterns.

The synthetic query workload was generated speciﬁcally to show the eﬀect of a

changing query pattern. Figure 3.5 shows a comparison of performance for the random

data set using the clinical query workload. This real query workload also changes

substantially before reverting back to query patterns similar to those on which the

static index selection was performed. Performance for the static system degrades

substantially when the patterns change until the point that they change back. The

adaptive system is better able to absorb the change.


Figure 3.5: Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, clinical Query Workload


Figure 3.6: Baseline Comparative Cost of Adaptive versus Static Indexing, stock

Dataset, hr Query Workload

Figure 3.6 shows the comparison for the stock data set using the ﬁrst 2000 queries

of the hr query workload. The adaptive system shows consistently lower cost than

the static system and in places signiﬁcantly lower cost for this real query workload.

The eﬀect of range queries was also explored. The random data set was examined

using the synthetic workload with ranges instead of points. As the range for a given
attribute increased from a point value to 30% selectivity, the overall cost saved by
the top 3 proposed indexes (which were the index sets {1,2,3,4}, {5,6,7}, and {8,9}
for all ranges) decreased. This is due to the increase in the size of the answer


set. When the range for each attribute reached 30%, no index was recommended

because the result set became large enough that sequential scan was a more eﬀective

solution.

Online Index Selection versus Parametric Changes

Changing online index selection parameters changes the adaptive index selection

in the following ways:

Support - decreasing the support level has the eﬀect of increasing the potential

number of times the index set will be recalculated. It also makes the recalculation of

the recommended indexes themselves more costly and less appropriate for an online

setting. Figure 3.7 shows the eﬀect of changing support on the online indexing results

for the random data set and synthetic query workload, while Figure 3.8 shows the

eﬀect on the stock data set using the ﬁrst 600 queries of hr query workload. These

graphs were generated using the baseline parameters and varying support levels. As

expected, lower support levels translate to lower initial costs because more of the ini-

tial query set are covered by some beneﬁcial index. For the synthetic query workload,

when the patterns changed, but do not change again (e.g. queries 100-200 in Figure

3.7), lower support levels translated to better performance. However, for the hr query

workload, the 10% support threshold yields better performance than the 6% and 8%

thresholds for some query windows. The frequent itemsets change more frequently

for lower support levels, and these changes dictate when decisions occur. These runs

made decisions at points that turned out to be poor decisions, such as eliminating

an index that ended up being valuable. Little improvement in cost performance is

achieved between 4% support and 2% support. Over-sensitive control feedback can


Figure 3.7: Comparative Cost of Online Indexing as Support Changes, random

Dataset, synthetic Query Workload

degrade actual performance, independent of the extra overhead that oversensitive

control causes.

Online Indexing Control Feedback Decision Thresholds - increasing the

thresholds decreases the response time for effecting change when query pattern change
does occur. It also has the effect of increasing the number of times the costly reanalysis

occurs. In a real system, one would need to balance the cost of continued poor

performance against the cost of making an index set change. Figure 3.9 shows the

eﬀect of varying the major change threshold in the coarse control loop for the random


Figure 3.8: Comparative Cost of Online Indexing as Support Changes, stock Dataset,

hr Query Workload


data set and synthetic query workload. Figure 3.10 shows the eﬀect of changing

the major change threshold for the stock data set and hr query workload. These

graphs were generated using the baseline parameters (except that the random graph

uses a support of 10%), and only varying the major change threshold in the coarse

control loop. The major change threshold is varied between 0 and 1. Here, a value

of 0 translates to never making a change, and a value of 1 means making a change

whenever improvement is possible. The graphs show no real benefit once the major
change threshold (and therefore the frequency of major changes) increases beyond 0.6.

Figure 3.9 shows that a value of 1 or 0.9 is best in terms of response time when a

change occurs, but a value of 0.3 shows the best performance once the query patterns

have stabilized. This indicates that this value should be carefully tuned based on the

expected frequency of query pattern changes.

Multi-dimensional Histogram Size - increasing the number of bits used to

represent a bucket in the multi-dimensional histogram improves analysis accuracy at

the cost of histogram generation time and space. Experiments demonstrated that

if the representation quality of the histogram was suﬃcient, very little beneﬁt was

achieved through greater resolution. Figure 3.11 shows an example of the eﬀect of

varying the multi-dimensional histogram size. Using a 32 bit size for each unique

histogram bucket yields higher costs than by using greater histogram resolution. One

reason for this is that artiﬁcially higher costs are calculated because of the lower

histogram resolution. As well, some beneﬁcial index changes are not recommended

because of these overly conservative cost estimates.


Figure 3.9: Comparative Cost of Online Indexing as Major Change Threshold

Changes, random Dataset, synthetic Query Workload


Figure 3.10: Comparative Cost of Online Indexing as Major Change Threshold

Changes, stock Dataset, hr Query Workload


Figure 3.11: Comparative Cost of Online Indexing as Multi-Dimensional Histogram

Size Changes, random Dataset, synthetic Query Workload


3.5 Conclusions

A ﬂexible technique for index selection is introduced that can be tuned to achieve

diﬀerent levels of constraints and analysis complexity. A low constraint, more complex

analysis can lead to more accurate index selection over stable query patterns. A more

constrained, less complex analysis is more appropriate to adapt index selection to

account for evolving query patterns. The technique uses a generated multi-dimension

histogram to estimate cost, and as a result is not coupled to the idiosyncrasies of

a query optimizer, which may not be able to take advantage of knowledge about

correlations between attributes. Indexes are recommended in order to take advantage

of multi-dimensional subspace pruning when it is beneﬁcial to do so.

These experiments have shown great opportunity for improved performance using

adaptive indexing over real query patterns. A control feedback technique is introduced

for measuring performance and indicating when the database system could beneﬁt

from an index change. By changing the threshold parameters in the control feedback

loop, the system can be tuned to favor analysis time or pattern change recognition.

The foundation provided here will be used to explore this tradeoﬀ and to develop

an improved utility for real-world applications. The proposed technique aﬀords the

opportunity to adjust indexes to new query patterns. A limitation of the proposed

approach is that if index set changes are not responsive enough to query pattern

changes, then the control feedback may not effect positive system changes. However,

this can be addressed by adjusting control sensitivity or by changing control sensitivity

over time as more knowledge is gathered about the query patterns.


From initial experimental results, it seems that the best application for this ap-

proach is to apply the more time consuming no-constraint analysis in order to de-

termine an initial index set and then apply a lightweight and low control sensitivity

analysis for the online query pattern change detection in order to avoid or make the

user aware of situations where the index set is not at all eﬀective for the new incoming

queries.

In this thesis, the frequency of index set change has been aﬀected through the use

of the online indexing control feedback thresholds. Alternatively, the frequency of

performance monitoring could be adjusted to achieve similar results and to appropri-

ately tune the sensitivity of the system. This monitoring frequency could be in terms

of either a number of incoming queries or elapsed time.

The proposed online change detection system utilized two control feedback loops in

order to diﬀerentiate between inexpensive and more time consuming system changes.

In practice, the ﬁne grain control threshold was not triggered unless we contrived a

situation such that it would be triggered. The kinds of low cost changes that this

threshold would trigger are not the kinds of changes that make enough impact to

be that much better than the existing index set. This would change if the indexing

constraints were very low, and one potential index is now more valuable than a

suggested index.

Index creation is quite time-consuming. It is not feasible to perform real-time

analysis of incoming queries and generate new indexes when the patterns change.

Potential indexes could be generated prior to receiving new queries, and when indi-

cated by online analysis, moved to active status. This could mean moving an index

from local storage to main memory, or from remote storage to local storage depending


on the size of the index. Additionally, the analysis could prompt a server to create a

potential index as the analysis becomes aware that such an index is useful, and once

it is created, it could be called on by the local machine.


CHAPTER 4

Multi-Attribute Bitmap Indexes

The signiﬁcant problems we have cannot be solved at the same level of

thinking with which we created them. - Albert Einstein (attributed).

An access structure built over multiple attributes aﬀords the opportunity to elim-

inate more of a potential result set within a single traversal of the structure than a

single attribute structure. However, this added pruning power comes with a cost.

Query processing becomes more complex as the number of potential ways in which

a query can be executed increases. The multi-attribute access structure may not

be eﬀective for general queries as a multi-attribute index can only prune results for

those attributes that match selection criteria of the query. This chapter presents

multi-attribute bitmap indexes. It describes the changes required when modifying a

traditionally single attribute access structure to handle multiple attributes in terms

of enhancing the query processing model and adding cost-based index selection.

4.1 Introduction

Bitmap indexes are widely used in data warehouses and scientiﬁc applications to

eﬃciently query large-scale data repositories. They are successfully implemented in

commercial Database Management Systems, e.g., Oracle [2], IBM [46], Informix [33],


Sybase [20], and have found many other applications such as visualization of scientiﬁc

data [56]. Bitmap indexes can provide very eﬃcient performance for point and range

queries thanks to fast bit-wise operations supported by hardware.

In their simplest and most popular form, equality encoded bitmaps, also known
as projection indexes, are constructed by generating a set of bit vectors where a '1'
in the i-th position of a bit vector indicates that the i-th tuple maps to the attribute
value associated with that particular bit vector. In general, the mapping can be used

to encode a speciﬁc value, category, or range of values for a given attribute. Each

attribute is encoded independently.

Queries are then executed by performing the appropriate bit operations over the

bit vectors needed to answer the query. A range query over a value-encoded bitmap

can be answered by ORing together the bit vectors that correspond to the range

for each attribute, and then ANDing together these intermediate results between

attributes. The resultant bit vector has set bits for the tuples that fall within the

range queried.
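Using Python integers as stand-in bit vectors, the OR-then-AND evaluation just described can be sketched as:

```python
def range_query(bitmaps, ranges):
    """Answer a range query over equality-encoded bitmaps (sketch).

    bitmaps: {attr: {value: int}} where each int is a bit vector
             (bit i set => tuple i has that value for attr).
    ranges:  {attr: (lo, hi)} inclusive value ranges per queried attribute.
    """
    result = None
    for attr, (lo, hi) in ranges.items():
        # OR together the bit vectors covering the range for this attribute...
        matched = 0
        for value, vec in bitmaps[attr].items():
            if lo <= value <= hi:
                matched |= vec
        # ...then AND the per-attribute results together across attributes.
        result = matched if result is None else result & matched
    return result
```

The set bits of the returned vector identify the tuples that fall within the queried range, exactly as in the text.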

One disadvantage of bitmap indexes is the amount of space they require. Com-

pression is used to reduce the size of the overall bitmap index. General compression

techniques are not desirable as they require bitmaps to be decompressed before ex-

ecuting the query. Run length encoders, on the contrary, allow the query execution

over the compressed bitmaps. The two most popular are Byte-Aligned Bitmap Code

(BBC) [2] and Word Aligned Hybrid (WAH) [57]. The CPU time required to answer
queries is related to the total length of the compressed bitmaps used in the query

computation.


These traditional bitmap indexes are analogous to single dimension indexes in

traditional indexing domains. However, just as traditional multi-attribute indexes

can be more eﬀective in pruning data search space and reducing query time, multi-

attribute bitmap indexes could be applied to more eﬃciently answer queries. This

cost savings comes from two sources. First, the number of bitwise operations is

potentially reduced as more than one attribute is evaluated simultaneously. Second,

the bitvectors built over two or more attributes are sparser, which allows for greater

compression. The resulting smaller bitmaps make bitmap processing faster, which
translates into faster query execution time.

As an example, consider a 2-dimensional query over a large dataset. In order to

avoid a sequential scan over the data, we build an index access structure that can direct

the search by pruning the space. If single dimensional indexes, such as B+-Trees or

projection indexes, are available over the two attributes queried, the query would ﬁnd

candidates for each attribute and then combine the candidate sets to obtain the ﬁnal

answer. In the case of B+-Trees, the set intersection operation is used to combine

the sets. In the case of projection indexes, the AND operation is used to combine the

bitmaps. However, if a 2-dimensional index such as an R-tree or Grid File exists over

the two queried attributes, then the data space can be pruned simultaneously over

both dimensions. This yields a smaller candidate set if the two indexed dimensions are
not highly correlated. Also, no separate set intersection operation is required. Similarly, a

multi-dimensional bitmap over a combination of attributes and values identiﬁes those

records that match the query combination and avoids a further bit operation in order

to answer the query criteria.


In this thesis, the use of Multi-Attribute Bitmaps is proposed to improve the

performance of queries in comparison with traditional single attribute indexes. The

goals of this research are to:

• Evaluate the query performance of a multi-attribute bitmap enhanced data
management system compared to the current bitmap index query resolution model.

• Support ad hoc queries where information about expected queries is not
available, and support the experimental analysis provided herein. We propose an

expected cost-improvement based technique to evaluate potential attribute set

combinations.

• Given a selection query over a set of attributes and a set of single and multiple

attribute bitmaps available for those attributes, determine the bitmap query

execution steps that result in the best query performance with minimal overhead.

The proposed approaches to attribute combination and query processing are based

on measuring the expected query cost savings of options. Although the techniques

incorporate heuristics in order to control analysis search space and query execution

overhead, experiments show signiﬁcant opportunity for query execution performance

improvement.

4.2 Background

Bitmap Compression. Bitmap tables need to be compressed to be eﬀective

on large databases. The two most popular compression techniques for bitmaps are

the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) code


133 bits         1,20*0,4*1,78*0,30*1
31-bit groups    1,20*0,4*1,6*0 | 62*0 | 10*0,21*1 | 9*1
groups in hex    400003C0  00000000  00000000  001FFFFF  000001FF
WAH (hex)        400003C0  80000002  001FFFFF  000001FF

Figure 4.1: A WAH bit vector.

[59]. Unlike traditional run length encoding, these schemes mix run length encoding

and direct storage. BBC stores the compressed data in Bytes while WAH stores it in

Words. WAH is simpler because it only has two types of words: literal words and ﬁll

words. The most signiﬁcant bit indicates the type of word we are dealing with. Let

w denote the number of bits in a word; the lower (w−1) bits of a literal word contain

the bit values from the bitmap. If the word is a ﬁll, then the second most signiﬁcant

bit is the ﬁll bit, and the remaining (w −2) bits store the ﬁll length. WAH imposes

the word-alignment requirement on the ﬁlls. This requirement is key to ensure that

logical operations only access words.

Figure 4.1 shows a WAH bit vector representing 133 bits. In this example, we

assume 32 bit words. Under this assumption, each literal word stores 31 bits from the

bitmap, and each ﬁll word represents a multiple of 31 bits. The second line in Figure

4.1 shows how the bitmap is divided into 31-bit groups, and the third line shows the

hexadecimal representation of the groups.

The last line shows the values of the WAH words. Since the first and third groups
do not consist of a run of a single bit value, they are represented as

literal words (a 0 followed by the actual bit representation of the group). The fourth

group is less than 31 bits and thus is stored as a literal. The second group contains a

multiple of 31 0’s and therefore is represented as a ﬁll word (a 1, followed by the 0 ﬁll


B1 400003C0 80000002 001FFFFF 000001FF

B2 00000100 C0000003 00001F08

B1 AND B2 00000100 80000002 001FFFFF 00000108

Figure 4.2: Query Execution Example over WAH compressed bit vectors.

value, followed by the number of ﬁlls in the last 30 bits). The ﬁrst three words are

regular words, two literal words and one ﬁll word. The ﬁll word 80000002 indicates a

0-ﬁll of two-word long (containing 62 consecutive zero bits). The fourth word is the

active word; it stores the last few bits that could not fill a regular word. Another
word with the value nine, not shown, is needed to store the number of

useful bits in the active word. Logical operations are performed over the compressed

bitmaps resulting in another compressed bitmap.
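As an illustration of the encoding just described, the following sketch (my own, assuming 32-bit words and ignoring fill-count overflow) reproduces the WAH words of Figure 4.1:

```python
def wah_encode(bits):
    """Sketch of WAH encoding with 32-bit words (my own illustration;
    fill-count overflow is ignored). The bit string is split into 31-bit
    groups: non-uniform groups become literal words (MSB 0); runs of
    uniform groups become fill words (MSB 1, next bit = fill value, low
    30 bits = run length in groups). Returns the regular words plus the
    trailing 'active' literal and its useful-bit count."""
    words = []
    i = 0
    while i + 31 <= len(bits):
        g = int(bits[i:i + 31], 2)
        if g in (0, 0x7FFFFFFF):                     # uniform group -> fill
            fill = 1 if g else 0
            if (words and words[-1] >> 31 == 1
                    and (words[-1] >> 30) & 1 == fill):
                words[-1] += 1                       # extend the previous fill
            else:
                words.append((1 << 31) | (fill << 30) | 1)
        else:
            words.append(g)                          # literal word
        i += 31
    active = bits[i:]
    if active:
        words.append(int(active, 2))                 # active word
    return words, len(active)

# The bit vector of Figure 4.1: 1, 20*0, 4*1, 78*0, 30*1
words, nbits = wah_encode('1' + '0' * 20 + '1' * 4 + '0' * 78 + '1' * 30)
```

The two all-zero 31-bit groups collapse into the single fill word 80000002, and the trailing nine 1's form the active word 000001FF.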

Bit operations over WAH-compressed bitmaps have been shown to be 2-20 times faster
than over BBC [57], while BBC gives a better compression ratio. In this thesis
we use WAH to compress the bitmaps due to its impact on query execution times.

Relationship Between Bitmap Index Size and Bitmap Query Execution

Time. Using WAH compression, queries are executed by performing logical opera-

tions over the compressed bitmaps which result in another compressed bitmap. As

an example, consider the two WAH compressed bitvectors in Figure 4.2.

First, the ﬁrst word of each bitvector is decoded, and it is determined that they

are both literal words. Therefore, the logical AND operation is performed over the

two words and the resultant word is stored as the ﬁrst word in the result bitmap.

When the next word is read from both bitmaps, we have two ﬁll words. The ﬁll word

from B1 is a zero-ﬁll word compressing 2 words, and the ﬁll word from B2 is a one-ﬁll


word compressing 3 words. In this case we operate over 2 words simultaneously by

ANDing the ﬁlls and appending a ﬁll word into the result bitmap that compresses

two words (the minimum between the word counts). Since we still have one word

left from B2, we only read a word from B1 this time and AND it with the all-1s fill

word from B2. Finally, the last two words are read from the two bitmaps and are

ANDed together as they are literal words. A full description of the query execution

process using WAH can be found in [59]. Query execution time is proportional to

the size of the bitmaps involved in answering the query [56]. To illustrate the effect
of bit density, and consequently bitmap size, on query times over compressed bitmaps,
we performed the following experiment. We generated 6 uniformly random bitmaps

representing 2 million entries for a number of diﬀerent bit densities and ANDed the

bitmaps together. Figure 4.3 shows average query time against the total bitmap

length of the bitmaps used as operands in the series of operations.

The expected size of a random bitmap with N bits compressed using WAH is

given by the formula [59]:

m_R(d) ≈ (N / (w − 1)) · (1 − (1 − d)^(2w−2) − d^(2w−2))

where w is the word size and d is the bit density.
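A quick check of this formula (a direct transcription; the function name `wah_expected_words` is mine) confirms that sparse bitmaps are expected to compress far better, and that the expression is symmetric in d and 1 − d:

```python
def wah_expected_words(N, d, w=32):
    """Expected WAH-compressed size, in words, of a random bitmap with N
    bits and bit density d (the formula above; the function name is mine)."""
    return (N / (w - 1)) * (1 - (1 - d) ** (2 * w - 2) - d ** (2 * w - 2))

# For the 2 million-entry bitmaps of the experiment, expected size grows
# steadily with bit density up to d = 0.5.
N = 2_000_000
sparse, mid, dense = (wah_expected_words(N, d) for d in (0.001, 0.01, 0.5))
```

At d near 0.5 both power terms vanish and the expected size approaches the incompressible bound N/(w − 1) words.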

The lower the bit density, the more quickly and more eﬀectively intermediate

results can be compressed as the number of AND operations increases. Figure 3 shows

a clear relationship between query times and bitmap length. The diﬀerent slopes of

the curve are due to changes in frequency of the type of bit operations performed

over the word size chunks of the bitmap operands. At low bit densities, there is a

greater portion of ﬁll word to ﬁll word operations. As the bit density increases, more

ﬁll word to literal word operations occur. At high bit densities, literal word to literal


Figure 4.3: Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps, 2M

entries

word operations become more frequent. As these inter-word operations have diﬀerent

costs, the overall query-time-to-bitmap-length relationship is not linear. However, it

is clear that query execution time is related to overall bitmap length.

Relationship Between Number of Bitmap Operations and Bitmap Query

Execution Time. Figure 4.4 shows AND query times over WAH-compressed bitmaps

as the number of bitmaps ANDed varies. Again the bitmaps are randomly generated

with the reported bit densities for 2 million entries, and the number of bitmaps
involved in an AND query is varied. There is a clear relationship between query times and

the number of bitmap operations.


Figure 4.4: Query Execution Times vs Number of Attributes Queried, Point Queries,
2M entries

Since query performance is related to the total bitmap length and number of

bitmap operations, a logical question is "can we modify the representation of the data
in such a way that we can decrease these factors?" One could combine attributes and

generate bitmaps over multiple attributes in order to accomplish this. This is because

a combination over multiple attributes has a lower bit density and is potentially more

compressible, and already has an operation that may be necessary for query execution

built into its construction. However, several research questions arise in this scenario

such as which attributes should be combined, how can the additional indexes be

utilized, and what is the additional burden added by the richer index sets.


Technical Motivation for Multi-Attribute Bitmaps. Figure 4.5 shows a

simple example combining two single attributes, each with cardinality 2, into one 2-attribute

bitmap with cardinality 4. In the table, the combined column labeled b1.b2 means

that the value of attribute 1 maps to b1 and the value for attribute 2 maps to b2.

Such a representation can have the eﬀect of reducing the size of the bitmaps that are

used to answer a query and can also reduce the number of operations. As attributes

are combined, bitcolumns become sparser, and the opportunity for compressing that

column increases, potentially decreasing the size of a bitcolumn. Additionally, if a bit

operation is already incorporated in the multi-attribute bitmap construction, then it

does not need to be evaluated as part of the query execution.

One could think of multi-attribute bitmaps as the AND bitmap of two or more

bitmaps from diﬀerent attributes. One could argue that multi-attribute bitmaps are

reducing the dimensionality of the dataset but increasing the cardinality of the at-

tributes involved in the new dataset. Although bitmap indexes have historically been

discouraged for high-cardinality attributes, the advances in compression techniques

and query processing over compressed bitmaps have made bitmaps an eﬀective solu-

tion for attributes with cardinalities on the order of several hundreds [58].

As an illustrative example of enhancing compression from combining attributes,

consider a bitmap index built over two attributes, each with cardinality 10, where

the data is uniformly randomly distributed independently for each attribute for 5000

tuples. Using WAH compression, there are very few instances where there are long

enough runs of 0’s to achieve any compression (since out of 10 random tuples, we

would expect 1 set bit for a value). In a test matching this scenario, very


          Attribute 1    Attribute 2              Combined
Tuple      b1   b2        b1   b2      b1.b1   b1.b2   b2.b1   b2.b2
t1          1    0         0    1        0       1       0       0
t2          0    1         1    0        0       0       1       0
t3          1    0         1    0        1       0       0       0
t4          1    0         0    1        0       1       0       0
t5          0    1         0    1        0       0       0       1
t6          0    1         1    0        0       0       1       0

Figure 4.5: Simple multiple attribute bitmap example showing the combination of
two single attribute bitmaps.

little compression was achieved and bitmaps representing each of the 20 attribute-

value pairs were between 640 and 648 bytes long.

As an alternative, we can generate a bitmap index for the combination of the two

attributes. Now the cardinality of the combined attributes is 100, and the expected

number of set bits in a run of 100 tuples for a single value is 1. We can expect much

greater compression for this situation. In the given scenario, the generated size of the

100 2-D bitmaps was between 220 and 400 bytes.

Now consider a range query such that attribute 1 must be between values 1 and

2, and attribute 2 must be equal to 3. The operations using the traditional bitmap

processing to answer the query is ((ATT1=1) OR (ATT1=2)) AND (ATT2=3). The

time required is dependent on the total length of the bitmaps involved. Each of

these bitmaps as well as the resultant intermediate bitmap from the OR operation is

648 bytes, for a total of 2592 bytes. For the multi-attribute case, the query can be

answered by (ATT1=1,ATT2=3) OR (ATT1=2,ATT2=3). These bitmaps are 268

and 260 bytes respectively, for a total of 528 bytes. In this scenario, we can reduce


both the size of the bitmaps involved in the query execution and also the number of

bitmaps involved.
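The equivalence of the two plans can be checked with a toy simulation (my own illustration, using uncompressed Python-integer bit vectors rather than WAH):

```python
import random

# Toy check (my own illustration, with uncompressed Python-int bit vectors)
# that the single-attribute plan and the combined-index plan answer the
# range query (1 <= ATT1 <= 2) AND (ATT2 = 3) identically.
random.seed(1)
rows = [(random.randrange(10), random.randrange(10)) for _ in range(5000)]

single = [dict(), dict()]
combined = dict()
for i, (v1, v2) in enumerate(rows):
    single[0][v1] = single[0].get(v1, 0) | (1 << i)
    single[1][v2] = single[1].get(v2, 0) | (1 << i)
    combined[(v1, v2)] = combined.get((v1, v2), 0) | (1 << i)

# Single-attribute plan: two ORed operands, then one AND.
plan1 = (single[0].get(1, 0) | single[0].get(2, 0)) & single[1].get(3, 0)
# Multi-attribute plan: a single OR of two (sparser) operands.
plan2 = combined.get((1, 3), 0) | combined.get((2, 3), 0)
```

Both plans produce the same answer bitmap; the combined index simply reaches it with fewer and sparser operands.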

This example demonstrates the opportunity to improve query execution by rep-

resenting multiple attribute values from separate attributes in a bitmap. In general,

point queries would always beneﬁt from the multi-attribute bitmaps as long as all the

combined attributes are queried. However, if only some of the attributes were queried

or a large range from one or more attributes were queried, then the single attribute

bitmaps would perform better than the multi-attribute bitmaps. This is the reason

why we propose to supplement the single attribute bitmaps with the multi-attribute

bitmaps and not to replace the single attribute bitmaps. Multi-attribute bitmaps

would only be used in the queries that result in an improved query execution time.

Supplementing single attribute bitmaps with additional bitmaps that improve

query execution time has also been done in [50], where lower resolution bitmaps were

used in addition to the single attribute bitmaps or high resolution bitmaps. One could

think of this approach as storing the OR of several bitmaps from the same attribute

together.

4.3 Proposed Approach

In this section we present our proposed solution to the problems described in the

previous section. The primary overall tasks can be summarized as multi-attribute

bitmap index selection and query execution. In index selection, we determine which

attributes are appropriate or the most appropriate to combine together. This can be

performed with varying degrees of knowledge about a potential query set. Once
candidate attribute combinations are determined, the combined bitmaps are generated.


In query execution, the candidate pool of bitmaps is explored to determine a plan

for query execution.

4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries

In many cases the attributes that should be combined together as multi-attribute

bitmaps are obvious given the context of the data and queries. Whenever multiple

attributes are frequently queried together over a single bitmap value, that combination

of attributes is a candidate for multi-attribute indexing. As an example, consider a

database system behind a car search web form. If the form has categorical entries

for color and model, then a multi-attribute bitmap built over these two attributes

can directly ﬁlter those data objects that meet both search criteria without further

processing.

In order to support situations where no contextual information is available about

queries and in order to allow for comparative performance evaluation, we provide a

cost improvement based candidate evaluation system. The goal behind bitmap index

selection is to ﬁnd sets of attributes that, if combined, can yield a large beneﬁt in terms

of query execution time. The analogy within the traditional multi-attribute index

selection domain would be ﬁnding attributes that are not individually eﬀective in

pruning data space, but can eﬀectively prune the data space when combined together.

An eﬀective index selection technique for bitmap indexes should be cognizant of

the relationships between attributes. As well, the selectivity estimated for a given

attribute combination should accurately reﬂect the true data characteristics.

Since query execution time is dependent both on the total size of the operands

involved in a bitmap computation and also the number of bitmap operations, we use


these measures to determine the potential beneﬁt for a given attribute combination.

Algorithm 3 shows how candidate multi-attribute indexes are evaluated and selected

in terms of potential beneﬁt.

Determining Index Selection Candidates

SelectCombinationCandidates (Ordered set of attributes A, dLimit, cardLimit)
notes:
  attSets  - a list of attribute sets and corresponding goodness measures
  selected - the set of recommended attribute sets to build indexes over

 1: attSets = {[{a_i}, 0] | a_i ∈ A}
 2: currentDim = 1
 3: while (currentDim < dLimit AND newSetsAdded)
 4:   unset newSetsAdded
 5:   for each as_i ∈ attSets, s.t. |as_i| = currentDim
 6:     for each attribute a_j, j = (maxAtt(as_i)+1) to |A|
 7:       if (as_i.card * a_j.card < cardLimit)
 8:         as_j = [as_i ∪ {a_j}, detGoodness(as_i, a_j)]
 9:         if goodness(as_j) > goodness(as_i)
10:           add as_j to attSets; set newSetsAdded
11:   currentDim++
12: sort attSets by goodness
13: ca = ∅
14: while ca ≠ A
15:   attribute set as = next set from attSets
16:   if as ∩ ca = ∅
17:     ca = ca ∪ as.attributes
18:     include as in selected

Algorithm 3: Selecting Attribute Combination Candidates.


The algorithm is somewhat analogous to the Apriori method for frequent itemset

mining, in that a given candidate set is only evaluated if its subset is considered
beneficial. The algorithm takes several parameters as input. These include an array, A, of

the attributes under analysis. The parameter dLimit provides a maximum attribute

set combination size. The maximum cardinality allowed for a given prospective com-

bination can be controlled by the cardLimit parameter.

The algorithm evaluates increasingly larger attribute sets (loop starting at line 3).

Each unique combination is evaluated (lines 5-6). Initially all size 1 attribute sets are

combined with each other size 1 attribute set such that each unique size 2 attribute

set is evaluated. In the next iteration of the outer loop, each potentially beneﬁcial

size 2 set is evaluated with relevant size 1 attributes (note that by evaluating only

those singleton sets of attributes numbered larger than the largest in the current set,

we can evaluate all unique set instances). If the number of bitmaps generated by

this combination is excessive (line 7), the set will not be evaluated. The card value

associated with an attribute set is the number of value combinations possible for the

attribute set, and reﬂects the number of bitmaps that would have to be generated to

index based on the set. A goodness measure is evaluated for the proposed combination

(line 8). If the goodness of the combination exceeds the goodness of its ancestor, then

the combination set is added to the list of attribute sets for further evaluation during

the next loop iteration. The term newSetsAdded in line 3 is a boolean condition

indicating if any attribute set was added during the last loop iteration.

Once the loop terminates, the sets in attSets are sorted by their goodness mea-

sures (line 12). Then sets of attributes that are disjoint from the covered attributes,


ca, are greedily selected from the sorted sets and added to the recommended bitmap

indexes.
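The enumeration and greedy selection above can be sketched compactly in Python (my own rendering, not the dissertation's code; the goodness function is passed in as a parameter, standing in for Algorithm 4):

```python
# Sketch of Algorithm 3 (my rendering, not the dissertation's code).
# attrs: ordered attribute ids; card[a]: cardinality of attribute a;
# goodness(base_set, a): scoring function standing in for Algorithm 4.
def select_candidates(attrs, card, goodness, d_limit, card_limit):
    att_sets = {(a,): 0.0 for a in attrs}       # line 1: all singletons, goodness 0
    frontier = list(att_sets)
    dim = 1
    while dim < d_limit and frontier:           # line 3
        new_frontier = []
        for s in frontier:
            prod = 1
            for a in s:
                prod *= card[a]
            for a in attrs:
                if a <= max(s):                 # line 6: extend with larger ids only
                    continue
                if prod * card[a] >= card_limit:  # line 7: too many bitmaps
                    continue
                g = goodness(s, a)              # line 8
                if g > att_sets[s]:             # line 9: keep improving supersets
                    att_sets[s + (a,)] = g
                    new_frontier.append(s + (a,))
        frontier = new_frontier
        dim += 1
    # lines 12-18: greedily pick disjoint sets in descending goodness
    selected, covered = [], set()
    for s in sorted(att_sets, key=att_sets.get, reverse=True):
        if covered.isdisjoint(s):
            selected.append(s)
            covered.update(s)
        if covered == set(attrs):
            break
    return selected

# Hypothetical goodness function favoring the pair (0, 1):
selected = select_candidates(
    [0, 1, 2], {0: 10, 1: 10, 2: 10},
    lambda s, a: 5.0 if (s, a) == ((0,), 1) else -1.0,
    d_limit=3, card_limit=500)
```

With the hypothetical scores above, the pair (0, 1) is recommended and attribute 2 is left as a single-attribute index.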

Estimating Performance Gain for a Potential Index

detGoodness (attribute set as_i, attribute a_j)
 1: set probQT and probSR
 2: goodness = 0
 3: for each bitcolumn bc_k in as_i
 4:   for each bitcolumn bc_j in a_j
 5:     bitcolumn cb = AND(bc_k, bc_j)
 6:     savings = bc_k.length + bc_j.length + bc_k.creationLength − cb.length
 7:     if savings > 0
 8:       goodness += savings * cb.matches
 9: goodness *= (probQT * probSR)^(|as_i|+1)
10: return goodness

Algorithm 4: Determine goodness of including an attribute in a multi-attribute
bitmap.

The goodness of a particular attribute set is computed by Algorithm 4. The

variables, probQT and probSR describe global query characteristics and are used to

estimate diminishing beneﬁts as index dimensionality increases. The probQT pa-

rameter represents the probability that an attribute is queried. As with traditional

multi-attribute indexes, a 2-attribute index is not as useful for a single attribute query.

The probSR parameter is the probability that the query range for an attribute will

be small enough such that a multi-attribute index over the query criteria will still be

beneﬁcial.


Each unique value combination of the entire attribute set (combinations generated

in lines 4-5) is evaluated in terms of bitmap length savings over a corresponding

single attribute bitmap solution. The goodness of an attribute set is measured as the

weighted average of the bitlength savings of each value combination (line 8). The

savings of a particular combination is measured as the sum of the compressed lengths

of the bitmap operands and the compressed length of the bitmaps needed to create

the ﬁrst operand, minus the compressed length of the combined bitmap. For size 1

attribute sets, the bitcolumn creation length is 0. However, for a multiple attribute

bitmap this value is the length of all the constituent bitmaps used to create it. The

total savings reﬂects the diﬀerence in total bitmap operand length to answer the

combination under analysis.

If the savings for a particular combination is negative, then its negative savings

is not accumulated (line 7). This is because this option will not be selected during

the query execution for this combination and will not incur an additional penalty.

Each combination is weighted by its frequency of occurrence. This strategy assumes

that the query space is similar in distribution to the data space. The ﬁnal goodness

measure for an attribute set is also modiﬁed based on the probability that beneﬁt

will actually be realized for the query (line 9).
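A sketch of Algorithm 4 follows (my own rendering; compressed length is approximated here by the number of 0/1 runs in a vector, a crude stand-in for WAH word counts):

```python
def runs_length(bv, n):
    """Crude stand-in for compressed bitmap length: the number of 0/1 runs
    in the n-bit vector (WAH cost grows with the number of runs)."""
    s = format(bv, '0%db' % n)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)

def det_goodness(cols_i, creation_len_i, cols_j, n, probQT, probSR, dim_i):
    """Sketch of Algorithm 4 (my rendering). cols_i: bitcolumns of the
    existing set as_i; creation_len_i: length of the bitmaps used to build
    them; cols_j: bitcolumns of candidate attribute a_j; dim_i = |as_i|."""
    goodness = 0.0
    for bk in cols_i:
        for bj in cols_j:
            cb = bk & bj                                       # line 5
            savings = (runs_length(bk, n) + runs_length(bj, n)
                       + creation_len_i - runs_length(cb, n))  # line 6
            if savings > 0:                                    # line 7
                goodness += savings * bin(cb).count('1')       # line 8: weight by matches
    return goodness * (probQT * probSR) ** (dim_i + 1)         # line 9

g = det_goodness([0b11110000], 0, [0b11000000, 0b00110011], 8, 1.0, 1.0, 1)
```

Any real implementation would substitute actual WAH-compressed lengths for the run-count proxy used here.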

This technique ﬁnds attributes for which their combinations will lead to a reduc-

tion in the number of operations as well as a reduction in the constituent bitmap

lengths. Both of these factors can provide a performance gain depending on the

query criteria. Our approach takes into account the characteristics of the data by

weighting and measuring the eﬀectiveness of speciﬁc inter-attribute combinations.

The index selection solution space is controlled by limiting parameters and through


greedy selection of potential solutions. Although the indexes recommended are not

guaranteed to be optimal, they can be determined eﬃciently. Later experiments show

that the reduction in the number of total operations has a more substantial eﬀect on

query execution time than the reduction of bitmap length (i.e., seemingly less-than-ideal

attribute combinations can still provide performance gains).

The behavior of the algorithm can be adjusted with slight alterations. For example,

if index space is critical, we can order the pairs in line 12 of the Select Combination

Candidates algorithm (Algorithm 3) by goodness over combined attribute size. If the

queries are equally likely to occur anywhere in the data space, then the weighting in
line 8 of the detGoodness algorithm (Algorithm 4) can be removed.

Once bitmap combinations are recommended, each combined bitmap can be gen-

erated by assigning an appropriate identiﬁer to the bitmap and ANDing together the

applicable single attribute bitmaps. Once multi-attribute bitmaps are available in the

system, we need a mechanism to decide when single attribute and/or multi-attribute

bitmaps would be used. In other words, we need to decide on a query execution plan

for each query.

4.3.2 Query Execution with Multi-Attribute Bitmaps

One issue associated with using multiple representations for data indexes is that

the query execution becomes more complex. In order to take advantage of the most

useful representation, the potential execution plans must be evaluated and the op-

tion that is deemed better should be performed. To be eﬀective, the complexity

added by the additional representations cannot be more costly than the benefits

attained by incorporating them. We take a number of measures in order to limit the


overhead associated with selecting the best query execution option. We keep track

of those attribute sets that have bitmap representations (either single attribute or

multi-attribute). We use a unique mapping for bitmap identiﬁcation. We explore

potential query execution options using inexpensive operations.

Global Index Identiﬁcation

In order to identify the potential index search space, we maintain a data structure

of those attributes and combined attributes for which we have bitmap indexes con-

structed. If we have available a single attribute bitmap for the ﬁrst attribute, we will

include a set identiﬁed as [1] in this structure. If we have available a multi-attribute

bitmap built over attributes 1,2, and 3 then we include the set [1,2,3] in the struc-

ture. Upon receiving a query, we compare the set representation of the attributes

in the query to the sets in the index identiﬁcation structure. Those sets that have

non-empty intersections with the query set are candidates for use in query execution.

Bitmap Naming

We need to be able to easily translate a query to the set of bitmaps that match the

criteria of a query. In order to do this we devise a one-to-one mapping that directly

translates query criteria to bitmap coverage. The key is made up of the attributes

covered by the bitmap and the values for those attributes. For example, a multi-

attribute bitmap over attributes 9 and 10 where the value for attribute 9 is 5 and the

value for attribute 10 is 3 would get "[9,10]5.3" as its key. The attributes are always

stored in numerical order so that there is only one possible mapping for various set

orders, and the values are given in the same order as the attributes. Finding relevant

bitmaps is a matter of translating the query into its key to access the bitmaps.
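The key construction can be captured in a few lines (a sketch; the function name `bitmap_key` is mine):

```python
def bitmap_key(values):
    """Build the unique bitmap key described above (a sketch; the function
    name is mine): sorted attribute ids, then the values in the same order."""
    order = sorted(values)
    return ('[' + ','.join(str(a) for a in order) + ']'
            + '.'.join(str(values[a]) for a in order))

key = bitmap_key({10: 3, 9: 5})   # attribute 9 -> 5, attribute 10 -> 3
```

Sorting the attribute ids first guarantees a single canonical key regardless of the order in which the query lists its criteria.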


Query Execution

During query execution we need a lightweight method to check which potential

query execution will provide the fastest answer to the query. The ﬁrst step is to

compare the set of attributes in the query to the sets of attributes for which we have

constructed bitmap indexes. Those indexes which cover an attribute involved in the

query are candidates to be used as part of the solution for the query. Once candidate

indexes are determined, we compute the total length of the bitmaps required to answer

the query.

A recursive function is used to compute the total length of all matching bitmaps

in order to handle arbitrary sized attribute sets. The complete unique hash key is

generated in the base case of the recursive method and the length of the matching

bitmaps is obtained and returned to the caller. For the recursive case, the

unique key is built up and passed to the recursive call, and the callee length results

are tabulated.
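A sketch of such a recursive function (my own; bitmap lengths are supplied in a dictionary keyed by the naming scheme described earlier):

```python
def matching_length(att_set, query, lengths, key='', vals=''):
    """Recursively enumerate the keys of att_set that match the query and
    sum their bitmap lengths (a sketch; names are mine). query[a] lists the
    values requested for attribute a; lengths maps key -> compressed length."""
    if not att_set:                       # base case: key is complete
        return lengths.get('[' + key + ']' + vals, 0)
    a, rest = att_set[0], att_set[1:]
    total = 0
    for v in query[a]:                    # recursive case: extend the key
        total += matching_length(rest, query, lengths,
                                 key + (',' if key else '') + str(a),
                                 vals + ('.' if vals else '') + str(v))
    return total

# Lengths from the running example (single-attribute bitmaps 648 bytes,
# combined bitmaps 268 and 260 bytes):
lengths = {'[1]1': 648, '[1]2': 648, '[2]3': 648,
           '[1,2]1.3': 268, '[1,2]2.3': 260}
query = {1: [1, 2], 2: [3]}
```

For the example query, the multi-attribute total (268 + 260 = 528 bytes) is well below the single-attribute total for attribute 1 alone (1296 bytes), so the combined index wins.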

We just need to sum the lengths of bitmaps that match the query criteria, which

we can compute using the recursive function. For example, if the total length of

matching bitmaps for the multi-attribute bitmap over attributes [1,2] is less than the

total length for both attributes [1] and [2], then the multi-attribute bitmap should be

used to answer the query. However, if the multi-attribute bitmaps have greater total

length, then the single attribute bitmaps should be used.

The logic applied to determine the query execution plan needs to be lightweight

so as not to add too much overhead in handling the increased complexity of bitmap
representation. Rather than incorporating logic that guarantees an optimal minimal

total bitmap length, we approximate this behavior using heuristics. We order the


potential attribute sets in descending order based on the amount of length saved by

using the index. As a baseline, all potential single attribute indexes get a value of

0 for the amount of length saved. Multi-attribute indexes get a value which is the

diﬀerence between the length of the single attribute indexes that would be required

to answer the query and the length of the multi-attribute index to answer the query.

Then we process the query by greedily selecting non-covered attribute sets from this

order until the query is answered. Those multi-attribute indexes that have positive

beneﬁt compared to single attribute indexes will have positive value, and will be

processed before any single attribute index. Those that have negative values, would

actually slow down the query, and will not be processed before their constituent single

attribute indexes.

Algorithm 5 describes the procedure for executing selection queries when the entire

bitmap index has both single attribute and multi-attribute bitmap indexes available to

process the query. It involves two major steps. In the ﬁrst step, the potential indexes

are prioritized based on how promising they are with respect to saving total bitmap

length during query execution. The second step performs the bit operations required

to answer the query. The algorithm assumes that the bitmap indexes available are

identiﬁed and that we can access any bitcolumn through a unique hashkey.

In lines 1-5, we compute the total size of the bitmaps that match the query

considering those attributes that intersect between the query and the index attribute

set currently under analysis. For example, consider a query over values 1 and 2 for

attribute 1 and over value 3 for attribute 2, where we have bitmaps available for

the attribute sets [1], [2], and [1,2] combined. Table 4.1 shows the query matching

keys for each of the potential indexes. At the end of this section, we will have a total


SelectionQueryExecution (Available Indexes AI,

BitColumn Hashtable BI, query q)

notes:

AI - contains identiﬁcation of sets of indexed attributes,

with length and saved ﬁelds and query matching

keys list

BI - provides access to bitcolumn using unique key

B1 - a bitcolumn of all 1’s

// Prioritize Query Plan

1: for each attribute set as in AI

2: if as.attributes ∩ q.attributes = ∅

3: generate query matching keys for as

4: attach query matching keys to as

5: sum query matching bitmap lengths

6: for each attribute set as in AI

7: if as.numAtts == 1

8: as.saved = 0

9: else

10: for each size 1 attribute subset a in as

11: as.saved += a.length

12: as.saved -= as.length

13: sort AI by saved ﬁeld

// Execute Query

14: attribute set ca = null

15: queryResult = B1

16: while ca = q.attributes

17: as = next attribute set from AI

18: if as.attributes ∩ ca = ∅

19: ca = ca ∪ as.attributes

20: subresult = BI.get(as.keys[0])

21: for i = 2 to |as.keys|

22: subresult = OR(subresult,BI.get(as.keys[i]))

23: queryResult = AND(queryResult, subresult)

24: return queryResult

Algorithm 5: Query Execution for Selection Queries when Combined Attribute

Bitmaps are Available.
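As a concrete illustration, Algorithm 5 can be sketched in Python under some simplifying assumptions: bitmaps are modeled as plain Python integers (one bit per record), compressed bitmap length is approximated by a byte count, and every available index is assumed to cover only query attributes. The names (`run_query`, `plans`) are illustrative; this is not the dissertation's implementation.

```python
from itertools import product

def query_keys(attrs, query):
    """All value combinations of `attrs` that satisfy the query."""
    order = tuple(sorted(attrs))
    return [(order, vals)
            for vals in product(*(sorted(query[a]) for a in order))]

def run_query(indexes, bitcolumns, query):
    """indexes: iterable of frozensets of attribute ids; bitcolumns:
    {(sorted_attrs, values): bitmap-as-int}; query: {attr: set of values}."""
    q_attrs = frozenset(query)
    # Lines 1-5: matching keys and total matching bitmap length per index.
    plans = []
    for attrs in indexes:
        if not (attrs and attrs <= q_attrs):
            continue
        keys = query_keys(attrs, query)
        length = sum((bitcolumns[k].bit_length() + 7) // 8 for k in keys)
        plans.append({'attrs': attrs, 'keys': keys, 'length': length})
    # Lines 6-12: length saved versus the single-attribute alternatives.
    single_len = {next(iter(p['attrs'])): p['length']
                  for p in plans if len(p['attrs']) == 1}
    for p in plans:
        p['saved'] = (sum(single_len[a] for a in p['attrs']) - p['length']
                      if len(p['attrs']) > 1 else 0)
    plans.sort(key=lambda p: p['saved'], reverse=True)   # line 13
    # Lines 14-24: execute, most promising indexes first.
    covered, result = set(), None   # None stands in for B1 (all 1's)
    for p in plans:
        if covered >= q_attrs:
            break
        if p['attrs'] & covered:    # attributes already covered: skip
            continue
        covered |= p['attrs']
        sub = 0
        for k in p['keys']:         # OR the query-matching bitcolumns
            sub |= bitcolumns[k]
        result = sub if result is None else result & sub
    return result
```

Starting from `None` rather than an explicit all-1's bitmap avoids needing the record count; the first subresult simply becomes the running result.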


query matching bitmap length for each available index. Note that the computation

of bitmap length in line 5 is not actually a simple sum. The bit density of each

matching bitmap is estimated based on its length. Since the matching bitmaps will

not overlap, these bit densities are summed to arrive at an eﬀective bit density of the

resultant OR bitmap. Length of the resultant bitmap is then derived using eﬀective

bit density. If the bit density is within the range of about 0.1 to 0.9, its value cannot be reliably determined, because compressed length is nearly flat over that range. In these cases, we conservatively estimate the density as 0.5. This does not adversely affect query operation, since in such cases the multi-attribute option would not be selected anyway.
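The length-to-density estimate described above can be sketched as follows, using the expected WAH-compressed size of a random bitmap (the model analyzed by Wu, Otoo, and Shoshani). The dissertation does not spell out its exact estimator, so the inversion below is an assumed form; in the flat middle region the inverse is ill-defined, which is why 0.5 is returned.

```python
def wah_words(n_bits, d, w=32):
    """Expected WAH word count for a random n_bits bitmap of density d."""
    g = w - 1                               # bits per literal word
    return n_bits / g * (1 - (1 - d) ** (2 * g) - d ** (2 * g))

def density_from_length(n_bits, words, w=32):
    """Invert wah_words for sparse bitmaps by bisection on [0, 0.1].
    In the flat region (~0.1-0.9) length barely varies with density,
    so we return the conservative estimate 0.5."""
    if words >= wah_words(n_bits, 0.1, w):
        return 0.5
    lo, hi = 0.0, 0.1
    for _ in range(60):                     # wah_words is increasing here
        mid = (lo + hi) / 2
        if wah_words(n_bits, mid, w) < words:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def estimated_or_length(n_bits, lengths, w=32):
    """Matching bitmaps of one attribute are disjoint, so their densities
    add; the OR result's length is re-derived from the summed density."""
    d = min(1.0, sum(density_from_length(n_bits, L, w) for L in lengths))
    return wah_words(n_bits, d, w)
```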

Available Index    Matching Keys
[1]                [1]1, [1]2
[2]                [2]3
[1, 2]             [1,2]1.3, [1,2]2.3

Table 4.1: Sample Query Matching Keys for Query Attributes 1 and 2.

In lines 6-12, we use this initial information in order to compute the total bitmap

length saved by using a multi-attribute bitmap in relation to using the relevant single

attribute alternatives. For a multiple attribute index, we sum up the query matching

bitmap length associated with each single attribute subset of the multi-attribute

attribute set. The diﬀerence between this length and the total length of the query

matching bitmaps for the multi-attribute set is the amount of bitmap length saved by

using the multi-attribute set. All single attribute indexes are assigned a saved value

of 0. Those multi-attribute indexes that reﬂect a cost savings will get a positive value


for saved, while those that are more costly will get a negative value. In line 13, we

sort the potential indexes in terms of this saved value.
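For example, with hypothetical compressed lengths (in bytes) for the Table 4.1 indexes, the saved computation of lines 6-12 reduces to:

```python
# Made-up compressed lengths for the Table 4.1 scenario, in bytes:
# each entry is the total length of that index's query-matching bitmaps.
lengths = {('1',): 900,        # [1]1 + [1]2
           ('2',): 700,        # [2]3
           ('1', '2'): 400}    # [1,2]1.3 + [1,2]2.3
saved = {a: (sum(lengths[(x,)] for x in a) - lengths[a] if len(a) > 1 else 0)
         for a in lengths}
# saved[('1','2')] = 900 + 700 - 400 = 1200, so the pair index sorts first.
```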

In lines 14 and 15, we initialize a set that indicates the attributes covered by the query so far and initialize the query result to all 1's. We then process the query in order of how promising the prospective indexes are, continuing until all query attributes are covered (line 16). We get the next best index and check whether its attributes have already been covered (lines 17-18). If not, we process a subquery using that index: we update the covered attribute set and OR together all bitcolumns that match the query values relevant to the index (lines 19-22). We then AND this subresult into the overall query result (line 23). When all query attributes are covered, the query is complete and we return the result.

4.3.3 Index Updates

Bitmap indexes are typically used in write-once read-mostly databases such as

data warehouses. Appends of new data are often done in batches and the whole index

is rebuilt from scratch once the update is completed. Bitmap updates are expensive

as all columns in the bitmap index need to be accessed. In [12], we propose the use of

a padword to avoid accessing all bitmaps to insert a new tuple but only to update the

bitmaps corresponding to the inserted values. In other words, the number of bitmaps

that need to be updated when a new tuple is appended in a table with A attributes

is A as opposed to Σ

A

i=1

c

i

, where c

i

is the number of bitmap values (or cardinality)

for attribute A

i

. This approach, which clearly outperforms the alternative, is suitable

when the update rate is not very high and we need to maintain the index up-to-date

in an online manner.


Updating multi-attribute bitmaps is even more efficient than updating traditional bitmaps. If multi-attribute bitmaps are also padded, we only need to access the bitmaps corresponding to the combined values that are being inserted. For example, consider the previous table with A attributes where multi-attribute bitmaps are built over pairs of attributes: the cost of updating the multi-attribute bitmaps is half the cost of updating the traditional bitmaps, as only A/2 bitmaps need to be accessed.
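A small sketch of these append costs, using the cardinalities of the 12-attribute synthetic data set described in Section 4.4.1:

```python
# Bitmaps touched per appended tuple under each scheme.
cards = [2, 2, 2, 5, 5, 5, 10, 10, 10, 25, 25, 25]   # c_i per attribute

all_columns = sum(cards)       # no padwords: every bitcolumn, sum of c_i
A = len(cards)
padded_single = A              # padded single-attribute bitmaps: A
padded_pairs = A // 2          # padded 2-attribute bitmaps: A/2
```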

4.4 Experiments

4.4.1 Experimental Setup

Experiments were executed using a 1.87GHz HP Pavilion PC with 2GB RAM

and Windows Vista operating system. The bitmap index selection, bitmap creation

and query execution algorithms were implemented in Java.

Data

Several data sets are used during experiments. They are characterized as follows:

• synthetic - 12 attributes, 1M records, attributes 1-3 have cardinality 2, 4-6 have

cardinality 5, 7-9 have cardinality 10, and 10-12 have cardinality 25. Attributes

1,4,7,10 follow uniform random distribution, attributes 2,5,8,and 11 follow Zipf

distribution with s parameter set to 1, attributes 3,6,9, and 12 follow Zipf

distribution with s parameter set to 2.

• uniform - uniformly distributed random data set. It has 12 attributes, each

with cardinality 10, 1M records.


• census - a discretized version of the USCensus1990raw data set. The raw dataset

is from the (U.S. Department of Commerce) Census Bureau website. The dis-

cretized dataset comes from the UCI Machine Learning Repository. We used the

ﬁrst 1 million instances and 44 attributes. The cardinality of these attributes

varies from 2 to 16. The data set can be obtained from [47].

• HEP - a bin representation of High-Energy Physics data. It has 12 attributes.

The cardinality per dimension varies between 2 and 12. There are a total of

122 bitcolumns for this data set. This data set is not uniformly distributed.

This data set is made up of more than 2 million records.

• Landsat - a bin representation of satellite imagery data. The data set has 60 attributes, each with cardinality 10. It contains over 275,000 records.
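As a sketch of how a synthetic data set like the first one could be generated (the dissertation does not give generation code; the finite-support Zipf sampling, P(k) ∝ 1/k^s for k = 1..c, is an assumption):

```python
import random

def zipf_value(card, s, rng):
    """Sample 1..card with probability proportional to 1/k**s."""
    weights = [1 / k ** s for k in range(1, card + 1)]
    return rng.choices(range(1, card + 1), weights=weights)[0]

def synthetic_record(rng):
    """One record of the 12-attribute synthetic data set: cardinalities
    2/5/10/25 in groups of three; within each group, one uniform
    attribute, one Zipf(s=1), one Zipf(s=2)."""
    cards = [2, 2, 2, 5, 5, 5, 10, 10, 10, 25, 25, 25]
    rec = []
    for i, c in enumerate(cards):
        if i % 3 == 0:                        # attributes 1, 4, 7, 10
            rec.append(rng.randint(1, c))     # uniform random
        else:                                 # 2,5,8,11: s=1; 3,6,9,12: s=2
            rec.append(zipf_value(c, 1 if i % 3 == 1 else 2, rng))
    return rec
```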

Queries

Query sets were generated by selecting random points from the data and con-

structing point queries from those points. These points remain in the data set, so

that it is guaranteed that there will be at least one match for each query. For those

cases where range queries are used, the range is expanded around a selected point so

that the range is guaranteed to contain at least one match.
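A minimal sketch of this query-generation scheme; the clamping details and the zero-based attribute numbering are assumptions:

```python
import random

def make_queries(data, cards, n, range_len=1, rng=None):
    """Build n queries from sampled records. range_len=1 yields point
    queries; larger values expand a range around the sampled value,
    clamped to [1, card], so the source record always matches."""
    rng = rng or random.Random()
    queries = []
    for rec in rng.sample(data, n):
        q = {}
        for attr, (v, c) in enumerate(zip(rec, cards)):
            lo = max(1, min(v, c - range_len + 1))  # keep v inside the range
            q[attr] = set(range(lo, lo + range_len))
        queries.append(q)
    return queries
```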

Metrics

Our goal is to compare performance of a system reﬂecting the current state of the

art, where only single attribute bitmaps are available, against our proposed approach

in which both single and multi-attribute bitmaps are available. We present compar-

isons in terms of query execution time, number of bitmaps involved in answering a


query, and total bitmap size accessed to answer a query. Query execution assumes

the bitmap indexes are in main memory. The performance advantage of our approach would be even greater if the required bitmaps needed to be read from disk.

4.4.2 Experimental Results

Index Selection

We executed the index selection algorithm over the synthetic data set. We set the

cardLimit parameter to 100 and set the query probability parameters to 1 (i.e. no

degradation at higher number of attributes). The attribute sets that were selected

to combine together were [1,4,7], [2,5,8], and [3,6,9]. This solution makes sense in the context of the weighted-benefit performance metric. We would expect greater benefit for combinations in which the number of value combinations approaches the cardinality limit, because a sparser bitmap can be more effectively compressed. We would also expect greater benefit from combining attributes that are independent. Therefore set [1,4,7] is selected first, because its value combination cardinality is 100 and the values are independent. The other combinations have

greater inter-attribute dependency because they follow Zipf distributions. Attributes

10, 11, and 12 do not combine with any others in this scenario because there are

no free attributes that would be within cardinality constraints. The average query

execution of 100 point queries using only single attribute bitmaps was 41.47ms, while

it was 17.39 ms using the recommended indexes. The time for using the recommended

index set includes an average overhead of 0.25ms to explore the richer set of options

and select the best query execution plan.

After executing index selection with parameters probQT and probSR set to 0.5,

we instead get 2-attribute recommendations. The proposed set of indexes is [7,8],


Query    Range Length
         A1    A2
Q1       1     1
Q2       1     2
Q3       1     5
Q4       2     5
Q5       2     5
Q6       5     5

Table 4.2: Query Deﬁnitions for 2 Attributes with cardinality 10 and uniform random

distribution.

[1,10], [2,11], [4,9], [5,6], and [3,12]. Each of these has a value combination cardinality

greater than or equal to 50 and selection order is dependent on cardinality and then

inter-attribute independence.

Query Execution

Controlled Query Types. We ran 6 query types over bitmaps built over 2

attributes with uniform random distribution and cardinality 10. The queries are

detailed in Table 4.2 and vary in the length of the range. A range of 1 means that

the query for that attribute is a point query. Q1, for example, is a point query on

both attributes, and Q3 selects one value from the ﬁrst attribute and ﬁve values from

the second attribute. Note that in this case, 5 is the worst case scenario. Any range

bigger than ﬁve could take the complement of the non-queried range to answer the

query.
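The complement observation can be sketched as follows, with bitmaps again modeled as Python integers and `mask` standing in for the all-1's bitmap:

```python
def range_bitmap(columns, card, lo, hi, n_records):
    """OR of one attribute's bitcolumns for values lo..hi; columns[v] is
    value v's bitmap. When the range covers more than half the values,
    OR the few non-matching bitcolumns and negate instead."""
    mask = (1 << n_records) - 1
    wanted = range(lo, hi + 1)
    if len(wanted) <= card - len(wanted):
        out = 0
        for v in wanted:                 # direct OR of matching columns
            out |= columns[v]
    else:
        out = 0
        for v in range(1, card + 1):     # OR the complement, then negate
            if v not in wanted:
                out |= columns[v]
        out = ~out & mask
    return out
```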

Figures 4.6, 4.7, and 4.8 show the query execution time, the number of bitmaps involved in answering each query type, and the total size of the bitmaps involved in answering the query, respectively. In these figures, the ‘Only 1-Dim’ label refers to the case where

only Single Attribute Bitmaps are available, the ‘Only M-Dim’ label refers to the


case where the queries are forced to use Multi-Attribute Bitmaps, and the ‘Proposed’

label refers to the case where the query execution plan is selected considering both

indexes available.

For queries Q1-Q4, the multi-attribute bitmaps are always selected, and they are

signiﬁcantly faster than the single-attribute options. For queries Q5-Q6, if single and

multi-attribute bitmaps are available, single attribute is always selected. However,

the best is always slightly faster than the proposed technique due to the overhead of

evaluating and selecting the best option. For these queries, the overhead was on the

order of 0.2ms. The proposed technique usually selects the option that involves the

shortest original bitmaps required to answer the query. The exception here is Q5.

The single-dimension index is used because it requires fewer bitmap operations and less total bitmap operand length over the set of operations that need to take

place.

Figure 4.6: Query Execution Time for each query type.


Figure 4.7: Number of bitmaps involved in answering each query type.

Query Execution for Point Queries. Figure 4.9 demonstrates the potential

query speedup associated with our proposed query processing technique when multi-

attribute bitmaps are available to answer queries. The single-D bar shows the average

query execution time required to answer 100 point queries using traditional bitmap

index query execution, where points are randomly selected from the data set (there

is at least one query match). For the multi-attribute test, indexes were selected and

generated with a maximum set size of 2. So a suite of bitcolumns representing 2-

attribute combinations is available for query execution, as well as the single attribute

bitmaps. Each single attribute is represented in exactly one attribute pair. The

results show signiﬁcant speedup in query execution for point queries. Point queries

over the uniform, landsat, and hep datasets experience speedups of factors of 5.01,

3.23, and 3.17 respectively.


Figure 4.8: Total size of the bitmaps involved in answering each query type.

Similar results were achieved for the census data set. Indexes were selected without

a dimensionality limit and with a cardinality limit of 100. The query execution time

using only single attribute bitmaps was 83.01 ms, while the query execution time was only 24.92 ms when both single attribute and multi-attribute bitmaps were available.

Query Execution Over Range Queries. Figure 4.10 shows the impact of

using the single-attribute bitmaps instead of the proposed model as the queries trend

from being pure point queries to gradually more range-like queries. Average query

execution times over 100 queries are shown. The queries are constructed by making

the matching range of an attribute either a single value, or half of the attribute

cardinality. Whether an attribute is a point or a range is randomly determined. The

x-axis shows the probability that a given pair of attributes in the query represents a

range (e.g., 0.1 means that there is a 90% probability that the 2 query attributes are both over

single values). The graph shows signiﬁcant speedup for point queries. Although the


Figure 4.9: Query Execution Times, Point Queries, Traditional versus Multi-

Attribute Query Processing

speedup ratio decreases as the queries become more range-like, the multi-attribute

alternative maintains a performance advantage until nearly all the query attributes

are over ranges. This is because there are still some cases where using the multi-

attribute bitmap is useful.

Index Size

Figure 4.11 shows bitmap index sizes for each data set. The lower portion of the

bar shows the total size of the bitmaps for all of the single attribute indexes. The

upper portion of the bar shows the size of the suite of 2-attribute bitmaps (each at-

tribute represented exactly once in the suite). The entire bar represents the space

required for both the single and 2-attribute bitmaps. Even though the multi-attribute


Figure 4.10: Query Execution Times as Query Ranges Vary, Landsat Data

bitmaps represent many more columns than the single attribute bitmaps (e.g., the single-dimension landsat attributes are covered by 600 bitcolumns, while the 2-attribute counterpart is represented by 3000 bitmaps: 30 attribute pairs, 100 bitmaps per pair), the compression associated with the combined bitmaps means that the space overhead for the multi-attribute representations is not as severe as might be expected.
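The bitcolumn counts quoted for landsat follow directly from the attribute counts and cardinalities:

```python
# Landsat: 60 attributes of cardinality 10, paired into 30 disjoint pairs.
attrs, card = 60, 10
single_cols = attrs * card               # one bitcolumn per (attr, value)
pair_cols = (attrs // 2) * card * card   # one per (pair, value combination)
```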


Figure 4.11: Single and Multi-Attribute Bitmap Index Size Comparison

Bitmap Operations versus Bitmap Lengths

Since query execution enhancement is derived from two sources, reduction in

bitmap operations and reduction in bitmap length, it is valuable to know the rel-

ative impacts of each one. To this end, we tested query execution after performing

index selection using diﬀerent strategies.

Table 4.3 shows the query speedup for point queries using diﬀerent strategies to

determine the attributes that should be combined together. The speedup is measured

over 100 random point queries taken from HEP data. For each listed strategy, a suite

of size 2 attribute sets is determined diﬀerently. Randomly selected attributes simply

combines random pairs of attributes. The other 2 strategies use Algorithm 3 to

generate goodness factors for each combination based on bitmap length saved and

weighted by result density. The best-combinations strategy greedily selects non-intersecting


    Attribute Combination Strategy                          Query Speedup
1   Algorithm 1, Selecting Best Combinations                3.15
2   Randomly Selected Attributes                            2.79
3   Algorithm 1, Selecting Worst Beneficial Combinations    2.67

Table 4.3: Query Speedup, HEP Data, Using Diﬀerent Attribute Combination Strate-

gies

pairs starting from the top of the list, while the other technique greedily selects

non-overlapping pairs from the bottom. The results show that signiﬁcant speedup

can be achieved simply by combining reasonable attributes together, and that query

execution can be further enhanced by careful selection of the indexes.

4.5 Conclusion

In this chapter, we propose the use of multi-attribute bitmap indexes to speed up

the query execution time of point and range queries over traditional bitmap indexes.

We introduce an index selection technique in order to determine which attributes

are appropriate to combine as multi-attribute bitmaps. The technique is driven by a

performance metric that is tied to actual query execution time. It takes into account

data distribution and attribute correlations. It can also be tuned to avoid undesirable

conditions such as an excessive number of bitmaps or excessive bitmap dimensionality.

We introduce a query execution model when both traditional and multi-attribute

bitmaps are available to answer a query. The query execution takes advantage of the

multi-attribute bitmaps when it is beneﬁcial to do so while introducing very little

overhead for cases when such bitmaps are not useful. Utilizing our index selection


and query execution techniques, we were able to achieve up to 5 times

query execution speedup for point queries while introducing less than 1% overhead

for those queries where multi-attribute indexing was not useful.


CHAPTER 5

Conclusions

Data retrieval for a given application is characterized by the query types, query

patterns over time, and the distribution and characteristics of the data itself. In order

to maximize query performance, the query processing and access structures should

be tailored to match the characteristics of the application.

The optimization query framework presented herein describes an I/O-optimal

query processing technique to process any query within the broad class of queries

that can be stated as a convex optimization query. For such applications, the query

processing follows an access structure traversal path that matches the function being

optimized while respecting additional problem constraints.

While resolving queries, the possible query execution plans evaluated by a data

management system are limited by available indexes. The online high-dimensional in-

dex recommendation system provides a cost-based technique to identify access struc-

tures that will be beneﬁcial for query processing over a query workload. It considers

both the query patterns and the data distribution to determine the beneﬁt of diﬀerent

alternatives. Since workloads can shift over time, a method to determine when the

current index set is no longer eﬀective is also provided.


The multi-attribute bitmap index work presents both query processing and index selection techniques for the traditionally single-attribute bitmap index. Query processing selects

a query execution plan to reduce the overall number of operations. The index selection

task ﬁnds attribute combinations that lead to the greatest expected query execution

improvement.

The primary end goal of the research presented in this thesis is to improve query

execution time. Both the run-time query processing and design-time access methods

are targeted as potential sources of query time improvement. By ensuring that the

run-time traversal and processing of available access structures matches the charac-

teristic of a given query and that the access structures available match the overall

query workload and data patterns, improved query execution time is achieved.


BIBLIOGRAPHY

[1] S. Agrawal, S. Chaudhuri, L. Kollár, A. P. Marathe, V. R. Narasayya, and

M. Syamala. Database tuning advisor for microsoft sql server 2005. In VLDB,

pages 1110–1121, 2004.

[2] G. Antoshenkov. Byte-aligned bitmap compression. In Data Compression Con-

ference, Nashua, NH, 1995. Oracle Corp.

[3] E. Barucci, R. Pinzani, and R. Sprugnoli. Optimal selection of secondary indexes.

IEEE Transactions on Software Engineering, 1990.

[4] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University

Press, 1961.

[5] S. Berchtold, C. Böhm, D. A. Keim, and H.-P. Kriegel. A cost model for nearest

neighbor search in high-dimensional data space. PODS, pages 78–86, 1997.

[6] S. Berchtold, D. Keim, and H. Kriegel. The x-tree: An index structure for high-

dimensional data. In Proceedings of the International Conference on Very Large

Data Bases, pages 28–39, 1996.

[7] A. Blum and P. Langley. Selection of relevant features and examples in machine

learning. AI, 1997.

[8] C. Bohm. A cost model for query processing in high dimensional data spaces.

ACM Trans. Database Syst., 25(2):129–178, 2000.

[9] C. Bohm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces:

Index structures for improving the performance of multimedia databases. ACM

Comput. Surv., 33(3):322–373, 2001.

[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University

Press, New York, NY. USA, 2004.

[11] N. Bruno and S. Chaudhuri. To tune or not to tune? a lightweight physical

design alerter. In VLDB, pages 499–510, 2006.


[12] G. Canahuate, M. Gibas, and H. Ferhatosmanoglu. Update conscious bitmap

indices. In SSDBM, 2007.

[13] A. Capara, M. Fischetti, and D. Maio. Exact and approximate algorithms for

the index selection problem in physical database design. IEEE Transactions on

Knowledge and Data Engineering, 1995.

[14] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and R. J. Smith. The

onion technique: indexing for linear optimization queries. ACM SIGMOD, 2000.

[15] S. Chaudhuri and V. Narasayya. Autoadmin ’what-if’ index analysis utility. In

Proceedings ACM SIG-MOD Conference, pages 367–378, 1998.

[16] S. Chaudhuri and V. R. Narasayya. An eﬃcient cost-driven index selection tool

for microsoft SQL server. In The VLDB Journal, pages 146–155, 1997.

[17] S. Choenni, H. Blanken, and T. Chang. On the selection of secondary indexes

in relational databases. Data and Knowledge Engineering, 1993.

[18] R. L. D. C. Costa and S. Lifschitz. Index self-tuning with agent-based databases.

In XXVIII Latin-American Conference on Informatics (CLIE), Montevideo,

Uruguay, 2002.

[19] A. Dogac, A. Y. Erisik, and A. Ikinci. An automated index selection tool for

oracle7: Maestro 7. Technical Report LBNL/PUB-3161, TUBITAK Software

Research and Development Center, 1994.

[20] H. Edelstein. Faster data warehouses. Information Week, December 1995.

[21] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middle-

ware. Journal of Computer and System Sciences, 2003.

[22] C. Faloutsos, T. Sellis, and N. Roussopoulos. Analysis of object oriented spatial

access methods. In SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD in-

ternational conference on Management of data, pages 426–439, New York, NY,

USA, 1987. ACM Press.

[23] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. E. Abbadi. Vector approx-

imation based indexing for non-uniform high dimensional data sets. In CIKM

’00.

[24] M. Frank, E. Omiecinski, and S. Navathe. Adaptive and automated index selec-

tion in rdbms. In International Conference on Extending Database Technology

(EDBT), Vienna, Austria, 1992.


[25] V. Gaede and O. Gunther. Multidimensional access methods. ACM Comput.

Surv., 30(2):170–231, 1998.

[26] A. Gionis, P. Indyk, and R. Motwani. Similarity searching in high dimensions

via hashing. In Proceedings of the Int. Conf. on Very Large Data Bases, pages

518–529, Edinburgh, Scotland, UK, September 1999.

[27] M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. Kluwer (Non-

convex Optimization and its Applications series), Dordrecht, 2005. In press.

[28] G.-H. Cha and C.-W. Chung. The gc-tree: a high-dimensional index structure

for similarity search in image databases. IEEE Transactions on MultiMedia,

4(2):235–247, June 2002.

[29] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Jour-

nal of Machine Learning Research, 2003.

[30] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate gen-

eration. In W. Chen, J. Naughton, and P. A. Bernstein, editors, 2000 ACM

SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, 05

2000.

[31] G. R. Hjaltason and H. Samet. Ranking in spatial databases. In Symposium on

Large Spatial Databases, pages 83–95, 1995.

[32] V. Hristidis, N. Koudas, and Y. Papakonstantinou. PREFER: A system for the

eﬃcient execution of multi-parametric ranked queries. In SIGMOD Conference,

2001.

[33] Informix Inc. Informix decision support indexing for the enterprise data warehouse.

http://www.informix.com/informix/corpinfo/- zines/whiteidx.htm.

[34] M. Ip, L. Saxton, and V. Raghavan. On the selection of an optimal set of indexes.

IEEE Transactions on Software Engineering, 1983.

[35] S. Kai-Uwe, E. Schallehn, and I. Geist. Autonomous query-driven index tun-

ing. In International Database Engineering & Applications Symposium, Coimbra,

Portugal, 2004.

[36] R. Kohavi and G. John. Wrappers for feature subset selection. AI, 1997.

[37] F. Korn and S. Muthukrishnan. Inﬂuence sets based on reverse nearest neighbor

queries. In SIGMOD, 2000.

[38] M. S. Lobo. Robust and Convex Optimization with Applications in Finance. PhD

thesis, The Department of Electrical Engineering, Stanford University, March

2000.


[39] B. S. Manjunath. Airphoto dataset. http://vivaldi.ece.ucsb.edu/Manjunath/research.htm,

May 2000.

[40] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of

image data. IEEE Transactions on Pattern Analysis and Machine Intelligence,

18(8):837–842, August 1996.

[41] R. P. Mount. The Oﬃce of Science Data-Management Challenge. DOE Oﬃce

of Advanced Scientiﬁc Computing Research, March–May 2004.

[42] Y. Nesterov and A. Nemirovsky. A general approach to polynomial-time algo-

rithms design for convex programming. Technical report, Centr. Econ. & Math.

Inst., USSR Acad. Sci., Moscow, USSR, 1988.

[43] Y. Nesterov and A. Nemirovsky. Interior-Point Polynomial Algorithms in Convex

Programming. SIAM, Philadelphia, PA 19101, USA, 1994.

[44] J. Pei, J. Han, and R. Mao. CLOSET: An eﬃcient algorithm for mining frequent

closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining

and Knowledge Discovery, pages 21–30, 2000.

[45] S. Ponce, P. M. Vila, and R. Hersch. Indexing and selection of data items in

huge data sets by constructing and accessing tag collections. In Proceedings of

the 19th IEEE Symposium on Mass Storage Systems and Tenth Goddard Conf.

on Mass Storage Systems and Technologies, 2002.

[46] M. Ramakrishna. Indexing goes a new direction, volume 2, page 70, 1999.

[47] UCI Machine Learning Repository. The US Census 1990 data set.

http://kdd.ics.uci.edu/databases/census1990/USCensus1990-desc.html.

[48] N. Roussopoulos, S. Kelly, and F. Vincent. Nearest neighbor queries. In SIG-

MOD, 1995.

[49] A. Shoshani, L. Bernardo, H. Nordberg, D. Rotem, and A. Sim. Multi-

dimensional indexing and query coordination for tertiary storage management.

In Proceedings of the 11th International Conference on Scientiﬁc and Statistical

Data(SSDBM), 1999.

[50] R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes for scientiﬁc data.

ACM Trans. Database Syst., 32(3):16, 2007.

[51] M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with prob-

abilistic guarantees. VLDB, 2004.


[52] G. Valentin, M. Zuliani, D. Zilio, G. Lohman, and A. Skelley. Db2 advisor: An

optimizer smart enough to recommend its own indexes. In Proceedings Interna-

tional Conference Data Engineering, 2000.

[53] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance

study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.

[54] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance

study for similarity-search methods in high-dimensional spaces. In Proceedings

of the 24th International Conference on Very Large Databases, pages 194–205,

1998.

[55] K. Whang. Index selection in relational databases. In International Conference

on Foundations on Data Organization (FODO), Kyoto, Japan, 1985.

[56] K. Wu, W. Koegler, J. Chen, and A. Shoshani. Using bitmap index for interactive

exploration of large datasets. In Proceedings of SSDBM, 2003.

[57] K. Wu, E. Otoo, and A. Shoshani. Compressing bitmap indexes for faster search

operations. In SSDBM, 2002.

[58] K. Wu, E. Otoo, and A. Shoshani. On the performance of bitmap indices for high

cardinality attributes. Lawrence Berkeley National Laboratory. Paper LBNL-

54673., March 2004.

[59] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indexes with eﬃcient

compression. ACM Transactions on Database Systems, 2006.

[60] Z. Zhang, S. Hwang, K. C.-C. Chang, M. Wang, C. Lang, and Y. Chang. Boolean

+ ranking: Querying a database by k-constrained optimization. In SIGMOD,

2006.


ABSTRACT

The proliferation of massive data sets across many domains and the need to gain meaningful insights from these data sets highlight the need for advanced data retrieval techniques. Because I/O cost dominates the time required to answer a query, sequentially scanning the data and evaluating each data object against query criteria is not an eﬀective option for large data sets. An eﬀective solution should require reading as small a subset of the data as possible and should be able to address general query types. Access structures built over single attributes also may not be eﬀective because 1) they may not yield performance that is comparable to that achievable by an access structure that prunes results over multiple attributes simultaneously and 2) they may not be appropriate for queries with results dependent on functions involving multiple attributes. Indexing a large number of dimensions is also not eﬀective, because either too many subspaces must be explored or the index structure becomes too sparse at high dimensionalities. The key is to ﬁnd solutions that allow for much of the search space to be pruned while avoiding this ‘curse of dimensionality’. This thesis pursues query performance enhancement using two primary means 1) processing the query eﬀectively based on the characteristics of the query itself and 2) physically organizing access to data based on query patterns and data characteristics. Query performance enhancements are described in the context of several novel applications


including 1) Optimization Queries, which presents an I/O-optimal technique to answer queries when the objective is to maximize or minimize some function over the data attributes, 2) High-Dimensional Index Selection, which oﬀers a cost-based approach to recommend a set of low dimensional indexes to eﬀectively address a set of queries, and 3) Multi-Attribute Bitmap Indexes, which describes extensions to a traditionally single-attribute query processing and access structure framework that enables improved query performance.


ACKNOWLEDGMENTS

First of all, I would like to thank my wife, Moni Wood, for her love and support. She has made many sacrifices and expended much effort to help make this dream a reality. I would not have been able to complete this dissertation without her. I also would not have been able to accomplish this without the help of my family. I thank my mother Marilyn, my sister Susan, and my grandmother Margaret for their support. I thank my late father Murray for providing an example to which I have always aspired. Moni's family was also understanding and supportive of the travails of Ph.D. research.

I want to express my appreciation to my adviser, Hakan Ferhatosmanoglu, for his guidance and encouragement over the years. He has been generous with his time, effort, and advice during my graduate studies. I am thankful for the technical discussions we have had and the opportunities he has given me to work on interesting projects.

Many people in the department have helped me to get to this point. I specifically want to thank my dissertation committee members, Dr. Atanas Rountev and Dr. Hui Fang, for their efforts and valuable comments on my research. I wish to thank Dr. Bruce Weide for his help and support during my studies, Dr. Rountev for my early exposure to research and paper writing, and Dr. Jeff Daniels for giving me the opportunity to work on an interdisciplinary geo-physics project.

Many friends and members of the Database Research Group provided valuable guidance, discussions, and critiques over the past several years. I want to especially thank Guadalupe Canahuate for her help and friendship during this time. Also thanks to her family for providing enough Dominican coffee to get through all the paper deadlines. My Ph.D. research was supported by NSF grant IIS-0546713.


VITA

May 10, 1969 . . . . . . . . . . . Born - Omaha, Nebraska

1993 . . . . . . . . . . . . . . . B.S. Electrical Engineering, Ohio State University, Columbus, Ohio

2004 . . . . . . . . . . . . . . . M.S. Computer Science and Engineering, Ohio State University, Columbus, Ohio

1989-1992 . . . . . . . . . . . . Electrical Engineering Co-op, British Petroleum, Various Locations

1994-2002 . . . . . . . . . . . . Project Manager, Acquisition Logistics Engineering, Worthington, Ohio

2003-present . . . . . . . . . . Graduate Assistant, Office of Research, Department of Computer Science and Engineering, and Department of Geology, Ohio State University, Columbus, Ohio

PUBLICATIONS

Research Publications

Michael Gibas, Guadalupe Canahuate, and Hakan Ferhatosmanoglu. Online Index Recommendations for High-Dimensional Databases Using Query Workloads. IEEE Transactions on Knowledge and Data Engineering (TKDE), February 2008 (Vol. 20, No. 2), pp. 246-260.


Michael Gibas and Hakan Ferhatosmanoglu. A General Framework for Modeling and Processing Optimization Queries. 33rd International Conference on Very Large Data Bases (VLDB 07), Vienna, Austria, September 2007, pp. 1069-1080.

Guadalupe Canahuate, Michael Gibas, and Hakan Ferhatosmanoglu. Update Conscious Bitmap Indexes. International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, July 2007.

Michael Gibas and Hakan Ferhatosmanoglu. High-Dimensional Indexing. Encyclopedia of Geographical Information Science (E-GIS).

Guadalupe Canahuate, Michael Gibas, and Hakan Ferhatosmanoglu. Indexes for Databases with Missing Data. Ohio State Technical Report OSU-CISRC-10/07-TR74.

Guadalupe Canahuate, Michael Gibas, and Hakan Ferhatosmanoglu. Indexing Incomplete Databases. International Conference on Extending Database Technology (EDBT), Munich, Germany, March 2006, pp. 884-901.

Michael Gibas and Hakan Ferhatosmanoglu. Fast Semantic-based Similarity Searches in Text Repositories. Conference on Using Metadata to Manage Unstructured Text, LexisNexis, Dayton, OH, October 2005.

Fatih Altiparmak, Tan Apaydin, and Hakan Ferhatosmanoglu. Relationship Preserving Feature Selection for Unlabelled Time-Series. Ohio State Technical Report.

Atanas Rountev, Scott Kagan, and Michael Gibas. Static and Dynamic Analysis of Call Chains in Java. ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA'04), July 2004, pages 1-11.

Atanas Rountev, Scott Kagan, and Michael Gibas. Evaluating the Imprecision of Static Analysis. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE'04), June 2004, pages 14-16.

Michael Gibas. Developing Indices for High Dimensional Databases Using Query Access Patterns. Masters Thesis, The Ohio State University, June 2004.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Chapters:

1. Introduction
   1.1 Overall Technical Motivation
   1.2 Novel Database Management Applications

2. Modeling and Processing Optimization Queries
   2.1 Introduction
   2.2 Related Work
   2.3 Query Model Overview
       2.3.1 Background Definitions
       2.3.2 Common Query Types Cast into Optimization Query Model
       2.3.3 Optimization Query Applications
       2.3.4 Approach Overview
   2.4 Query Processing Framework
       2.4.1 Query Processing over Hierarchical Access Structures
       2.4.2 Query Processing over Non-hierarchical Access Structures
       2.4.3 Non-Covered Access Structures
   2.5 I/O Optimality
   2.6 Performance Evaluation
   2.7 Conclusions

3. Online Index Recommendations for High-Dimensional Databases using Query Workloads
   3.1 Introduction
   3.2 Related Work
       3.2.1 High-Dimensional Indexing
       3.2.2 Feature Selection
       3.2.3 Index Selection
       3.2.4 Automatic Index Selection
   3.3 Approach
       3.3.1 Problem Statement
       3.3.2 Approach Overview
       3.3.3 Proposed Solution for Index Selection
       3.3.4 Proposed Solution for Online Index Selection
       3.3.5 System Enhancements
   3.4 Empirical Analysis
       3.4.1 Experimental Setup
       3.4.2 Experimental Results
   3.5 Conclusions

4. Multi-Attribute Bitmap Indexes
   4.1 Introduction
   4.2 Background
   4.3 Proposed Approach
       4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries
       4.3.2 Query Execution with Multi-Attribute Bitmaps
       4.3.3 Index Updates
   4.4 Experiments
       4.4.1 Experimental Setup
       4.4.2 Experimental Results
   4.5 Conclusions

5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

LIST OF FIGURES

Figure

2.1 Sample Optimization Objective Function and Constraints. F(θ, A): (x − 1)^2 + (y − 2)^2, o(min), g1(α, A): −y + 3 ≤ 0, g2(α, A): y − 6 ≤ 0, h1(β, A): x − 5 = 0, k = 1 . . . 12

2.2 Example of Hierarchical Query Processing . . . 20

2.3 Optimization Query System Functional Block Diagram . . . 25

2.4 Cases of MBRs with respect to R and optimal objective function contour . . . 30

2.5 Cases of partitions with respect to R and optimal objective function contour . . . 32

2.6 Random weighted kNN vs kNN, Page Accesses, R*-Tree, 8-D Histogram, 100k pts . . . 35

2.7 Random weighted kNN vs kNN, CPU Time, R*-Tree, 8-D Color Histogram, 100k pts . . . 36

2.8 Random weighted kNN vs kNN, Accesses, VA-File, 8-D Histogram, 100k pts . . . 38

2.9 Random Weighted Constrained k-LP Queries, Accesses, R*-tree, 60-D Sat, 50k Points . . . 40

2.10 Accesses vs. Constraints, NN Query, R*-tree, 5-D Uniform Random, 100k Points . . . 43

2.11 Page Accesses vs. Index Type, Optimizing Random Function, o(min), k = 1 . . . 44

3.1 Dynamic Index Analysis Framework . . . 61

3.2 Index Selection Flowchart . . . 73

3.3 Costs in Data Object Accesses for Ideal, Naive, Sequential Scan, AutoAdmin, and the Proposed Index Selection Technique using Relaxed Constraints . . . 81

3.4 Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, synthetic Query Workload . . . 85

3.5 Baseline Comparative Cost of Adaptive versus Static Indexing, stock Dataset, hr Query Workload . . . 86

3.6 Baseline Comparative Cost of Adaptive versus Static Indexing, clinical Query Workload . . . 87

3.7 Comparative Cost of Online Indexing as Support Changes, random Dataset, synthetic Query Workload . . . 89

3.8 Comparative Cost of Online Indexing as Support Changes, stock Dataset, hr Query Workload . . . 90

3.9 Comparative Cost of Online Indexing as Major Change Threshold Changes, random Dataset, synthetic Query Workload . . . 92

3.10 Comparative Cost of Online Indexing as Major Change Threshold Changes, stock Dataset, hr Query Workload . . . 93

3.11 Comparative Cost of Online Indexing as Multi-Dimensional Histogram Size Changes, random Dataset, synthetic Query Workload . . . 94

4.1 A WAH bit vector . . . 102

4.2 Query Execution Example over WAH compressed bit vectors, ANDing 6 bitmaps . . . 103

4.3 Query Execution Times vs Total Bitmap Length, Point Queries, random Dataset, 2M entries . . . 105

4.4 Query Execution Times vs Number of Attributes queried, Point Queries, random Dataset, 2M entries . . . 106

4.5 Simple multiple attribute bitmap example showing the combination of two single attribute bitmaps . . . 108

4.6 Query Execution Time for each query type . . . 125

4.7 Number of bitmaps involved in answering each query type . . . 126

4.8 Total size of the bitmaps involved in answering each query type . . . 127

4.9 Query Execution Times, Traditional versus Multi-Attribute Query Processing, Landsat Data . . . 128

4.10 Query Execution Times as Query Ranges Vary . . . 129

4.11 Single and Multi-Attribute Bitmap Index Size Comparison . . . 130

CHAPTER 1

Introduction

If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search. I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor.

- Nikola Tesla, 1931

In order to make better use of the massive quantities of data being generated by and for modern applications, it is necessary to find information that is in some respect interesting within the data. Two primary challenges of this task are specifying the meaning of interesting within the context of an application and searching the data set to find the interesting data efficiently. In the case of our proverbial haystack, we can think of each piece of straw as a data object and needles as data objects that are interesting. While we could guarantee that we find the most interesting needle by examining each and every piece of straw, we would be much better served by limiting our search to a small region within the haystack.

1.1 Overall Technical Motivation

Analogous to our physical haystack example, the query response time during data exploration tasks using a database system is affected both by the physical organization

of the data and by the way in which the data is accessed. Therefore, query performance can be enhanced by traversing the data more effectively during query resolution or by organizing access to the data so that it is more suitable for the query. Physically reorganizing data or data access structures is time consuming and therefore should be performed for a set of queries rather than by trying to optimize an organization for a single query. Additionally, queries can vary widely on a per-query basis, so it is highly desirable to process generic queries at run-time in a way that is effective for each particular query.

This thesis targets query performance enhancement through two sources: 1) runtime query processing, such that the search paths and intermediate operations during query resolution are appropriate for a given generic query, and 2) data organization, such that the access structures available during query resolution are appropriate for the overall query patterns and data distribution and allow for significant data space pruning during query resolution. Query performance enhancement is explored using three introduced applications, motivated and described in the next section.

1.2 Novel Database Management Applications

Optimization Queries. Exploration of image, scientific, and business data usually requires iterative analyses that involve sequences of queries with varying parameters. Besides traditional queries, similarity-based analyses and complex mathematical and statistical operators are used to query such data [41]. Many of these data exploration queries can be cast as optimization queries where a linear or non-linear function over object attributes is minimized or maximized (our needle). Depending on the application, the object attributes queried, function parametric values, and data constraints can vary on a per-query basis.
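Before any pruning, the baseline for such an optimization query is a sequential scan that scores every object against the query function. A minimal sketch of that baseline (hypothetical data and helper names, not code from the dissertation; constraints are simplified to boolean predicates) makes the cost concrete: every record is read regardless of the query.

```python
# Illustrative sketch: the sequential-scan baseline for an optimization
# query. Every record is read and scored, which is exactly the I/O cost
# that index-based pruning is meant to avoid.

def scan_optimize(records, objective, constraints=(), k=1, minimize=True):
    """Return the k records extremizing `objective` among those
    satisfying every predicate in `constraints`."""
    feasible = [r for r in records if all(c(r) for c in constraints)]
    feasible.sort(key=objective, reverse=not minimize)
    return feasible[:k]

# Example: minimize the squared distance to (1, 2), subject to
# x == 5 and 3 <= y <= 6 (a constrained nearest-neighbor query).
points = [(5.0, 3.5), (5.0, 6.0), (2.0, 2.0), (5.0, 4.8)]
best = scan_optimize(
    points,
    objective=lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2,
    constraints=(lambda p: p[0] == 5.0,
                 lambda p: 3.0 <= p[1] <= 6.0),
)
print(best)  # [(5.0, 3.5)]
```

Because the scan touches every record, its I/O grows linearly with the data set size no matter how selective the query is; the two enhancement sources above both aim to read far less than this.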

Because scientific data exploration and business decision support systems rely heavily on function optimization tasks, and because such tasks encompass such disparate query types with varying parameters, these application domains can benefit from a general model-based optimization framework. Such a framework should meet the following requirements: (i) have the expressive power to support solution of each query type, (ii) take advantage of query constraints to prune search space, and (iii) maintain efficient performance when the scoring criteria or optimization function changes. A general optimization model and query processing framework are proposed for optimization queries. The model covers those queries where some function is being minimized or maximized over object attributes, possibly within a set of constraints. The query processing framework can be used in those cases where the function being optimized and the set of constraints are convex. For access structures where the underlying partitions are convex, the query processing framework is used to access the most promising partitions and data objects and is proven to be I/O-optimal.

High-Dimensional Index Selection. With respect to the physical organization of the data, access structures or indexes that cover query attributes can be used to prune much of the data space and greatly reduce the required number of page accesses to answer the query. However, due to the oft-cited curse of dimensionality, indexes built over many dimensions actually perform worse than sequential scan. This problem can be addressed by using sets of lower dimensional indexes that effectively answer a large portion of the incoming queries. This approach can work well if the characteristics of future queries are well known, but in general this is not the case,

and an index set built based on historical queries may be completely ineffective for new queries. In the case of physical organization, query performance is affected by creating a number of low dimension indexes using the historical and incoming query workloads and data statistics. Historical query patterns are data mined to find those attribute sets that are frequently queried together. Potential indexes are evaluated to derive an initial index set recommendation that is effective in reducing query cost for a large portion of queries, such that the set of indexes is within some indexing constraint. A set of indexes is obtained that in effect reduces the dimensionality of the query search space and is effective in answering queries. A control feedback mechanism is incorporated so that the performance of newly arriving queries can be monitored, and if the current index set is no longer effective for the current query patterns, the index set can be changed. This flexible framework can be tuned to maximize indexing accuracy or indexing speed, which allows it to be used for the initial static index recommendation and also for online dynamic index recommendation.

Multi-Attribute Bitmap Indexes. The nature of the database management system used and the way in which data is physically accessed also affect query execution times. Bitmap indexes have been shown to be effective as a data management system for many domains such as data warehousing and scientific applications. This owes to the fact that they are a column-based storage structure, so only the data needed to address a query is ever retrieved from disk, and their base operations are performed using hardware-supported fast bitwise operations. Static bitmap indexes have traditionally been built over each attribute individually. The query model for selection queries is simple and involves ORing together bitcolumns that match query

A query processing model where multi-attribute bitmaps are available for query resolution is presented. Methods to determine attribute combination candidates are provided. the bitmap index structure is limited to those domains that it is naturally suited for without change. For many queries. Multi-attribute bitmap indexes are introduced to extend the functionality and applicability of traditional bitmap indexes. query execution times are improved by reducing size of bitmaps used in query execution and by reducing the number of binary operations required to resolve the query without excessive overhead for queries that do not beneﬁt from multi-attribute pruning. 5 . As a result.criteria within an attribute and ANDing inter-attribute intermediate results. As well query resolution does not take advantage of potential multi-attribute pruning and possible reductions in the number of binary operations.

CHAPTER 2

Modeling and Processing Optimization Queries

The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' (I found it!) but 'That's funny ...'

- Isaac Asimov

One challenge associated with data management is determining a cost-effective way to process a given query. Scientific discovery and business intelligence activities are frequently exploring for data objects that are in some way special or unexpected. Such data exploration can be performed by iteratively finding those data objects that minimize or maximize some function over the data attributes. While the answer to a given query can be found by computing the function for all data objects and finding those objects that yield the most extreme values, this approach may not be feasible for large data sets. Given the time expense of I/O, minimizing the I/O required to answer a query is a primary strategy in query optimization. The optimization query framework presented in this chapter provides a method to generically handle an important broad class of queries, while ensuring that, given an access structure, the query is I/O-optimally processed over that structure.

2. (ii) take advantage of query constraints to proactively prune search space. and data constraints can vary on a per-query basis. scientiﬁc. these application domains can beneﬁt from a general model-based optimization framework. and business data usually requires iterative analyses that involve sequences of queries with varying parameters. and (iii) maintain eﬃcient performance when the scoring criteria or optimization function changes. Such a framework should meet the following requirements: (i) have the expressive power to support existing query types and the power to formulate new query types. especially for information retrieval and data analysis.1 Introduction Motivation and Goal Optimization queries form an important part of realworld database system utilization. Besides traditional queries. top-k queries [21. etc. Because scientiﬁc data exploration and business decision support systems rely heavily on function optimization tasks. such as the query processing algorithms for Nearest Neighbors (NN) [48] and its variants [37]. Many of these data exploration queries can be cast as optimization queries where a linear or non-linear function over object attributes is minimized or maximized. 51]. Additionally. similar-ity-based analyses and complex mathematical and statistical operators are used to query such data [41]. However. they either only deal with speciﬁc function forms or require strict properties for the query functions. and because such tasks encompass such disparate query types with varying parameters. These queries can all be considered to be a type of optimization query with diﬀerent objective functions. There has been a rich set of literature on processing speciﬁc types of queries. 7 . function parametric values. the object attributes queried. Exploration of image.

Despite an extensive list of techniques designed specifically for different types of optimization queries, a unified query technique for a general optimization model has not been addressed in the literature. Furthermore, application of specialized techniques requires that the user know of, possess, and apply the appropriate tools for the specific query type. We propose a generic optimization model to define a wide variety of query types that includes simple aggregates, range and similarity queries, and complex user-defined functional analysis. In conjunction with a technique to model optimization query types, we also propose a query processing technique that optimally accesses data and space partitioning structures to answer the query. Since the queries we are considering do not have a specific objective function but only have a "model" for the objective (and constraints) with parameters that vary for different users/queries/iterations, we refer to them as model-based optimization queries. By applying this single optimization query model with I/O-optimal query processing, we can provide the user with a flexible and powerful method to define and execute arbitrary optimization queries efficiently. Example optimization query applications are provided in Section 2.3.3.

Contributions. The primary contributions of this work are listed as follows.

• We propose a general framework to execute convex model-based optimization queries (using linear and non-linear functions) for which additional constraints over the attributes can be specified without compromising query accuracy or performance.

• We define query processing algorithms for convex model-based optimization queries utilizing data/space partitioning index structures without changing the

index structure properties or the ability to efficiently address the query types for which the structures were designed.

• We prove the I/O optimality for these types of queries for data and space partitioning techniques when the objective function is convex and the feasible region defined by the constraints is convex (these queries are defined as CP queries in this chapter).

• We introduce a generic model and query processing framework that optimally addresses existing optimization query types studied in the research literature as well as new types of queries with important applications.

2.2 Related Work

There are few published techniques that cover some instances of model-based optimization queries. One is the "Onion technique" [14], which deals with an unconstrained and linear query model. An example linear model query selects the best school by ranking the schools according to some linear function of the attributes of each school. The solution followed in the Onion technique is based on constructing convex hulls over the data set. The computation of convex hulls over the whole data set is expensive and, in the case of high-dimensional data, infeasible at design time: the computation of convex hulls is exponential with respect to the number of dimensions and is extremely slow. It is also not applicable to constrained linear queries because convex hulls need to be recomputed for each query with a different set of constraints. The Prefer technique [32] is used for unconstrained and linear top-k query types. The access structure is a sorted list of records for an arbitrary linear function. Query

processing is performed for some new linear preference function by scanning the sorted list until a record is reached such that no record below it could match the new query. This avoids the costly build time associated with the Onion technique. However, it addresses only single dimension access structures (i.e., it can not optimize on multiple dimensions simultaneously) and still does not incorporate constraints. Boolean + Ranking [60] offers a technique to query a database to optimize boolean constrained ranking functions. However, it is limited to boolean rather than general convex constraints, does not provide a guarantee on minimum I/O, and therefore largely uses heuristics to determine a cost effective search strategy.

While the Onion and Prefer techniques organize data to efficiently answer a specific type of query (LP), our technique organizes data retrieval to answer more general queries and requires no additional storage space. In contrast with the other techniques listed, our proposed solution is proven to be I/O-optimal for both hierarchical and space partitioned access structures with convex partition boundaries, and covers a broader spectrum of optimization queries.

2.3 Query Model Overview

2.3.1 Background Definitions

Let Re(a1, a2, ..., an) be a relation with attributes a1, a2, ..., an. Without loss of generality, assume all attributes have the same domain. Denote by A a subset of attributes of Re. Let gi(α, A), 1 ≤ i ≤ u, hj(β, A), 1 ≤ j ≤ v, and F(θ, A) be certain functions over attributes in A, where u and v are positive integers, θ = (θ1, θ2, ..., θm) is a vector of parameters for function F, and α and β are vector parameters for the constraint functions.

Definition 1 (Model-Based Optimization Query) Given a relation Re(a1, a2, ..., an), a model-based optimization query Q is given by (i) an objective function F(θ, A) with optimization objective o (min or max), (ii) a set of constraints (possibly empty) specified by inequality constraints gi(α, A) ≤ 0, 1 ≤ i ≤ u (if not empty) and equality constraints hj(β, A) = 0, 1 ≤ j ≤ v (if not empty), and (iii) user/query adjustable objective parameters θ, constraint parameters α and β, and answer set size integer k. The answer of a model-based optimization query is a set of tuples with maximum cardinality k satisfying all the constraints such that there exists no other tuple with smaller (if minimization) or larger (if maximization) objective function value F satisfying all constraints as well. Without loss of generality, we will consider minimization as the optimization objective throughout the rest of this chapter.

This definition is very general, and almost any type of query can be considered as a special case of model-based optimization query. For instance, NN queries over an attribute set A can be considered as model-based optimization queries with F(θ, A) as the distance function (e.g., Euclidean) and minimization as the optimization objective. Similarly, top-k queries, weighted NN queries and linear/non-linear optimization queries can all be considered as specific model-based optimization queries.

Figure 2.1 shows a sample objective function with both inequality and equality constraints. The objective function finds the nearest neighbor to point a (1, 2). The constraints limit the feasible region to the line passing through x = 5, where 3 ≤ y ≤ 6. The points b and d represent the best and worst points within the constraints for the objective function, respectively. Point c could be the actual point in the database that meets the constraints and minimizes the objective function.

Figure 2.1: Sample Optimization Objective Function and Constraints. F(θ, A): (x − 1)^2 + (y − 2)^2, o(min), g1(α, A): −y + 3 ≤ 0, g2(α, A): y − 6 ≤ 0, h1(β, A): x − 5 = 0, k = 1

Definition 2 (Convex Optimization (CP) Query) A model-based optimization query Q is a Convex Optimization (CP) query if (i) F(θ, A) is convex, (ii) gi(α, A), 1 ≤ i ≤ u (if not empty), are convex, and (iii) hj(β, A), 1 ≤ j ≤ v (if not empty), are linear or affine.

The conditions (i), (ii) and (iii) form a well known type of problem in the Operations Research literature: Convex Optimization problems. Notice that the definition of a CP query does not have any assumptions on the specific form of the objective function. The only assumptions are that the queries can be formulated into convex functions and that the feasible regions defined by the constraints of the queries are convex (e.g., polyhedron regions). Therefore, users can ask any form of queries with any coefficients as long as these assumptions hold. A function is convex if it is second differentiable and

its Hessian matrix is positive definite. More intuitively, if one travels in a straight line from inside a convex region to outside the region, it is not possible to re-enter the convex region. Therefore, the set of CP problems is a superset of all least squares problems, linear programming problems (LP) and convex quadratic programming problems (QP) [27].

2.3.2 Common Query Types Cast into Optimization Query Model

Although appearing to be restricted in the functions F, gi and hj, CP queries cover a wide variety of linear and non-linear queries, including NN queries, top-k queries, linear optimization queries, and angular similarity queries. The formulations of some common query types as CP optimization queries are listed and discussed below.

Euclidean Weighted Nearest Neighbor Queries - The Weighted Nearest Neighbor query asking the NN for a point (a1^0, a2^0, ..., an^0) can be considered as an optimization query asking for a point (a1, a2, ..., an) such that an objective function is minimized. For Euclidean WNN the objective function is

WNN(a1, a2, ..., an) = w1(a1 − a1^0)^2 + w2(a2 − a2^0)^2 + ... + wn(an − an^0)^2,

where w1, w2, ..., wn > 0 are the weights and can be different for each query. Traditional Nearest Neighbor queries are the subset of weighted nearest neighbor queries where the weights are all equal to 1.

Linear Optimization Queries - The objective function of a Linear Optimization query is

L(a1, a2, ..., an) = c1 a1 + c2 a2 + ... + cn an,

where (c1, c2, ..., cn) ∈ R^n are coefficients and can be different for different queries. Clearly, linear optimization queries are also special cases of CP queries with a parametric function form.

Top-k Queries - The objective functions of top-k queries are score (or ranking) functions over the set of attributes. The common score functions are sum and average, but a score function could be any arbitrary function over the attributes. Therefore, sum and average are special cases of linear optimization queries. If the ranking function in question is convex, the top-k query is a CP query.

Range Queries - A range query asks for all data (a1, a2, ..., an) s.t. li ≤ ai ≤ ui, for some (or all) dimensions i, where [li, ui] is the range for dimension i. Range queries can be formed as special cases of CP queries with constant objective functions, applying the range limits as constraints. Construct an objective function as f(a1, a2, ..., an) = C, where C is any constant, with constraints gi = li − ai ≤ 0 and gn+i = ai − ui ≤ 0, for i = 1, 2, ..., n. Points that are not within the constraints are pruned by those constraints. Any points that are within the constraints get the objective value C, and because they all have the same maximum (minimum) objective value, they all form the solution set of the range query. Also note that our model can deal not only with the 'traditional' hyper-rectangular range queries as described above, but also 'irregular' range queries such as l ≤ a1 + a2 ≤ u and l ≤ 3a1 − 2a2 ≤ u.
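As a rough illustration of the casts above, a weighted NN query and a range query can both be handed to one generic evaluator that knows only the (objective, constraints, k) model. The helper below is a hypothetical brute-force sketch for clarity, not the I/O-optimal framework; all names and data are invented.

```python
# Illustrative sketch: one evaluator, two query types, only the model
# parameters change. Constraints use the g(a) <= 0 convention.

def answer(records, objective, constraints=(), k=1):
    feasible = [r for r in records if all(g(r) <= 0 for g in constraints)]
    return sorted(feasible, key=objective)[:k]

data = [(1.0, 5.0), (4.0, 4.0), (2.0, 9.0), (6.0, 1.0)]

# Weighted NN to (3, 3) with weights (1, 4): convex objective, no constraints.
wnn = answer(data, lambda a: 1 * (a[0] - 3) ** 2 + 4 * (a[1] - 3) ** 2)

# Range query 0 <= a1 <= 5 and 3 <= a2 <= 10: a constant objective C = 0,
# with each range limit rewritten as an inequality constraint g(a) <= 0;
# taking k as the data size returns the whole feasible set.
rng = answer(
    data,
    objective=lambda a: 0.0,
    constraints=(lambda a: 0 - a[0], lambda a: a[0] - 5,
                 lambda a: 3 - a[1], lambda a: a[1] - 10),
    k=len(data),
)
print(wnn, rng)
```

The point of the chapter is that the same (objective, constraints, k) interface can instead drive an index traversal that visits only the most promising partitions, rather than the full scan used here.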

2.3.3 Optimization Query Applications

Any problem that can be rewritten as a convex optimization problem can be cast to our model and solved using our framework. In cases where the optimization problem or objective function parameters are not known in advance, this generic framework can be used to solve the problem by efficiently accessing available access structures. We provide a short list of potential optimization query applications that could be relevant during scientific data exploration or business decision support where the problem being optimized is continuously evolving.

Weighted Constrained Nearest Neighbors - Weighted nearest neighbor queries give the ability to assign different weights to different attribute distances for nearest neighbor queries. Weighted Constrained Nearest Neighbors (WCNN) find the closest weighted-distance object that exists within some constrained space. This type of query would be applicable in situations where attribute distances vary in importance and some constraints on the result set are known. A potential application is clinical trial patient matching, where an administrator is looking to find the best candidate patients to fill a clinical trial. In such a case, some hard exclusion criteria are known, and some attributes are more important than others with respect to estimating participation probability and suitability.

kNN with Adaptive Scoring - In some situations we may have an objective function that changes based on the characteristics of a population set we are trying to fill. In such cases, we will change the nearest neighbor scoring weights dependent on the current population set. As an example, again consider the patient clinical trial matching application, where we want to maintain a certain statistical distribution over the trial participants. In a case where we want half of the participants to be male and

half female, we can adjust the weights of the objective optimization function to increase the likelihood that future trial candidates will match the currently underrepresented gender.

Queries over Changing Attributes - The attributes involved in optimization queries can vary based on the iteration of the query. For example, a series of optimization queries may search a stock database for maximum stock gains over different time intervals. The attributes involved in each query will be different.

In the examples listed above, each query or each member of a set of queries can be rewritten as an optimization query in our model. Weights, constraints, functional attributes, and the optimization functions themselves can all change on a per-query basis. This demonstrates the power and flexibility that the user has to define data exploration queries, and the examples represent a small subset of the query types that are possible. A database system that can effectively handle the potential variations in optimization queries will benefit data exploration tasks.

2.3.4 Approach Overview

We propose the use of Convex Optimization (CP) in order to traverse access structures to find optimal data objects. The goal in CP is to optimize (minimize or maximize) some convex function over data attributes. The solution can optionally be subject to some convex criteria over the data attributes. Additionally, the data attributes may be subject to lower and upper bounds. All of the discussed applications can be stated as an objective function with an objective of maximization or minimization. We can solve a continuous CP problem over a convex partition of space by optimizing the function of interest within the

constraints of that space. Given the function, problem constraints, and partition constraints, we can use CP to find the optimal answer for the function within the intersection of the problem constraints and partition constraints. Given a function, problem constraints, and a set of partitions, the partition that yields the best CP result with respect to the function under analysis is the one that contains some space that optimizes the problem under analysis better than the other partitions. These partition functional values can be used to order partitions according to how promising they are with respect to optimizing the function within the problem constraints.

With our technique we endeavor to minimize the I/O operations and access structure search computations required to perform arbitrary model-based optimization queries. The access structure partitions to search and evaluate during query processing are determined by solving CP problems that incorporate the objective function, problem constraints, and partition constraints. The solution of these CP problems allows us to prune partitions and search paths that can not contain an optimal answer during query processing. If no solution exists (the problem constraints and partition constraints do not intersect), the partition is found to be infeasible for the problem, and the partition and its children can be eliminated from further consideration.

Most access structures are built over convex partitions. Particularly common partitions are Minimum Bounding Rectangles (MBRs). Rectangular partitions can be represented by lower and upper bounds over data attributes and can be directly addressed by the CP problem bounds. Other convex partitions can be addressed with data attribute constraints (such as x + y < 5). If the problem under analysis has constraints in addition to the constraints imposed by the partition boundaries, they can also be added to the CP problem as constraints.
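For the common case of an MBR partition and a weighted squared-distance objective, the per-partition CP problem happens to have a closed-form solution obtained by clamping the query point to the box. A general convex objective would require an actual CP solver, so the helper below (our own name) is only an illustrative special case:

```python
def partition_lower_bound(query, weights, box_low, box_high):
    """Best objective value achievable by any point inside a rectangular
    partition, for the weighted squared-distance objective: clamp each
    query coordinate to the box, the closed-form CP solution for this
    particular convex objective."""
    bound = 0.0
    for q, w, lo, hi in zip(query, weights, box_low, box_high):
        nearest = min(max(q, lo), hi)  # closest feasible coordinate
        bound += w * (nearest - q) ** 2
    return bound

# a partition strictly to the upper-right of the query point:
assert partition_lower_bound((0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (5.0, 5.0)) == 27.0
# a query point inside the partition can be matched exactly:
assert partition_lower_bound((4.0, 4.0), (1.0, 1.0), (3.0, 3.0), (5.0, 5.0)) == 0.0
```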

The CP literature discusses the efficiency of CP problem solution. In [42], it is stated that a CP problem can be very efficiently solved with polynomial-time algorithms. It can be solved in polynomial time in the number of dimensions of the objective function and the number of constraints ([43]). The readers can refer to [43] for a detailed review of interior-point polynomial algorithms for CP problems. Current implementations of CP algorithms can solve problems with hundreds of variables and constraints in very short times on a personal computer [38]. CP problems can be so efficiently solved that Boyd and Vandenberghe stated “if you formulate a practical problem as a convex optimization problem, then you have solved the original problem.” ([10] page 8).

We focus on query processing techniques by presenting a technique for general CP objective functions, which can be applied to existing space or data partitioning-based indices without losing the capabilities of those indexing techniques to answer other types of queries. The basic idea of our technique is to divide the query processing into two phases: First, solve the CP problem without considering the database (i.e., in continuous space). This “relaxed” optimization problem can be solved efficiently in main memory using many possible algorithms from the convex optimization literature. Second, search through the index by pruning the impossible partitions and accessing the most promising pages. We are in effect replacing the bound computations from existing access structure distance optimization algorithms with a CP problem that covers the objective function and constraints. Although such computations can be more complex, we found very little time difference in the functions we examined. Details of our technique for different index types are provided in the following section.

2.4 Query Processing Framework

By solving CP problems associated with the problem constraints, partition constraints, and objective function, there exists a lower bound for each partition in the database that can be used to prune the data space before retrieving pages from disk. By solving CP subproblems using index partition boundaries as additional constraints, we find the best possible objective function value achievable by a point in the partition. Thus, we can ensure that we access pages in the order of how promising they are. The bounds can be defined for any type of convex partition (such as minimum bounding regions) used for indexing the data. This framework can be applied to indexing structures in which the partitions on which they are built are convex, because the partition boundaries themselves are used to form the new CP problems.

The generic query processing framework for a CP query is described below. Figure 2.2 is a functional block diagram of the optimization query processing system that shows the interactions between components. Query processing proceeds first by the user providing an objective function and constraints to the system. A lightweight query compiler validates the user input, converts the function and constraints into the appropriate format for the query processor, and executes the appropriate query processing algorithm using the specified access structure. The query processor invokes a CP solver to find bounds for partitions that could potentially contain answers to the query. The user gets back those data object(s) in the database that answer the query within the constraints.

A popular taxonomy of access structures divides the structures into the categories of hierarchical and non-hierarchical structures. Hierarchical structures are usually

partitioned based on data objects, while non-hierarchical structures are usually based on partitioning space. These two families have important differences that affect how queries are processed over them. Therefore, we provide algorithms for hierarchical structures in Subsection 2.4.1 and for non-hierarchical structures in Subsection 2.4.2. The overall idea is to traverse the access structure and solve the problem in sequential order of how good partitions and search paths could be. We terminate when we come to a partition that could not contain an optimal answer to the query. The implementations described address a minimization objective function and prune based on comparison to minimum values. Solutions to maximization problems can be performed by using maximums (i.e., “worst” or upper bounds) in place of minimums.

Figure 2.2: Optimization Query System Functional Block Diagram
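For maximization problems, the "worst" (upper) bound can likewise be computed in closed form for the special case of a box partition and a weighted squared-distance objective, by taking, per coordinate, whichever box edge lies farther from the query point. As before, this closed form is specific to that objective and the helper name is ours:

```python
def partition_upper_bound(query, weights, box_low, box_high):
    """Maximum weighted squared-distance value attainable inside a box:
    each coordinate takes the box edge farther from the query point."""
    total = 0.0
    for q, w, lo, hi in zip(query, weights, box_low, box_high):
        farthest = lo if abs(q - lo) >= abs(q - hi) else hi
        total += w * (farthest - q) ** 2
    return total

# farthest corner of the box (3,3)-(5,5) from the origin is (5,5):
assert partition_upper_bound((0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (5.0, 5.0)) == 50.0
```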

Now, we describe the implementations of this framework based on hierarchical and non-hierarchical indexing techniques. In Section 2.5 we show that both of these algorithms are in fact optimal, i.e., no unnecessary MBRs (or partitions) given an access structure are accessed during the query processing.

2.4.1 Query Processing over Hierarchical Access Structures

Algorithm 1 shows the query processing over hierarchical access structures. This algorithm is applicable to any hierarchical access structure where the partitions at each tree level are convex. A partial list of structures covered by this algorithm includes the R-tree, R*-tree, R+-tree, X-tree, SS-tree, SR-tree, B-tree, B+-tree, k-d-B-tree, LSD-tree, Extended k-d-Tree, BSP-tree, Quad-tree, Oct-tree, P-tree, Buddy-tree, GBD-tree, and SKD-tree [25, 9].

The number of results that are returned by the query processing is dependent on an input k. If the user desires only the single best result, k = 1. It can, however, be an arbitrary value in order to return the k best data objects. If all objects that meet the range constraints are desired, the value of k should be set to n, the number of data points. Only the data points within the constrained region will be returned, and only the points that are within the lowest-level partitions that intersect the constrained region will be examined during query processing. Therefore, for range queries, the query processing will be at least as I/O-efficient as standard range query processing over a given access structure.

The algorithm is analogous to existing optimal NN processing [31], except that partitions are traversed in order of promise with respect to an arbitrary convex function, rather than distance. Additionally, our algorithm also prunes out any partitions and search paths that do not fall within the problem constraints. The algorithm takes as input the convex optimization function of interest OF, the set of convex constraints C, and a desired result set size k. We solve a new CP problem for each partition that we traverse using the objective function, the original problem constraints, and the constraints imposed by the partition boundaries.

In lines 1-3, we initialize the structures that we maintain throughout a query. We keep a vector of the k best points found so far as well as a vector of the objective function values for these points. We keep a worklist of promising partitions along with the minimum objective value for each partition. We initialize the worklist to contain the tree root partition. We then process entries from the worklist until the worklist is empty. In line 5, we remove the first promising partition. If this partition is a leaf partition, then we access the points in the partition and compute their objective function values. If a point is not within the original problem constraints, then it will not have a feasible solution and will be discarded from further consideration. Points with objective function values lower than the objective function value of the k-th best point so far will cause the best point and best point objective function value vectors to be updated accordingly. If the partition under analysis is not a leaf, we find its child partitions. For each of these child partitions, we solve the CP problem corresponding to the objective function, original problem constraints, and partition constraints (line 11). This yields

the best possible objective function value achievable within the intersection of the original problem constraints and partition constraints.

CPQueryHierarchical (Objective Function OF, Constraints C, k)
notes:
b[i] - the i-th best point found so far
F(b[i]) - the objective function value of the i-th best point
L - worklist of (partition, minObjVal) pairs

1: b[1], ..., b[k] ← null
2: F(b[1]), ..., F(b[k]) ← +∞
3: L ← (root)
4: while L ≠ ∅
5:   remove first partition P from L
6:   if P is a leaf
7:     access and compute functions for points in P within constraints C
8:     update vectors b and F if better points discovered
9:   else
10:    for each child Pchild of P
11:      minObjVal = CPSolve(OF, Pchild, C)
12:      if minObjVal < F(b[k])
13:        insert Pchild into L according to minObjVal
14:    end for
15:  end if
16:  prune partitions from L with minObjVal > F(b[k])
17: end while
18: return vector b

Algorithm 1: I/O Optimal Query Processing for Hierarchical Index Structures.

If the minimum objective function value is less than the k-th best objective function value (i.e., it is possible for a point within the intersection of the problem constraints and partition constraints to

beat the k-th best point), then the partition is inserted into the worklist according to its minimum objective value. If there is no feasible solution for a partition within the problem constraints, the partition will be dropped from consideration. After a partition is processed, we may have updated our k-th best objective function value. We prune any partitions in the worklist that can not beat this value.

Query Processing Example

As an example of query processing, consider Figure 2.3 with an optimization objective of minimizing the distance to the query point within the constrained region (i.e., a constrained 1-NN query). We initialize our worklist with the root and start by processing it. Since it is not a leaf, we examine its child partitions A and B. We solve for the minimum objective function value for A and do not obtain a feasible solution because A and the constrained region do not intersect. We do not process A further. We find the minimum objective function value for partition B and get the distance from the query point to point c, where the constrained region and partition B intersect. We insert partition B into our worklist and, since it is the only partition in the worklist, we process it next. Since B is not a leaf, we examine its child partitions B.1, B.2, and B.3. We find minimum objective function values for each and obtain the distances from the query point to points d, e, and f respectively. We insert the partitions into the worklist accordingly. Since point e is closest, we process B.2 next, and since it is a leaf, we compute objective function values for the points in B.2. Point B.2.a is the closer of the two. This point is designated as the best point thus far, and the best point objective function value is updated. After processing B.2, we know we can do no worse than the distance from the query point to B.2.a. Since partition B.3 can not do better than this, we prune it from the

worklist. We examine the points in B.1. Point B.1.b is not a feasible solution (it is not in the constrained region), so it is dropped. Point B.1.a gets a value which is better than the current best point. We update the best point to be B.1.a. Since the worklist is now empty, we have completed the query and return the best point. The optimal point for this optimization query is B.1.a. We examine only points in partitions that could contain points as good as the best solution.

Figure 2.3: Example of Hierarchical Query Processing
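The traversal used in this example can be sketched in Python as follows. The Partition class and the closed-form box bound in the demo are our own simplifications; a full implementation would invoke a CP solver for general objectives and constraints, as in Algorithm 1:

```python
import heapq

class Partition:
    """Toy hierarchical partition: a box holding either child partitions
    (internal node) or data points (leaf)."""
    def __init__(self, low, high, children=(), points=()):
        self.low, self.high = low, high
        self.children, self.points = list(children), list(points)
        self.is_leaf = not self.children

def cp_query_hierarchical(root, objective, lower_bound, feasible, k):
    """Best-first traversal in the spirit of Algorithm 1: partitions are
    visited in order of their lower bound; a partition is pruned once its
    bound cannot beat the k-th best objective value found so far."""
    best = []               # max-heap of (-value, point), at most k entries
    kth = float("inf")      # F(b[k])
    worklist = [(lower_bound(root), id(root), root)]
    while worklist:
        bound_val, _, part = heapq.heappop(worklist)
        if len(best) == k and bound_val >= kth:
            break           # nothing left can contain a better answer
        if part.is_leaf:
            for p in part.points:
                if not feasible(p):
                    continue            # outside the problem constraints
                val = objective(p)
                if len(best) < k:
                    heapq.heappush(best, (-val, p))
                elif val < -best[0][0]:
                    heapq.heapreplace(best, (-val, p))
                if len(best) == k:
                    kth = -best[0][0]
        else:
            for child in part.children:
                b = lower_bound(child)
                if len(best) < k or b < kth:
                    heapq.heappush(worklist, (b, id(child), child))
    return sorted((-v, p) for v, p in best)

# demo: unconstrained 1-NN to the point (5, 4) over a two-leaf tree
leaf1 = Partition((0, 0), (2, 2), points=[(1, 1), (2, 2)])
leaf2 = Partition((5, 5), (9, 9), points=[(6, 6), (9, 9)])
root = Partition((0, 0), (9, 9), children=[leaf1, leaf2])
query = (5, 4)
objective = lambda p: sum((a - q) ** 2 for a, q in zip(p, query))
bound = lambda part: sum((min(max(q, lo), hi) - q) ** 2
                         for q, lo, hi in zip(query, part.low, part.high))
best = cp_query_hierarchical(root, objective, bound, lambda p: True, k=1)
# best == [(5, (6, 6))]; leaf1 (bound 13) is pruned without being accessed
```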

2.4.2 Query Processing over Non-hierarchical Access Structures

In this section, we show that the general framework for CP queries can also be implemented on top of non-hierarchical structures. Algorithm 2 can be applied to non-hierarchical access structures where the partitions are convex. A partial list of structures where the algorithm can be applied is Linear Hashing, Grid-File, EXCELL, VA-File, VA+-File, and the Pyramid Technique [25, 9, 53, 23]. This technique is similar to the technique identified in [53] for finding the NN for a non-hierarchical structure, but we use the objective function rather than a distance and also consider the original problem constraints to find lower bounds.

We use a two-stage approach, detailed in Algorithm 2, to determine which data objects optimize the query within the constraints. We take the objective function OF, the original problem constraints C, and the result set size k as inputs. We initialize the structures to be used during query processing (lines 1-3). In stage 1, we process each populated partition in the structure by solving the CP problem that corresponds to the original objective function and constraints combined with the additional constraints imposed by the partition boundaries (line 5). Those partitions that do not intersect the constraints will be bypassed. For those that do have a feasible solution, the lower bound of the CP problem is recorded. At the end of stage 1, we place the unique unpruned partitions in increasing sorted order based on their lower bounds. In stage 2, we process the unpruned partitions by accessing the objects contained by the first partition and computing actual objective values. We maintain a sorted list of the top-k objects we have found so far, and continue accessing points contained by subsequent partitions until a partition's lower bound is greater than the k-th best value. By accessing partitions

in order of how good they could be, according to the CP problem, we access only those points in cell representations that could contain points as good as the actual solution.

CPQueryNonHierarchical (Objective Function OF, Constraints C, k)
notes:
b[i] - the i-th best point found so far
F(b[i]) - the objective function value of the i-th best point
L - worklist of (partition, minObjVal) pairs

Initialize:
1: b[1], ..., b[k] ← null
2: F(b[1]), ..., F(b[k]) ← +∞
3: L ← ∅
Stage 1:
4: for each populated partition P
5:   minObjVal = CPSolve(OF, P, C)
6:   if minObjVal < +∞
7:     insert (P, minObjVal) into L according to minObjVal
Stage 2:
8: for i = 1 to |L|
9:   if minObjVal of partition L[i] > F(b[k])
10:    break
11:  access objects that partition L[i] contains
12:  for each object o
13:    if OF(o) < F(b[k])
14:      insert o into b
15:      update F(b)

Algorithm 2: I/O Optimal Query Processing over Non-hierarchical Structures.
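A compact sketch of the two-stage procedure, with partitions as plain dictionaries and a caller-supplied bound function standing in for the CP solver (all names are illustrative):

```python
def cp_query_non_hierarchical(partitions, objective, lower_bound, k):
    # Stage 1: compute a lower bound for each populated partition; partitions
    # whose boundaries do not intersect the problem constraints get +inf from
    # the solver and are bypassed.
    bounded = []
    for part in partitions:
        b = lower_bound(part)
        if b < float("inf"):
            bounded.append((b, part))
    bounded.sort(key=lambda t: t[0])
    # Stage 2: access objects in increasing bound order, stopping once a
    # partition's bound exceeds the k-th best objective value found so far.
    best = []  # ascending list of (value, object), at most k entries
    for b, part in bounded:
        if len(best) == k and b > best[-1][0]:
            break
        for obj in part["objects"]:
            val = objective(obj)
            if len(best) < k or val < best[-1][0]:
                best.append((val, obj))
                best.sort(key=lambda t: t[0])
                del best[k:]
    return best

# demo: two grid cells, unconstrained 1-NN to the point (5, 4)
partitions = [
    {"low": (0, 0), "high": (2, 2), "objects": [(1, 1), (2, 2)]},
    {"low": (5, 5), "high": (9, 9), "objects": [(6, 6)]},
]
query = (5, 4)
objective = lambda p: sum((a - q) ** 2 for a, q in zip(p, query))
bound = lambda part: sum((min(max(q, lo), hi) - q) ** 2
                         for q, lo, hi in zip(query, part["low"], part["high"]))
# cp_query_non_hierarchical(partitions, objective, bound, 1) == [(5, (6, 6))]
```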

2.4.3 Non-Covered Access Structures

A review of access structure surveys yielded few structures that are not covered by the algorithms. These include structures that either do not guarantee spatial locality (i.e., space-filling curves, which are used for approximate ordering) or are built over values computed for a specific function (such as distance in M-trees); since attribute information is lost, these structures can not be applied to general functions over the original attributes. They also include structures that contain non-convex regions (BANG-File, hB-tree, BV-tree) [25, 9]. Note that M-trees would still work to answer queries for the functions they are built over, and the structures built with non-convex leaves could still be optimally traversed down to the level of non-convexity.

2.5 I/O Optimality

In this section, we will show that the proposed query processing technique achieves optimal I/O for each implementation in Section 2.4. Following the traditional definition in [5], we define optimality as follows.

Definition 3 (Optimality) An algorithm for optimization queries is optimal iff it retrieves only the pages that intersect the intersection of the constrained region R and the optimal contour.

We denote the objective function contour which goes through the actual optimal data point in the database as the optimal contour. This contour also goes through all other points in continuous space that yield the same objective function value as the optimal point. The k-th optimal contour would pass through all points in continuous space that yield the same objective function value as the k-th best point in the database. In the following arguments, we use the optimal contour, but one could replace it with the k-th optimal contour.

For hierarchical tree-based query processing, the proof is given in the following. We will use minObjVal(Partition, R) to denote the minimum objective function value that is computed for the partition and constraints R, and a is an optimal data point in the database.

Lemma 1 The proposed query processing algorithm is I/O optimal for CP queries under convex hierarchical structures.

Proof. Figure 2.4 shows the different cases of the various position relations between a partition, R, and the optimal contour in 2 dimensions.

A: intersects neither R nor the optimal contour.
B: intersects R but not the optimal contour.
C: intersects the optimal contour but not R.
D: intersects the optimal contour and R, but not the intersection of the optimal contour and R.
E: intersects the intersection of the optimal contour and R.

It is easy to see that any partition intersecting the intersection of the optimal contour and the feasible region R will be accessed, since such partitions are not pruned in any phase of the algorithm. We need to prove that only these partitions are accessed; the other partitions will be pruned without accessing their pages.

Figure 2.4: Cases of MBRs with respect to R and optimal objective function contour

A and C will be pruned since they are infeasible regions; when the CP problems for these partitions are solved, they will be eliminated from consideration since they do not intersect the constrained region and minObjVal(A, R) = minObjVal(C, R) = +∞. B is pruned because minObjVal(B, R) > F(a), and E will be accessed earlier than B in the algorithm because minObjVal(E, R) < minObjVal(B, R).

We shall show that a page in case D is pruned by the algorithm. Let CP0 be the data partition that contains an optimal data point, CP1 be the partition that contains CP0, ..., and CPk be the root partition that contains CP0, CP1, ..., CPk−1. Because the objective function is convex, we have F(a) ≥ minObjVal(CP0, R) ≥ minObjVal(CP1, R) ≥ ... ≥ minObjVal(CPk, R)

and minObjVal(D, R) > F(a) ≥ minObjVal(CP0, R) ≥ minObjVal(CP1, R) ≥ ... ≥ minObjVal(CPk, R).

During the search process of our algorithm, CPk is replaced by CPk−1, CPk−1 by CPk−2, and so on, until CP0 is accessed. If D will be accessed at some point during the search, then D must be the first in the MBR list at some time. This can only occur after CP0 has been accessed, because minObjVal(D, R) is larger than the constrained function value of any partition containing the optimal data point. However, if CP0 is accessed earlier than D, the algorithm prunes all partitions which have a constrained function value larger than F(a), including D. This contradicts the assumption that D will be accessed.

The I/O-optimality relates to accessing data from leaf partitions. The proof, however, applies to any partition, whether it contains data or other partitions; that is, we will never retrieve the objects within an MBR (leaf or non-leaf) unless that MBR could contain an optimal answer. Therefore, the algorithm is also optimal in terms of the access structure traversal.

Similarly, we can prove the I/O optimality of the proposed algorithm over non-hierarchical structures.

Lemma 2 The proposed query processing algorithm is I/O optimal for CP queries over non-hierarchical convex structures.

Proof. Figure 2.5 shows the different possible cases of the partition positions. Partitions A, B, C, D, and E are the same as defined for Figure 2.4.

Figure 2.5: Cases of partitions with respect to R and optimal objective function contour

Without loss of generality, assume that only these five partitions contain some data points, and that partition E contains an optimal data point, a, for a query F(x, y). Then an optimal algorithm retrieves only partition E. Partitions C and A will not be retrieved because they are infeasible, and our algorithm will never retrieve infeasible partitions. Partition B will not be retrieved because partition E will be retrieved before partition B (because minObjVal(E, R) < minObjVal(B, R)), and the best data point found in E will prune partition B. Thus, we only need to show that partition D will not be retrieved.

Because partition D does not intersect the intersection of the optimal contour and R but E does, minObjVal(D, R) > minObjVal(E, R). Therefore, partition E must be retrieved before partition D, since the algorithm always retrieves the most promising partition first, and the data point a is found at that time. Suppose partition D is also retrieved later. This means partition D cannot be pruned by the data point a, i.e., F(a) ≥ minObjVal(D, R). Hence, we can find at least one virtual point x∗ (without considering the database) in partition D such that x∗ is both in R and the optimal contour; i.e., the virtual solution corresponding to minObjVal(D, R) is contained by the optimal contour (because of the convexity of the function F), which also means the virtual solution corresponding to minObjVal(D, R) is in D ∩ R ∩ the optimal contour. This contradicts the assumption that partition D does not intersect the intersection of R and the optimal contour. Therefore, partition D will not be retrieved.

2.6 Performance Evaluation

The goals of the experiments are to show i) that the proposed framework is general and can be used to process a variety of queries that optimize some objective, ii) that the generality of the framework does not impose significant CPU burden over those queries that already have an optimal solution, and iii) that non-relevant data subspace can be effectively pruned during query processing. We performed experiments for a variety of optimization query types, including kNN, Weighted kNN, Weighted Constrained kNN, and Constrained Weighted Linear Optimization. Where possible, we compare CPU times against a non-general optimization technique. We index and perform queries over four real data sets. We used a clinical trials patient data set and a stock price data set to perform real-world optimization queries. One of the other two is Color Histogram data, a 64-dimensional color image histogram dataset of 100,000

data points. The vectors represent color histograms computed from a commercial CD-ROM. The second is Satellite Image Texture (Landsat) data, which consists of 100,000 60-dimensional vectors representing texture feature vectors of Landsat images [40]. These datasets are widely used for performance evaluation of index structures and similarity search algorithms [39, 26].

Weighted kNN Queries

Figure 2.6 shows the performance of our query processing over R*-trees for the first 8 dimensions of the histogram data for both kNN and k weighted NN (k-WNN) queries. The index is built over the first 8 dimensions of the histogram data, and the queries are generated to find the kNN or k-WNN to a random point in the data space over the same 8 dimensions. We vary the value of k between 1 and 100, and randomly assign weights between 0 and 1 for the k-WNN queries. We execute 100 queries for each trial. We assume the index structure is in memory, and we measure the number of additional page accesses required to access points that could be kNN according to the query processing.

Because the weights are 1 for the kNN queries, the optimal contour is a hyper-sphere around the query point with a radius equal to the distance of the nearest point in the data set. For unconstrained, unweighted nearest neighbor queries, our algorithm matches the I/O-optimal access offered by traditional R-tree nearest neighbor branch-and-bound searching. If instead the queries are weighted nearest neighbor queries, the optimal contour is a hyper-ellipse. The figure shows that we achieve nearly the same I/O performance for weighted kNN queries using our query processing algorithm over the same index that is appropriate for traditional non-weighted kNN queries. Intuitively, as the weights of a weighted nearest neighbor query become more skewed,

Figure 2.6: Accesses, Random weighted kNN vs kNN, 8-D Histogram, R*-tree, 100k pts

the hyper-volume enclosed by the optimal contour increases, and the opportunity to intersect with access structure partitions increases. Results for the weighted queries track very closely to the non-weighted queries. This indicates that these weighted functions do not cause the optimal contour to cross significantly more access structure partitions than their non-weighted counterparts. For a weighted nearest neighbor query, we could generate an index based on attribute values transformed according to the weights and yield an optimal contour that is hyper-spherical. We could then traverse the access structure using traditional nearest neighbor techniques. However, this would require generating an index for every potential set of query weights, which is not practical for applications where weights are not typically known prior to the

query. For similar applications where query weights can vary and these weights are not known in advance, we would be better served to process the query over a single reasonable access structure in an I/O-optimal way than by building specific access structures for a specific set of weights. Figure 2.7 shows the average CPU time required per query to traverse the access structure using convex problem solutions for the weighted kNN queries, compared to the standard nearest neighbor traversal used for unweighted kNN queries.

Figure 2.7: CPU Time, Random weighted kNN vs kNN, 8-D Histogram, R*-tree, 100k pts
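The index-transformation alternative discussed above can be made concrete: scaling each attribute by the square root of its weight turns the weighted squared distance into an unweighted one, so a standard NN index over the scaled data would suffice, but only for that one fixed weight vector (a sketch with our own helper names):

```python
import math

def transform(point, weights):
    # scale attribute i by sqrt(w_i): weighted squared distance in the
    # original space equals unweighted squared distance in the scaled space
    return tuple(math.sqrt(w) * a for w, a in zip(weights, point))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

w = (4.0, 1.0)
p, q = (1.0, 2.0), (3.0, 5.0)
weighted = sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q))
# the two quantities agree: 4*(1-3)^2 + 1*(2-5)^2 = 16 + 9 = 25
assert abs(weighted - sq_dist(transform(p, w), transform(q, w))) < 1e-9
```

A new weight vector changes the scaling, and hence the index that would be needed, which is why a single index traversed I/O-optimally is the more practical choice here.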

The figure shows that the generic CP-driven solution takes slightly longer to perform than a specialized solution for nearest neighbor queries alone. Part of this difference is due to the slightly more complicated objective function (weighted versus unweighted Euclidean distance), part is due to the additional partitions that the optimal contour crosses, while the remaining part is due to incorporating data set constraints (unused in this example) into computing minimum partition function values.

Hierarchical data partitioning access structures are not effective for high-dimensional data sets. As data set dimensionality increases, the partitions overlap with each other more. As partitions overlap more, the probability that a point's objective function value can prune other partitions decreases. Therefore, the same characteristics that make high-dimensional R-tree type structures ineffective for traditional queries also make them ineffective for optimization queries. For this reason, we perform similar experiments for high-dimensional data using VA-files.

Figure 2.8 shows page access results for kNN and k-WNN queries over a 60-dimensional, 5-bit-per-attribute VA-file built over the Landsat dataset. The experiments assume the VA-file is in memory and measure the number of objects that need to be accessed to answer the query. Sequential scan would result in examining 100,000 objects. Similar to the R*-tree case, the results show that the performance of the algorithm is not significantly affected by re-weighting the objective function. If the dataspace were more uniformly populated, we would expect to see object accesses much closer to k. However, much of the data in this data set is clustered, and some vector representations are heavily

it can not overcome the inherent weaknesses of that structure. VA-File. It should be noted that while the framework results in optimal access of a given structure. hierarchical data structures are ineﬀective. we need to look at each object assigned the same vector representation. following the VA-ﬁle processing model so too must a CP problem be solved for each vector for general optimization queries. For example. and the framework can not overcome this. at high dimensionality. Random weighted kNN vs kNN.Figure 2. 100k pts populated. As distance computations need to be performed for each vector in traditional VA-ﬁle nearest neighbor algorithms. A consequence of VA-File access structures is that.8: Accesses. 60-D Sat. If a best answer comes from one of these vector representations. a computation needs to occur for every approximated object. without some other data reorganization. Times 38 .
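The VA-file processing model just described can be sketched in a few lines. This is an illustrative Python sketch rather than the dissertation's implementation: cell_of, lower_bound, and objective are hypothetical callables standing in for the vector approximation, the per-cell bound (a convex program in the actual framework), and the query function. A cell is visited only while its bound can still beat the current k-th best value, but every point sharing a visited approximation must be evaluated, which is the per-vector cost discussed above.

```python
import heapq

def vafile_knn(points, cell_of, lower_bound, objective, k):
    """Two-phase VA-file style search (sketch): order cell approximations
    by a lower bound on the objective, then evaluate every point sharing
    a promising approximation until no cell can improve the k best."""
    # Group points by their cell (the vector approximation).
    cells = {}
    for p in points:
        cells.setdefault(cell_of(p), []).append(p)
    # Phase 1: visit approximations in lower-bound order.
    order = sorted(cells, key=lower_bound)
    best = []      # max-heap of (-value, point) holding the k best so far
    accessed = 0   # points actually evaluated (the I/O proxy in the text)
    for c in order:
        # Prune: no point in this cell can beat the current k-th best.
        if len(best) == k and lower_bound(c) >= -best[0][0]:
            break
        for p in cells[c]:  # every point with this approximation is checked
            accessed += 1
            v = objective(p)
            if len(best) < k:
                heapq.heappush(best, (-v, p))
            elif v < -best[0][0]:
                heapq.heapreplace(best, (-v, p))
    return sorted(p for _, p in best), accessed
```

In a real VA-file the approximation is a bit string per attribute; here a simple integer cell id plays that role.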

It takes longer to perform each query, but the ratio of the general optimization query times to the traditional kNN CPU times is similar to the R*-tree case, about 1.006 (i.e., very little computational burden is added to allow query generality).

Constrained Weighted Linear Optimization Queries

We also explore Constrained Weighted-Linear Optimization queries. Figure 2.9 shows the number of page accesses required to answer randomly weighted LP minimization queries for the histogram data set using the 8-D R*-Tree access structure. We generated 100 queries, each in the form of a summation of weighted attributes over the 8 dimensions. Weights are uniformly randomly generated between the values of -10 and 10. We vary k, and we vary the constrained area region, fixed at the origin as an 8-D hypercube with the indicated value. When weights can be negative, the corner corresponding to the minimum constraint value for the positively weighted attributes and the maximum constraint value for the negatively weighted attributes will yield the optimal result within the constrained region (e.g., corner [0,1,0] would be an optimal minimization answer for a constrained unit cube and a linear function with a weight vector [+,-,+]).

Figure 2.9: Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree, 8-D Color Histogram, 100k pts

The graph shows interesting results. The number of page accesses is not directly related to the constraint size. For this data set, many points are clustered around the origin, leading to denser packing of access structure partitions near the origin, and the page accesses for this experiment reflect this fact. Because of the sparser density of partitions around the point that optimizes the query, we access fewer partitions when this particular problem is less constrained. The rate of increase in page accesses to accommodate different values of k does vary with the constraint size. The radius of the k-th optimal contour increases more in these less dense regions than in more clustered regions, leading to a faster increase in the number of page accesses required when k increases. For comparative purposes, the sequential scan of this data set is 834 pages. Note that the Onion technique is not feasible for this dataset dimensionality, and it, as well as the other competing techniques, is not appropriate for convex constrained areas. We should also note what happens when there are fewer than k optimal answers in the data set. This means that there are fewer than k objects in our constrained region. For our query processing, we only need to examine those points that are in partitions that intersect the constrained region. A technique that does not evaluate constraints prior to or during query processing will need to examine every point.

Queries with Real Constraints

The previous example showed results with synthetically generated constraints. We also wanted to explore constrained optimization problems for real applications. To test this, we performed a number of queries over a set of 17000 real patient records. We performed Weighted Constrained kNN experiments to emulate the patient matching data exploration task described in Section 2.3. We established constraints on age and gender and queried to find the 10 weighted nearest neighbors in the data set according to 7 measured blood analytes. Using the VA-file structure with 5 bits assigned per attribute and 100 randomly selected, randomly weighted queries, we achieved an average of 11.5 vector accesses to find the k = 10 results. This means that on average the number of vectors within the intersection of the constrained space and the 10th-optimal contour is 11.5. It corresponds to a vector access ratio of 0.0007.

Arbitrary Functions over Different Access Structures

In order to show that the framework can be used to find optimal data points for arbitrary convex functions, we generated a set of 100 random functions. Functions contain both quadratic and linear terms and were generated over 5 dimensions in the form sum_{i=1..5} (random() * x_i^2 - random() * x_i), where x_i is the data attribute value for the i-th dimension. A sample function, where coefficients are rounded to two decimal places, is to minimize 0.09x_1^2 - 0.91x_1 + 0.4x_2^2 - 0.76x_2 + 0.91x_3^2 - 0.28x_3 + 0.49x_4^2 - 0.55x_4 + 0.49x_5^2 - 0.51x_5. Unlike nearest neighbor and linear optimization type queries, the continuous optimal solution is not intuitively obvious by examining the function.
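The corner rule for box-constrained linear minimization noted earlier can be made concrete. This small sketch is not from the dissertation; it simply decouples the coordinates: take each attribute's lower constraint bound where its weight is positive and its upper bound where the weight is negative.

```python
def minimizing_corner(lo, hi, weights):
    """For min sum(w_i * x_i) over the box lo_i <= x_i <= hi_i, each
    coordinate decouples: pick the lower bound where the weight is
    non-negative and the upper bound where it is negative."""
    return [l if w >= 0 else h for l, h, w in zip(lo, hi, weights)]
```

For the unit cube and a weight vector with signs [+,-,+], minimizing_corner([0, 0, 0], [1, 1, 1], [2, -3, 1]) reproduces the [0,1,0] corner from the example above.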

We explored the results of our technique over 5 index structures. These include 3 hierarchical structures: the R-tree, the R*-tree, and the VAM-split bulk-loaded R*-tree. We also explored 2 non-hierarchical structures: the grid file and the VA-file. We added a new set of 50000 uniform random data points within the 5-dimension hypercube and built each of the index structures over this data set. For each of these index types, we modified code to include an optimization function that uses CP to traverse the index structure using the algorithm appropriate for the structure. We ran the set of random functions over each structure and measured the number of objects in the partitions that we retrieve. For the hierarchical structures these are leaf partitions, and for the non-hierarchical structures they are grid cells. For each structure we try to keep the number of partitions relatively close to the others: the number of partitions for the hierarchical structures is between 12000 and 13000, the grid file has 16000, and the VA-File has 32000.

Each index yielded the same correct optimal data point for each function, and these optimal answers were scattered throughout the data space. Figure 2.10 shows results over the 100 functions. It shows the minimum, maximum, and average object access ratio to guarantee discovery of the optimal answer over the set of functions. Each structure performed well for these optimization queries; the average ratio of the data points accessed is below 0.0025 for all the access structures. Even the worst performing structure over the worst case function still prunes over 99% of the search space. Results between hierarchical access structures are as expected. The R*-tree looks to avoid some of the partition overlaps produced by the R-tree insertion heuristics, and it shows better performance. The VAM-split R*-tree bulk loads the data into the index and can avoid more partition overlaps than an index built using dynamic insertion. As expected, this yields better performance than the dynamically constructed R*-tree. The VA-File has greater resolution than the grid file in this particular test, and therefore achieves better access ratios.

Figure 2.10: Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform Random, 50k Points

Incorporating Problem Constraints during Query Processing

The proposed framework processes queries in conjunction with any set of convex problem constraints. We compare the proposed framework to alternative methods for handling constraints in optimization query processing. For a constrained NN type query, we could process the query according to traditional NN techniques and, when we find a result, check it against the constraints, continuing until we find a point that meets the constraints, at which point we can terminate the search. As the problem becomes less constrained, a discovered NN is more likely to fall within the constraints. Alternatively, we could perform a query on the constraints themselves, and then perform distance computations on the result set and select the best one. Constrained kNN queries were performed over the 8-D R*-tree built for the Color Histogram data set. Figure 2.11 shows the number of pages accessed using our technique compared to these two alternatives as the constrained area varies.

Figure 2.11: Page Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram, 100k Points

In the figure, the R*-tree line shows the number of page accesses required when we use traditional NN techniques and then check whether each result meets the constraints. The Selectivity line is a lower bound on the number of pages that would be read to obtain the points within the constraints. The CP line shows the results for our framework, which processes the original problem constraints as part of the search process and does not further process any partitions that do not intersect the constraints. Clearly, as the problem becomes more constrained, fewer pages need to be read to find potential answers. We demonstrate better results with respect to page accesses except in cases where the constraints are very small (and we can prune much of the space by performing a query over the constraints first) or when the NN problem is not constrained (where our framework reduces to the traditional unconstrained NN problem).

2.7 Conclusions

In this chapter we present a general framework to model optimization queries. We present a query processing framework to answer optimization queries in which the optimization function is convex, the query constraints are convex, and the underlying access structure is made up of convex partitions. We provide specific implementations of the processing framework for hierarchical data partitioning and non-hierarchical space partitioning access structures. We prove the I/O optimality of these implementations. When there are problem constraints, we prune search paths that can not lead to feasible solutions. Our search drills down to the leaf partition that either contains the optimal answer within the problem constraints or yields the best potential answer within the constraints. We experimentally show that the framework can be used to answer a number of different types of optimization queries, including nearest neighbor queries, linear function optimization queries, and non-linear function optimization queries.
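The constrained traversal summarized above can be sketched generically. This is a hedged illustration, not the dissertation's code: min_value, intersects, children, leaf_points, objective, and feasible are hypothetical callables, and min_value(node) must lower-bound the objective over the node's region intersected with the constraints, which is what the convex program computes in the actual framework.

```python
import heapq

def constrained_best_first(root, min_value, intersects, children, leaf_points,
                           objective, feasible, k):
    """Best-first traversal (sketch): visit partitions in order of their
    constrained lower bound, prune partitions that miss the constraints,
    and stop when no partition can improve the k best feasible answers."""
    heap = [(min_value(root), 0, root)]
    tie = 1  # tie-breaker so nodes themselves are never compared
    best = []  # max-heap of (-value, point)
    while heap:
        bound, _, node = heapq.heappop(heap)
        if len(best) == k and bound >= -best[0][0]:
            break  # no remaining partition can hold a better answer
        kids = children(node)
        if kids:
            for c in kids:
                if intersects(c):  # constraint pruning during traversal
                    heapq.heappush(heap, (min_value(c), tie, c))
                    tie += 1
        else:
            for p in leaf_points(node):
                if not feasible(p):
                    continue
                v = objective(p)
                if len(best) < k:
                    heapq.heappush(best, (-v, p))
                elif v < -best[0][0]:
                    heapq.heapreplace(best, (-v, p))
    return sorted((-nv, p) for nv, p in best)
```

With an admissible bound, only partitions whose constrained minimum beats the k-th actual answer are ever opened, which mirrors the I/O-optimality argument.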

The ability to handle general optimization queries within a single framework comes at the price of increased computational complexity when traversing the access structure. We show a slight increase in CPU times in order to perform convex programming based traversal of access structures in comparison with equivalent non-weighted and unconstrained versions of the same problem. The proposed framework offers I/O-optimal access of whatever access structure is used for the query: we only access points in partitions that could potentially have better answers than the actual optimal database answers. This means that, given an access structure, we will only read points from partitions that have minimum objective function values lower than the k-th actual best answer. Since constraints are handled during the processing of partitions in our framework, we do not need to examine points that do not intersect with the constraints. This results in a significant reduction in required accesses compared to alternative methods in the cases where the inclusion of constraints makes these alternatives less effective.

Because the framework is built to best utilize the access structure, it captures the benefits as well as the flaws of the underlying structure. If the optimal contour and problem constraints can be isolated to a few leaf partitions, the access structure will yield nice results. However, if the optimal contour crosses many partitions, the performance will not be as good. Essentially, the framework demonstrates how well a particular access structure isolates the optimal contour within the problem constraints to access structure partitions. As such, the framework can be used to measure the page access performance associated with using different indexes and index types to answer certain classes of optimization queries. Database researchers and administrators can use this technique as a benchmarking tool to evaluate the performance of a wide range of index structures available in the literature, in order to determine which structures can most effectively answer the optimization query type.

To use this framework, one does not need to know the objective function, weights, or constraints in advance. The technique provides optimization of arbitrary convex functions and does not incur a significant penalty in order to provide this generality. The system does not need to compute a queried function for all data objects in order to find the optimal answers. This makes the framework appropriate for applications and domains where a number of different functions are being optimized, or where optimization is being performed over different constrained regions, and the exact query parameters are not known in advance.

CHAPTER 3

Online Index Recommendations for High-Dimensional Databases using Query Workloads

When solving problems, dig at the roots instead of just hacking at the leaves. - Anthony J. D'Angelo

In order to effectively prune search space during query resolution, access structures that are appropriate for the query must be available to the query processor at query run-time. Appropriate access structures should match well with query selection criteria and allow for a significant reduction in the data search space. Since query patterns can change over time, rendering statically determined access structures ineffective, a method for monitoring performance and recommending access structure changes is provided. This chapter describes a framework to determine a meaningful set of appropriate access structures for a set of queries, considering query cost savings estimated from both query attributes and data characteristics.

3.1 Introduction

An increasing number of database applications, such as business data warehouses and scientific data repositories, deal with high dimensional data sets. As the number of dimensions/attributes and overall size of data sets increase, it becomes essential to efficiently retrieve specific queried data from the database in order to effectively utilize the database. Indexing support is needed to effectively prune out significant portions of the data set that are not relevant for the queries. Multi-dimensional indexing, dimensionality reduction, and RDBMS index selection tools all could be applied to the problem. However, each of these potential solutions has inherent problems.

To illustrate these problems, consider a uniformly distributed data set of 1,000,000 data objects with several hundred attributes. Range queries are consistently executed over five of the attributes. The query selectivity over each attribute is 0.1, so the overall query selectivity is 1/10^5 (i.e., the answer set contains about 10 results). An ideal solution would allow us to read from disk only those pages that contain matching answers to the query. We could build a multi-dimensional index over the data set so that we can directly answer any query by only using the index. However, for high-dimensional data sets, the performance of multi-dimensional index structures is subject to Bellman's curse of dimensionality [4] and degrades rapidly as the number of dimensions increases. For the given example, such an index would perform much worse than sequential scan. Another possibility would be to build an index over each single dimension. The effectiveness of this approach is limited to the amount of search space that can be pruned by a single dimension (in the example, the search space would only be pruned to 100,000 objects).

For data-partitioning indexes such as the R-tree family of indexes, data is placed in a partition that contains the data point and could overlap with other partitions. To answer a query, all potentially matching search paths must be explored. As the dimensionality of the index increases, the overlaps between partitions increase, and at high enough dimensions the entire data space needs to be explored. For space-partitioning structures, where partitions do not overlap and data points are associated with the cells that contain them (e.g., grid files), the problem is the exponential explosion of the number of cells. A 100-dimensional index with only a single split per attribute results in 2^100 cells. The index structure can become intractably large just to identify the cells. This phenomenon is thoroughly described in [54].

Another possible solution would be to use some dimensionality reduction technique, index the reduced dimension data space, and transform the query in the same way that the data was transformed. However, the dimensionality reduction approaches are mostly based on data statistics and perform poorly especially when the data is not highly correlated. They also introduce a significant overhead in the processing of queries. Another possible solution is to apply feature selection to keep the most important attributes of the data according to some criteria and index the reduced dimensionality space. However, traditional feature selection techniques are based on selecting attributes that yield the best classification capabilities; they select attributes based on data statistics to support classification accuracy rather than focusing on query performance and workload in a database domain. Therefore, the selected features may offer little or no data pruning capability given the query attributes. Commercial RDBMSs have developed index recommendation systems to identify indexes that will work well for a given workload. These tools are optimized for the domains for which these systems are primarily employed and the indexes that the systems provide. They are targeted towards lower dimension transactional databases and do not produce results optimized for single high-dimensional tables.
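The arithmetic behind the motivating example above can be checked directly. This is a worked restatement of the numbers already given, not new results:

```python
n = 1_000_000   # data objects in the example
sel = 0.1       # selectivity of the range predicate on each attribute
dims = 5        # attributes constrained by the query

overall = sel ** dims             # combined selectivity: 1/10^5
expected_answers = n * overall    # about 10 matching objects
single_dim_candidates = n * sel   # one single-attribute index still leaves 100,000 candidates
```

The gap between 10 true answers and 100,000 single-dimension candidates is the pruning lost by indexing one attribute at a time.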

Our approach is based on the observation that in many high dimensional database applications, only a small subset of the overall data dimensions are popular for a majority of queries, and recurring patterns of queried dimensions occur. For example, Large Hadron Collider (LHC) experiments are expected to generate data with up to 500 attributes at the rate of 20 to 40 per second [45], which corresponds to 300TBs of data per year consisting of 100-500 million objects. The search criterion is expected to consist of 10 to 30 parameters. Another example is High Energy Physics (HEP) experiments [49], where sub-atomic particles are accelerated to nearly the speed of light, forcing their collision. Each such collision generates on the order of 1-10MBs of raw data. The queries are predominantly range queries and involve mostly around 5 dimensions out of a total of 200.

We address the high dimensional database indexing problem by selecting a set of lower dimensional indexes based on joint consideration of query patterns and data statistics. The new set of low dimensional indexes is designed to address a large portion of expected queries and allow effective pruning of the data space to answer those queries. This approach is analogous to dimensionality reduction or feature selection, with the novelty that the reduction is specifically designed for reducing query response times, rather than maintaining data energy as is the case for traditional approaches. Our reduction considers both data and access patterns, and results in multiple and potentially overlapping sets of dimensions, rather than a single set.

Query pattern evolution over time presents another challenging problem. Researchers have proposed workload based index recommendation techniques. However, their long term effectiveness is dependent on the stability of the query workload. Query access patterns may change over time, becoming completely dissimilar from the patterns on which the index set was originally determined. There are many common reasons why query patterns change. Pattern change could be the result of periodic time variation (e.g., different database uses at different times of the month or day), a change in the popularity of a search attribute (e.g., current events cause an increase in queries for certain search attributes), a change in the focus of user knowledge discovery (e.g., a researcher discovery spawns new query patterns), or simply random variation of query attributes. When the current query patterns are substantially different from the query patterns used to recommend the database indexes, the system performance will degrade drastically, since incoming queries do not benefit from the existing indexes. For this reason, the index set should evolve with the query patterns. To make this approach practical in the presence of query pattern change, we introduce a dynamic mechanism to detect when the access patterns have changed enough that either the introduction of a new index, the replacement of an existing index, or the construction of an entirely new index set is beneficial.

Because of the need to proactively monitor query patterns and query performance quickly, the index selection technique we have developed uses an abstract representation of the query workload and the data set that can be adjusted to yield faster analysis. We generate the abstract representation of the query workload by mining patterns in the workload. The query workload representation consists of the set of attribute sets that occur frequently over the entire query set; for each query, we consider those frequent attribute sets that have non-empty intersections with the attributes of the query. To estimate query cost, the data set is represented by a multi-dimensional histogram where each unique value represents an approximation of data and contains a count of the number of records that match that approximation. The size of the multi-dimensional histogram affects the accuracy of the cost estimates associated with using an index for a query. For each possible index for each query, the estimated cost of using that index for the query is computed. Initial index selection occurs by traversing the query workload representation and determining which frequently occurring attribute set results in the greatest benefit over the entire query set. This process is iterated until some indexing constraint is met or no further improvement is achieved by adding additional indexes.

In order to facilitate online index selection, we propose a control feedback system with two loops: a fine grain control loop and a coarse control loop. As new queries arrive, we monitor the ratio of potential performance to actual performance of the system in terms of cost and, based on the parameters set for the control feedback loops, we make major or minor changes to the recommended index set.

The contributions of this work can be summarized as follows:

1. Introduction of a flexible index selection technique designed for high-dimensional data sets that uses an abstract representation of the data set and query workload. The resolution of the abstract representation can be tuned to achieve either a high ratio of index-covered queries for static index selection or fast index selection to facilitate online index selection. The number of potential indexes considered is affected by adjusting the data mining support level.

2. Introduction of a technique using control feedback to monitor when online query access patterns change and to recommend index set changes for high-dimensional data sets. Analysis speed and granularity is affected by tuning the resolution of the abstract representations.

3. Presentation of a novel data quantization technique optimized for query workloads.

4. Experimental analysis showing the effects of varying abstract representation parameters on static and online index selection performance, and showing the effects of varying control feedback parameters on change detection response.

The rest of the chapter is organized as follows. Section 3.2 presents the related work in this area. Section 3.3 explains our proposed index selection and control feedback framework. Section 3.4 presents the empirical analysis. We conclude in Section 3.5.

3.2 Related Work

High-dimensional indexing, feature selection, and DBMS index selection tools are possible alternatives for addressing the problem of answering queries over a subspace of high-dimensional data sets. As described below, each of these methods provides a less than ideal solution for the problem of fast high-dimensional data access. Our work differs from the related index selection work in that we provide an index selection framework that can be tuned for speed or accuracy. It takes into consideration both data and query characteristics and can be applied to perform real-time index recommendations for evolving query patterns. Our technique is also optimized to take advantage of the multi-dimensional pruning offered by multi-dimensional index structures.

3.2.1 High-Dimensional Indexing

A number of techniques have been introduced to address the high-dimensional indexing problem, such as the X-tree [6] and the GC-tree [28]. While these index structures have been shown to increase the range of effective dimensionality, they still suffer performance degradation at higher index dimensionality.

3.2.2 Feature Selection

Feature selection techniques [7, 36, 29] are a subset of dimensionality reduction targeted at finding a set of untransformed attributes that best represent the overall data set. These techniques are focused on maximizing data energy or classification accuracy rather than query response. As a result, the selected features may have no overlap with the queried attributes.

3.2.3 Index Selection

The index selection problem has been identified as a variation of the Knapsack Problem, and several papers proposed designs for index recommendations [34, 55, 24, 17, 13] based on optimization rules. These earlier designs could not take advantage of modern database systems' query optimizers. Currently, almost every commercial Relational Database Management System (RDBMS) provides the users with an index recommendation tool based on a query workload, using the query optimizer to obtain cost estimates. A query workload is a set of SQL data manipulation statements. The query workload should be a good representative of the types of queries an application supports.

Microsoft SQL Server's AutoAdmin tool [15, 16, 1] selects a set of indexes for use with a specific data set given a query workload. In the AutoAdmin algorithm, first, single dimension candidate indexes are chosen. Then a candidate index selection step evaluates the queries in a given query workload and eliminates from consideration those candidate indexes which would provide no useful benefit. Remaining candidate indexes are evaluated in terms of estimated performance improvement and index cost, and an iterative process is utilized to find an optimal configuration. The process is iterated for increasingly wider multi-column indexes until a maximum index width threshold is reached or an iteration yields no improvement in performance over the last iteration. Costs are estimated using the query optimizer, which is limited to considering those physical designs offered by the DBMS; the cost function computed by the query optimizer is used to calculate the benefit of using an index. In the case of SQL Server, single and multi-level B+-trees are evaluated. These index structures can not achieve the same level of result pruning that can be offered by an index technique that indexes multiple dimensions simultaneously (such as an R-tree or grid file). As a result, the indexes suggested by the tool often do not capture the query performance that could be achieved for multidimensional queries.

IBM has developed the DB2Adviser [52], which recommends indexes with a method similar to AutoAdmin, with the difference that only one call to the query optimizer is needed, since the enumeration algorithm is inside the optimizer itself. For DB2, the SQL statements are classified by their execution plan and their weights are accumulated. MAESTRO (METU Automated indEx Selection Tool) [19] was developed on top of Oracle's DBMS to assist the database administrator in designing a complete set of primary and secondary indexes by considering the index maintenance costs, based on the valid SQL statements and their usage statistics automatically derived using the SQL Trace Facility during a regular database session.

These commercial index selection tools are coupled to the physical design options provided by their respective query optimizers and, therefore, do not reflect the pruning that could be achieved by indexing multiple dimensions together.

3.2.4 Automatic Index Selection

The idea of having a database that can tune itself by automatically creating new indexes as the queries arrive has been proposed [35, 18]. In [35] a cost model is used to identify beneficial indexes and decide when to create or drop an index at runtime. [18] proposes an agent-based database architecture to deal with automatic index creation. Microsoft Research has proposed a physical design alerter [11] to identify when a modification to the physical design could result in improved performance.

3.3 Approach

3.3.1 Problem Statement

In this section we define the problem of index selection for a multi-dimensional space using a query workload. A query workload W consists of a set of queries that select objects within a specified subspace in the data domain. More formally, we define a workload as follows:

Definition 4 A workload W is a tuple W = (D, DS, Q), where D is the domain, DS ⊆ D is a finite subset (the data set), and Q (the query set) is a set of subsets of DS.

In our case, the domain is R^d, where d is the dimensionality of the dataset, DS is a set of n tuples t = (a_1, a_2, ..., a_d) with a_i ∈ R, and the query set Q is a set of range queries.
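Definition 4 can be rendered as lightweight types. This encoding is purely illustrative (the class and attribute names are not from the dissertation):

```python
from dataclasses import dataclass, field

Point = tuple  # a d-dimensional tuple (a_1, ..., a_d)

@dataclass
class RangeQuery:
    """A range query: per-attribute (low, high) bounds over a subset of attributes."""
    bounds: dict  # attribute index -> (low, high)

    def matches(self, t: Point) -> bool:
        return all(lo <= t[i] <= hi for i, (lo, hi) in self.bounds.items())

@dataclass
class Workload:
    d: int         # dimensionality of the domain R^d
    dataset: list  # DS: finite list of d-dimensional tuples
    queries: list = field(default_factory=list)  # Q: range queries over DS

    def answer(self, q: RangeQuery) -> list:
        # Without an index this is exactly the sequential scan discussed next.
        return [t for t in self.dataset if q.matches(t)]
```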

the cost of answering the query can be lower. a query workload W .e. attribute order has no impact on the amount of pruning possible. reduces to scanning all the points in the dataset and testing whether the query conditions are met. from which the queries 58 . In the case that an index is present. In the context of this problem an index is considered to be the set of attributes that can be used to prune subspace simultaneously with respect to each attribute. that provides the best estimated cost over W . The degree to which an index prunes the potential answer set for a query determines its eﬀectiveness for the query. an optional analysis time constraint ta . Therefore. when no index is present. Our problem can be deﬁned as ﬁnding a set of indexes I. we are aiming to recommend the set of indexes that yields the lowest estimated cost for every query in a workload for any query that can beneﬁt from an index. given a multi-dimensional dataset DS. either as the minimum number of indexes or a time constraint. an optional indexing constraint C. For static index selection.Finding the answers to a query. i. when no constraints are speciﬁed. we want to recommend a set of indexes within the constraint. the time to retrieve the data pages from disk. In this scenario we can deﬁne the cost of answering the query as the time it takes to scan the dataset. The assumption is that the time spend doing I/O dominates the time required to perform the simple bound comparisons. The overall goal of this work is to develop a ﬂexible index selection framework that can be tuned to achieve eﬀective static and online index selection for highdimensionsal data under diﬀerent analysis constraints. In the case when a constraint is speciﬁed. The index can identify a smaller set potential matching objects and only those data pages containing these objects need to be retrieved from disk.

For online index selection, we are striving to develop a system that can recommend an evolving set of indexes for incoming queries over time, such that the benefit of index set changes outweighs the cost of making those changes. We want an online index selection system that differentiates between low-cost and higher-cost index set changes and that can make decisions about index set changes based on different cost-benefit thresholds. When there is a time constraint, we need to automatically adjust the analysis parameters to increase the speed of analysis.

3.3.2 Approach Overview

In order to measure the benefit of using a potential index over a set of queries, we need to be able to estimate the cost of executing the queries with and without the index. Typically, a cost model is embedded into the query optimizer to decide on the query plan, i.e., whether the query should be answered using a sequential scan or using an existing index. Instead of using the query optimizer to estimate query cost, we conservatively estimate the number of matches associated with using a given index by means of a multi-dimensional histogram, an abstract representation of the dataset. The histogram captures data correlations between only those attributes that could be represented in a selected index. Increasing the size of the multi-dimensional histogram enhances the accuracy of the estimate at the cost of abstract representation size. The cost associated with an index is calculated based on the number of estimated matches derived from the histogram and the dimensionality of the index.

While maintaining the original query information for later use in determining estimated query cost, we apply one abstraction to the query workload to convert each query into the set of attributes referenced in the query.

We perform frequent itemset mining over this abstraction and only consider those sets of attributes that meet a certain support to be potential indexes. By varying the support, we affect the speed of index selection and the ratio of queries that are covered by potential indexes. We further prune the analysis space using association rule mining, eliminating those subsets above a certain confidence threshold. Lowering the confidence threshold improves analysis time by eliminating some lower dimensional indexes from consideration, but can result in recommending indexes that cover a strict superset of the queried attributes.

Our technique differs from existing tools in the method we use to determine the potential set of indexes to evaluate and in the quantization-based technique we use to estimate query costs. All of the commercial index wizards work at design time. The Database Administrator (DBA) has to decide when to run the wizard and over which workload. The assumption is that the workload is going to remain static over time; in case it does change, the DBA would collect the new workload and run the wizard again. The flexibility afforded by the abstract representation we use allows it to be applied both to infrequent index selection considering a broader analysis space and to frequent, online index selection. In the following two sections we present our proposed solution for index selection, which is used for static index selection and as a building block for online index selection.
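The query-to-transaction abstraction and the support-based mining of candidate index attribute sets might be sketched as follows (a brute-force enumeration standing in for apriori; all names are illustrative):

```python
from itertools import combinations

def potential_indexes(query_attr_sets, support):
    """Treat each query's referenced-attribute set as a transaction and
    keep every attribute combination whose support ratio exceeds the
    threshold. Apriori-style candidate pruning is omitted for brevity."""
    n = len(query_attr_sets)
    candidates = set()
    for qs in query_attr_sets:
        for k in range(1, len(qs) + 1):
            candidates.update(combinations(sorted(qs), k))
    result = set()
    for cand in candidates:
        count = sum(1 for qs in query_attr_sets if set(cand) <= qs)
        if count / n > support:        # S_A > support, per the definition below
            result.add(cand)
    return result

queries = [{1, 2}, {1, 2}, {3}, {1, 2, 3}, {4}]
# {1,2} occurs in 3 of 5 queries, so (1,), (2,), and (1, 2) all exceed 50% support
P = potential_indexes(queries, support=0.5)
```

For large per-query attribute sets, this exhaustive enumeration would be replaced by FP-Tree or CLOSET style mining, as discussed later in the section.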

3.3.3 Proposed Solution for Index Selection

The goal of index selection is to minimize the cost of the queries in the workload given certain constraints. Given a query workload, a dataset, the indexing constraints, and several analysis parameters, our framework produces a set of suggested indexes as its output. Figure 3.1 shows a flow diagram of the index selection framework. We identify three major components in the framework: the initialization of the abstract representations, the query cost computation, and the index selection loop. In the following subsections we describe these components and the dataflow between them. Table 3.1 provides a list of the notations used in the descriptions.

Figure 3.1: Index Selection Flowchart

Initialize Abstract Representations

The initialization step uses the query workload and the dataset to produce a set of Potential Indexes P, a Query Set Q, and a Multi-dimensional Histogram H, according to the support, confidence, and histogram size specified by the user. The description of these outputs and how they are generated is given below.

• Potential Index Set P

Symbol - Description
P - Potential Set of Indexes: the set of attribute sets under consideration as suggested indexes
Q - Query Set: a representation of the query workload, the query ranges, and the estimated query costs; for each query, the attribute sets in P that intersect with the query attributes
H - Multi-Dimensional Histogram: a workload-optimized abstract representation of the data set
S - Suggested Indexes: the set of attribute sets currently selected as recommended indexes
i - the attribute set currently under analysis
support - the minimum ratio of occurrence of an attribute set in the query workload for it to be included in P
confidence - the maximum ratio of the occurrence of an attribute set to the occurrence of one of its subsets before the subset is pruned from P
histogram size - the number of bits used to represent a quantized data object

Table 3.1: Index Selection Notation List

The potential index set P is a collection of attribute sets that could be beneficial as indexes for the queries in the input query workload. This set is computed using traditional data mining techniques. Considering the attributes involved in each query from the input query workload to be a single transaction, P consists of the sets of attributes that occur together in a query at a ratio greater than the input support. Formally, the support of a set of attributes A is defined as

    S_A = ( sum for i = 1..n of [1 if A ⊆ Qi, 0 otherwise] ) / n

where Qi is the set of attributes in the ith query and n is the number of queries. For instance, if the input support is 10% and attributes 1 and 2 are queried together in greater than 10 percent of the queries, then a representation of the set

of attributes {1,2} will be included as a potential index. Note that because any subset of an attribute set that meets the support requirement will also necessarily meet the support, all subsets of attribute sets meeting the support will also be included as potential indexes (in the example above, both the sets {1} and {2} will be included). As the input support is decreased, the number of potential indexes increases. Note that our particular system is built independently of a query optimizer, but the sets of attributes appearing in the predicates from a query optimizer log could just as easily be substituted for the query workload in this step.

In order to enhance analysis speed with limited effect on accuracy, the input confidence is used to prune the analysis space. While mining the frequent attribute sets in the query workload to determine P, we also maintain the association rules for disjoint subsets and compute the confidence of these association rules. The confidence of an association rule is defined as the ratio at which the antecedent (the left-hand side of the rule) and the consequent (the right-hand side of the rule) appear together in a query, given that the antecedent appears in the query. Confidence is thus the ratio of a set's occurrence to the occurrence of a subset. If a set occurs nearly as often as one of its subsets, an index built over the subset will likely not provide much benefit over the query workload if an index is built over the attributes of the full set; an index over the subset will only be more effective in pruning the data space for those queries that involve only the subset's attributes. Formally, the confidence of an association rule {set of attributes A} → {set of attributes B}, where A and B are disjoint, is defined as:

    C_{A→B} = ( sum for i = 1..n of [1 if (A ∪ B) ⊆ Qi, 0 otherwise] ) / ( sum for i = 1..n of [1 if A ⊆ Qi, 0 otherwise] )

where Qi is the set of attributes in the ith query and n is the number of queries. In our example, if every time attribute 1 appears, attribute 2 also appears, then the confidence of {1}→{2} = 1.0. If attribute 2 appears without attribute 1 as many times as it appears with attribute 1, then the confidence of {2}→{1} = 0.5. If we have set the confidence input to 0.6, then we will prune the attribute set {1} from P, but we will keep the attribute set {2}. The confidence threshold can take more than one value depending on set cardinality. Since the cost of including extra attributes that are not useful for pruning increases with indexed dimensionality, we can set the confidence level based on attribute set cardinality; for larger sets, we want to be more conservative with respect to pruning attribute subsets.

While the apriori algorithm was appropriate for the relatively small per-query attribute sets in our domain, a more efficient algorithm such as the FP-Tree [30] could be applied if the attribute sets associated with queries are too large for the apriori technique to be efficient. While it is desirable to avoid examining a high-dimensional attribute set as a potential index, another possible solution in the case where a large number of attributes are frequent together would be to partition a large closed frequent itemset into disjoint subsets for further examination. Techniques such as CLOSET [44] could be used to arrive at the initial closed frequent itemsets.
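The confidence-based pruning of subsumed attribute sets could be sketched as follows (an illustrative helper over the same query-transaction abstraction; it reproduces the {1} versus {2} example above under a 0.6 threshold):

```python
def prune_by_confidence(P, query_attr_sets, confidence):
    """Drop an attribute set A from P when some disjoint set B in P
    co-occurs with A at a ratio above the confidence threshold: the
    combined index then covers almost every query A alone would serve."""
    def count(attrs):
        return sum(1 for qs in query_attr_sets if attrs <= qs)
    kept = set(P)
    for a in P:
        for b in P:
            sa, sb = set(a), set(b)
            if sa and sb and not (sa & sb):              # disjoint A and B
                together, alone = count(sa | sb), count(sa)
                if alone and together / alone > confidence:
                    kept.discard(a)                      # rule A -> B is too strong
    return kept

queries = [{1, 2}, {1, 2}, {2}, {2}]
P = {(1,), (2,), (1, 2)}
# conf({1}->{2}) = 2/2 = 1.0 > 0.6, so (1,) is pruned;
# conf({2}->{1}) = 2/4 = 0.5, so (2,) is kept.
kept = prune_by_confidence(P, queries, confidence=0.6)
```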

• Query Set Q

The query set Q is the abstract representation of the query workload. It is initialized by associating with each query the potential indexes that could be beneficial for that query; these are the indexes in the potential index set P that share at least one common attribute with the query. At the end of this step, each query has an identified set of possible indexes.

• Multi-Dimensional Histogram H

An abstract representation of the data set is created in order to estimate the query cost associated with using each query's possible indexes to answer that query. This representation takes the form of a multi-dimensional histogram H. The input histogram size dictates the number of bits used to represent each unique bucket in the histogram. These bits are designated to represent only the single attributes that met the input support in the input query workload. If a single attribute does not meet the support, then it cannot be part of an attribute set appearing in P, and there is no reason to sacrifice data representation resolution for attributes that will not be evaluated. The number of bits that each represented attribute receives is proportional to the log of that attribute's support, which gives more resolution to those attributes that occur more frequently in the query workload. Data for an attribute that has been assigned b bits is divided into 2^b buckets. In order to handle data sets with uneven data distribution, we define the ranges of each bucket so that each bucket contains roughly the same number of points. A single bucket represents a unique bit representation across all the attributes represented in the histogram. The histogram is built by converting each record in the data set to its representation in bucket numbers. As we process data rows, we only aggregate the count of rows with each unique bucket representation, because we are just interested in estimating

query cost. Note that we do not maintain any entries in the histogram for bit representations that have no occurrences; as a result, we cannot have more histogram entries than records and will not suffer from an exponentially increasing number of potential buckets for high-dimensional histograms. Note also that the multi-dimensional histogram is based on a scalar quantizer designed on both data and access patterns, as opposed to data alone in the traditional case. Higher accuracy in representation is achieved by using more bits to quantize the attributes that are more frequently queried.

Table 3.2 shows a simple multi-dimensional histogram example. This histogram covers 3 attributes and uses 1 bit to quantize attributes 2 and 3, and 2 bits to quantize attribute 1, assuming it is queried more frequently than the other attributes. For attribute 1, values 1 and 2 quantize to 00, 3 and 4 quantize to 01, 5-7 quantize to 10, and 8 and 9 quantize to 11. For attributes 2 and 3, values from 1-5 quantize to 0 and values from 6-10 quantize to 1. The '.'s in the 'Value' column denote attribute boundaries (i.e., attribute 1 has 2 bits assigned to it).

Sample Dataset
A1  A2  A3  Encoding
 2   5   5  0000
 4   8   3  0110
 1   4   3  0000
 6   7   1  1010
 3   2   2  0100
 2   2   6  0001
 5   6   5  1010
 8   1   4  1100
 3   8   7  0111
 9   3   8  1101

Histogram
Value   Count
00.0.0  2
00.0.1  1
01.0.0  1
01.1.0  1
01.1.1  1
10.1.0  2
11.0.0  1
11.0.1  1

Table 3.2: Histogram Example
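A small sketch of the histogram construction of Table 3.2, with the bucket boundaries hard-coded from the example (the equi-depth boundary selection step is elided; function names are illustrative):

```python
from collections import Counter

def quantize_a1(v):
    """2-bit code for attribute 1, per the Table 3.2 boundaries."""
    if v <= 2: return "00"
    if v <= 4: return "01"
    if v <= 7: return "10"
    return "11"

def quantize_a23(v):
    """1-bit code for attributes 2 and 3: 1-5 -> 0, 6-10 -> 1."""
    return "0" if v <= 5 else "1"

def build_histogram(rows):
    """Count rows per unique bucket code; empty buckets are never stored,
    so the histogram can never hold more entries than there are records."""
    hist = Counter()
    for a1, a2, a3 in rows:
        hist[f"{quantize_a1(a1)}.{quantize_a23(a2)}.{quantize_a23(a3)}"] += 1
    return hist

data = [(2, 5, 5), (4, 8, 3), (1, 4, 3), (6, 7, 1), (3, 2, 2),
        (2, 2, 6), (5, 6, 5), (8, 1, 4), (3, 8, 7), (9, 3, 8)]
hist = build_histogram(data)
# hist["00.0.0"] == 2 and hist["10.1.0"] == 2, with 8 distinct buckets in total
```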

Query Cost Calculation

Once generated, the abstract representations of the query set Q and the multi-dimensional histogram H are used to estimate the cost of answering each query using all of the possible indexes for the query. For each query in the query set representation, we also keep a current cost field, which we initialize to the cost of performing the query using a sequential scan. At this point, we also initialize an empty set of suggested indexes S.

To estimate the query cost for a given query-index pair, we aggregate the number of matches we find in the multi-dimensional histogram, looking only at the attributes in the query that also occur in the index (bits associated with other attributes are considered to be don't-cares in the query matching logic). We then apply a cost function based on the number of matches we obtain using the index and the dimensionality of the index. At the end of this step, our abstract query set representation has estimated costs for each index that could improve the query cost.

• Cost Function

A cost function is used to estimate the cost associated with using a certain index for a query. The cost function can be varied to accurately reflect the cost model of the database system. For example, one could apply a cost function that amortizes the cost of loading an index over a certain number of queries, or use a function tailored to the type of index that is used. Many cost functions have been proposed over the years. For an R-Tree, which is the index type used for this work, the expected number of data page accesses is estimated in [22] by

    A_{nn,mm,FBF} = ( (1/Ceff)^(1/d) + 1 )^d

where d is the dimensionality of the dataset and Ceff is the number of data objects per disk page. However, this formula assumes that the number of points N approaches infinity, and it does not consider the effects of high dimensionality or correlations. A more recently proposed cost model is given in [8], where the expected number of page accesses is determined as

    A_{r,em,ui}(r) = ( 2r * (N/Ceff)^(1/d) + 1 - (1/Ceff)^(1/d) )^d

where r is the radius of the range query, d is the dataset dimensionality, N is the number of data objects, and Ceff is the capacity of a data page.

While these published cost estimates can be effective for estimating the number of page accesses associated with using a multi-dimensional index structure under certain conditions, they have characteristics that make them less than ideal for the given situation. Each of the cost estimate formulas requires a range radius; therefore the formulas break down when assessing the cost of a query that is an exact match in one or more of the query dimensions. These cost estimates also assume that the data distribution is independent between attributes and that the data is uniformly distributed throughout the data space. In order to overcome these limitations, we apply a cost estimate that is based on the actual matches that occur over the multi-dimensional histogram over the attributes that form a potential index. The cost model for R-trees we use in this work is given by d^(d/2) * m, where d is the dimensionality of the index and m is the number of matches returned for the query's matching attributes in the multi-dimensional histogram.
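The don't-care match counting over the histogram and the d^(d/2) * m cost model could be sketched as follows (bucket codes as in Table 3.2; the function names are illustrative):

```python
# Multi-dimensional histogram of Table 3.2: bucket code -> row count.
hist = {"00.0.0": 2, "00.0.1": 1, "01.0.0": 1, "01.1.0": 1,
        "01.1.1": 1, "10.1.0": 2, "11.0.0": 1, "11.0.1": 1}

def estimated_matches(hist, constraints):
    """Sum the counts of buckets that agree with the query on the
    attributes the index covers; every other attribute position is a
    don't-care, so its bits are ignored during matching."""
    total = 0
    for code, count in hist.items():
        parts = code.split(".")
        if all(parts[pos] == bits for pos, bits in constraints.items()):
            total += count
    return total

def index_cost(d, m):
    """Cost model used in this chapter: d**(d/2) * m, penalizing
    higher-dimensional indexes for overlap and traversal overhead."""
    return d ** (d / 2) * m

# A query constraining attribute 1 to bucket '10' and attribute 2 to '1',
# evaluated for a 2-dimensional index over those attributes:
m = estimated_matches(hist, {0: "10", 1: "1"})   # both '10.1.0' rows match
cost = index_cost(2, m)
```

Because whole buckets are counted, the estimate is conservative: it is never smaller than the true number of matching records.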

Using actual matches eliminates the need for a range radius, incorporates both the data correlation between attributes and the data distribution, and ties the cost estimate to the actual data characteristics (the published models, in contrast, produce results that depend only on the range radius for a given index structure). The cost estimate provided is conservative in that it will return a result that is at least as great as the actual number of matches in the database. By evaluating the number of matches over the set of attributes that match the query, the multi-dimensional subspace pruning that can be achieved using different index possibilities is taken into account.

A penalty is imposed on a potential index by the dimensionality term. There is additional cost associated with higher dimensionality indexes due to the greater number of overlaps of the hyperspaces within the index structure and the additional cost of traversing the higher dimensional structure. Given equal ability to prune the space, a lower dimensional index will translate into a lower cost. We used this particular cost model because the index type was appropriate for our data and query sets, and we assumed that we would retrieve data from disk for all query matches. The cost function could be made more complicated in order to model query costs with greater accuracy, for example by crediting complete attribute coverage for coverage queries. It could also reflect the appropriate index structures used in the database system, such as B+-trees.

Index Selection Loop

After initializing the index selection data structures and updating the estimated query costs for each potentially useful index for a query, we use a greedy algorithm that takes into account the indexes already selected to iteratively select indexes that would be

appropriate for the given query workload and data set. For each index in the potential index set P, we traverse the queries in the query set Q that could be improved by that index and accumulate the improvement associated with using that index for each query. The improvement for a given query-index pair is the difference between the cost of using the index and the query's current cost. If the index does not provide any positive benefit for a query, no improvement is accumulated. The potential index i that yields the highest improvement over the query set Q is considered to be the best index. Index i is removed from the potential index set P and added to the suggested index set S. For the queries that benefit from i, the current query cost is replaced by the improved cost.

After each i is selected, we prune the complexity of the abstract representations, when possible, in order to make the analysis more efficient. This includes actions such as eliminating potential indexes that do not provide better cost estimates than the current cost for any query, and pruning from consideration those queries whose best index is already a member of the set of suggested indexes.

At the end of each loop iteration, a check is made to determine whether the index selection loop should continue. If no potential index yields further improvement, or if the indexing constraints have been met, the loop exits. The input indexing constraint provides one of the loop stop criteria; it could be any constraint, such as the number of indexes, the total index size, or the total number of dimensions indexed. The set of suggested indexes S contains the results of the index selection algorithm. The overall speed of this algorithm is coupled with the number of potential indexes analyzed, so the analysis time can be reduced by increasing the support or decreasing the confidence.
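The greedy selection loop described above can be sketched as follows (illustrative; `costs[q][i]` holds the estimated cost of answering query q with index i, and `current[q]` starts at the sequential-scan cost):

```python
def greedy_select(costs, current, max_indexes=None):
    """Repeatedly pick the index with the largest total improvement over
    all queries, update the per-query current costs, and stop when no
    index yields a positive benefit or the indexing constraint is met."""
    suggested = []
    candidates = {i for per_query in costs.values() for i in per_query}
    while candidates and (max_indexes is None or len(suggested) < max_indexes):
        best, best_gain = None, 0
        for i in candidates:
            # Accumulate improvement only where the index beats current cost.
            gain = sum(max(current[q] - c[i], 0)
                       for q, c in costs.items() if i in c)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:                 # no positive improvement remains
            break
        suggested.append(best)
        candidates.discard(best)
        for q, c in costs.items():       # replace improved query costs
            if best in c:
                current[q] = min(current[q], c[best])
    return suggested

costs = {"q1": {("a",): 10, ("a", "b"): 4}, "q2": {("a", "b"): 5}}
current = {"q1": 100, "q2": 100}
picked = greedy_select(costs, current)   # ("a","b") wins; ("a",) adds nothing after
```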

we are able to maintain good performance as query patterns evolve.4 Proposed Solution for Online Index Selection The online index selection is motivated by the fact that query patterns can change over time. the input to the system is readjusted through a control feedback loop. we use control feedback to monitor the performance of the current set of indexes for incoming queries and determine when adjustments should be made to the index set. By monitoring the query workload and detecting when there is a change on the query pattern that generated the existing set of indexes. Our situation is analogous but more complex than the typical electrical circuit control feedback system in several ways: 1. The strategy provided assumes a indexing constraint based on the number of indexes and therefore uses the total beneﬁt derived from the index as the measure of index ’goodness’. The recommendation set can be pruned in order to avoid recommending an index that is non-useful in the context of the complete solution.3. such as an electrical signal. 3. However.Diﬀerent strategies can be used in selecting a best index. In a typical control feedback system. the output of a system is monitored and based on some function involving the input and output. Our system input is a set of indexes and a set of incoming queries rather than a simple input. Query performance is the obvious 71 . In our approach. then beneﬁt per index size unit may be a more appropriate measure. If the indexing constraint is based on total index size. this may result in recommending a lower-dimensioned index and later in the algorithm a higher-dimensioned index that always performs better. 2. The system output must be some parameter that we can measure and use to make decisions about changing the input.

parameter to monitor. However, our decision-making control function must necessarily be more complex than that of a basic control system, because lower query performance could be related to aspects other than the index set. For example, we may have a set of attributes that appears in queries frequently enough that our system indicates it is beneficial to create an index over those attributes, but there is no guarantee that those attributes will ever be queried again.

3. We do not have a predictable function relating system input and output, because of the non-determinism associated with new incoming queries.

Control feedback systems can also fail to be effective with respect to response time. The control system can be too slow to respond to changes, or it can respond too quickly. If the system is too slow, then it fails to cause the output to change based on input changes in a timely manner. If it responds too quickly, then the output overshoots the target and oscillates around the desired output before reaching it. Both situations are undesirable and should be designed out of the system.

Figure 3.2 represents our implementation of dynamic index selection. Our system input is a set of indexes and a set of incoming queries. Our system simulates and estimates costs for the execution of incoming queries. The system output is the ratio of potential system performance to actual system performance, in terms of the database page accesses needed to answer the most recent queries. We implement two control feedback loops, each with its own decision logic. One is for fine grain control and is used to recommend minor, inexpensive changes to the index set. The other is for coarse control and is used to avoid very poor system performance by recommending major index set changes.

Figure 3.2: Dynamic Index Analysis Framework

System Input

The system input is made up of new incoming queries and the current set of indexes I, which is initialized to be the suggested indexes S output by the initial index selection algorithm. The abstract representation of the last w queries is stored as W, where w is an adjustable window size parameter. This representation is similar to the one kept for the query set Q in static index selection. W is used to estimate the performance of a hypothetical set of indexes Inew against the current index set I. For clarity, a notation list for online index selection is included as Table 3.3.

System

The system simulates query execution over a number of incoming queries. When a new query q arrives, we determine which of the current indexes in I most efficiently answers

Symbol - Description
I - the current set of attribute sets used as indexes
Inew - a hypothetical set of attribute sets used as indexes
w - the window size
W - the abstract representation of the last w queries
q - the current query under analysis
P - Current Potential Indexes: the set of attribute sets in consideration to be indexes before q arrives
Pnew - New Potential Indexes: the set of attribute sets in consideration to be indexes after q arrives
iq - the attribute set estimated to be the best index for query q

Table 3.3: Notation List for Online Index Selection

this query, and we replace the oldest query in W with the abstract representation of q. The system also keeps track of the current potential indexes P and the current multi-dimensional histogram H. We also incrementally compute the attribute sets that meet the input support and confidence over the last w queries; consider the possible new indexes Pnew to be the set of attribute sets that currently meet the input support and confidence over the last w queries.

System Output

In order to monitor the performance of the system, we compare the query performance using the current set of indexes I to the performance using a hypothetical set of indexes Inew. The query performance using I is the summation of the costs of the queries using the best index from I for each given query. This information is used in the control feedback loop decision logic. The hypothetical cost is calculated differently based on the comparison of P and Pnew and the identified best index iq from P or Pnew for the new incoming query:

1. P = Pnew and iq is in I. In this case we bypass the control loops, since we could do no better for the system by changing the possible indexes.

2. P = Pnew and iq is not in I. We traverse the last w queries and determine those queries that could benefit from using a new index from Pnew. We compute the hypothetical cost of these queries to be the real number of matches from the database; the hypothetical cost for the other queries is the same as the real cost.

3. P ≠ Pnew and iq is in I. In this case we again bypass the control loops, since we could do no better for the system by changing the possible indexes.

4. P ≠ Pnew and iq is not in I. The hypothetical cost is the cost over the last w queries using a recomputed set of suggested indexes Inew.

The ratio of the hypothetical cost, which indicates potential performance, to the actual cost is used in the control loop decision logic.

Fine Grain Control Loop

The fine grain control loop is used to recommend low-cost, minor changes to the index set. This loop is entered in case 2 described above, when the ratio of hypothetical performance to actual performance is below some input minor change threshold. We recompute a new set of suggested indexes Inew over the last w queries; the indexes are then changed to Inew, and appropriate changes are made to update the system data structures. Increasing the input minor change threshold causes the frequency of minor changes to also increase.
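Collapsing the case analysis into its thresholds, the two-loop decision logic might be sketched as follows (the threshold values and function name are illustrative):

```python
def control_decision(hypothetical_cost, actual_cost,
                     minor_threshold=0.8, major_threshold=0.5):
    """Compare potential to actual performance over the last w queries.
    A ratio below the major threshold triggers a coarse change (full
    static re-selection); below the minor threshold, a fine change
    (a cheap index-set adjustment); otherwise no change is made."""
    if actual_cost == 0:
        return "no_change"
    ratio = hypothetical_cost / actual_cost
    if ratio < major_threshold:
        return "coarse_change"
    if ratio < minor_threshold:
        return "fine_change"
    return "no_change"

# A hypothetical cost far below the actual cost signals a major reorganization:
decision = control_decision(hypothetical_cost=30, actual_cost=100)
```

Raising either threshold makes the corresponding loop fire more often, matching the sensitivity behavior described in the text.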

Coarse Control Loop

The coarse control loop is used to recommend more costly changes, but ones with greater impact on the future performance of the index set. This loop is entered in case 4 described above, when the ratio of hypothetical performance to actual performance is below some input major change threshold. Static index selection is then performed over the last w queries: the abstract representations are recomputed, and a new set of suggested indexes Inew is generated. Appropriate changes are made to update the system data structures to the new situation. Increasing the input major change threshold increases the frequency of major changes.

3.3.5 System Enhancements

In the following subsections, we present two system enhancements that provide further robustness and scalability to our framework.

Self Stabilization

Control feedback systems can be either too slow or too responsive in reacting to input changes. In our application, a system that is slow to respond results in recommending useful indexes long after they first could have had a positive effect. It could also fail to recommend potentially useful indexes if thresholds are set so that the system is insensitive to change. A system that is too responsive can result in system instability, where the system continuously adds and drops indexes. The system performance and the rate of index change can be monitored and used to tune the control feedback system itself. If the actual number of query results is much lower than the estimated cost using the recommended index set over a window of queries, this indicates a non-responsive system. When this condition is

detected, the system can be made more responsive by cutting the window size, and we can re-perform the analysis applying a reduced support. This gives us a more complete P with respect to answering a greater portion of the queries and increases the probability that Pnew will be different from P. If the frequency of index change is too high with little or no improvement in query performance, an oversensitive system or an unstable query pattern is indicated. In this case, we can reduce the sensitivity by increasing the window size or by increasing the support level during recomputation of the new recommended indexes.

Partitioning the Algorithm

The system can also be applied to environments where the database is partitioned horizontally or vertically across different locations. A solution at one extreme is to maintain a centralized index recommendation system. This would maintain the abstract representation and collect global query patterns over the entire system, and a single set of indexes would be recommended for the entire system based on global trends. This approach would allow for the creation of indexes that maximize pruning over the global set of dimensions. However, it would not optimize query performance at the site level. At the other extreme would be to perform the index recommendation at each site. In this case, a different abstract representation would be built based on the data at each specific site, given the requests to the data at that site, and indexes would be recommended based on the query traffic to the data at that site. This allows tight control of the indexes to reflect the data at each site. However, index-based pruning based on attributes and data at multiple locations would not be possible.

This approach also requires traffic to each site that contains potentially matching data. A hybrid approach can provide the benefits of index recommendation based on global data and query patterns while optimizing the index set at the different requesting locations. A global set of indexes can be centrally recommended based on a global abstract representation of the data set with low support thresholds (the centralized location will store a larger set of globally good indexes). The query patterns emanating from each site can then be mined in order to find which of those global indexes would be appropriate to maintain at the local site, eliminating some traffic.

3.4 Empirical Analysis

3.4.1 Experimental Setup

Data Sets

Several data sets were used in the experiments. The variation in data sets is intended to show the applicability of our algorithm to a wide range of data sets and to measure the effect that data correlation has on the results. The data sets used include:

• random - a set of 100,000 records consisting of 100 dimensions of uniformly distributed integers between 0 and 999. The data is not correlated.

• stocks - a set of 6,500 records consisting of 360 dimensions of daily stock market prices. This data is extremely correlated.

• mlb - a set of 33,619 records of major league pitching statistics from between the years 1900 and 2004, consisting of 29 dimensions of data. Some dimensions are correlated with each other, while others are not at all correlated.

Analysis Parameters

The effect of varying several analysis input parameters, including the support, the multi-dimensional histogram size, and the online indexing control feedback decision thresholds, was analyzed. Unless otherwise specified, the confidence parameter for the experiments is 1.0.

Query Workloads

It is desirable to explore the general behavior of database interactions without inadvertently capturing the coupling associated with using a specific query history on the same database. Therefore, query workload files were generated by merging synthetic query histories and query histories from real-world applications with different data sets. Histories and data were merged by taking a random record from a data set and the numerical identifiers of the attributes involved in the synthetic or historical query in order to generate a point query. So, if a historical query involved the 3rd and 5th attributes, and the nth record was randomly selected from the data set, a SELECT-type query is generated from the data set where the 3rd attribute is equal to the value of the 3rd attribute of the nth record and the 5th attribute is equal to the value of the 5th attribute of the nth record. This gives a query workload that reflects the attribute correlations within queries. Unless otherwise stated, each query in the experiments has a variable query selectivity and is a point query with respect to

3 shows how accurately the proposed static index selection technique can perform with relaxed analysis constraints.16. 3.3. synthetic . and the remaining 40% are between 1 to 5 attributes that could be any attribute.659 queries executed from a clinical application.9}. Attributes that are not covered by the query can be any value and still match the query. Over the last 300 queries.12. {8. is used.2 Experimental Results Index Selection with Relaxed Constraints Figure 3. 20% {5. the estimated cost of using recommended indexes.500 randomly generated queries. hr .7}.13. These query histories form the basis for generating the query workloads used in our experiments: 1. The query distribution ﬁle has 64 distinct attributes. a low support value.6. 20% {15.19}. 0. The distribution of the queries over the ﬁrst 200 queries is 20% involve attributes {1. Due to the size of this query set.4.the attributes covered by the query.2.4} together. and the true number of answers for 80 .01. and the remaining queries involve between 1 and 5 attributes that could be any attribute. some initial portion of the queries are used for some experiments. 20% {18.35. And 256 bits are used for the multi-dimension histogram. the distribution shifts to 20% covering attributes {11. Figure 3. 3. clinical . For this scenario.3 shows the cost of performing a sequential scan over 100 queries using the indicated data sets. 2. 20%. The query distribution ﬁle has 54 attributes. There are no constraints placed on the number of selected indexes.17}.14}.860 queries executed from a human resources application.
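The workload-merging procedure described above can be sketched as follows. This is an illustrative Python sketch; the function names and data layout are assumptions, not the thesis's implementation.

```python
import random

def merge_history_with_data(attribute_sets, data, seed=0):
    """Merge a query history with a data set: each historical query
    contributes only the identifiers of the attributes it touched, and the
    query values are drawn from a randomly chosen record of the target data
    set, so the generated point queries reflect that data set's values."""
    rng = random.Random(seed)
    workload = []
    for attrs in attribute_sets:
        record = rng.choice(data)          # the random "nth record"
        # point query: attribute index -> required value; uncovered
        # attributes are unconstrained and match any value
        workload.append({a: record[a] for a in attrs})
    return workload

def matches(record, query):
    """A record satisfies a point query iff it agrees on every covered attribute."""
    return all(record[a] == v for a, v in query.items())
```

Because the query values are copied from an actual record, the record they came from always satisfies the generated query, which is what ties the workload to the data set's value distribution.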

Figure 3.3: Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin, Naive, and the Proposed Index Selection Technique using Relaxed Constraints

For comparative purposes, costs for indexes proposed by AutoAdmin [15] and a naive algorithm are also provided, using the data sets and query workloads indicated. The naive algorithm uses indexes for the first n most frequent itemsets in the query patterns, where n is the number of indexes suggested by the proposed algorithm. Note that the cost for sequential scan is discounted to account for the fact that sequential access is significantly cheaper than random access. A factor of 10 is applied as the ratio of random access cost to sequential access cost. While the actual ratio for a given environment is variable, regardless of any reasonable factor used, the graph will show the cost of using our indexes much closer to the ideal number of page accesses than the cost of sequential scan. Table 3.4 compares the proposed index selection algorithm with relaxed constraints against SQL Server's index selection tool.

Data Set/Workload   Analysis Tool   Time(s)   % Queries Improved   Number of Indexes
stock/clinical      AutoAdmin       450       100                  23
stock/clinical      Proposed        110       100                  18
stock/hr            AutoAdmin       338       100                  20
stock/hr            Proposed        160       100                  16
mlb/clinical        AutoAdmin        15         0                   0
mlb/clinical        Proposed        522        87                  16

Table 3.4: Comparison of proposed index selection algorithm with AutoAdmin in terms of analysis time and % queries improved

For the stock dataset using both the clinical and hr workloads, both algorithms suggest indexes which will improve all of the queries. SQL Server's index recommendations are all single-dimension indexes for all attributes that appear in the workload. Since the selectivity of these queries is low (the queries return a low number of matches using any index that contains a queried attribute), the amount of the query improvement will be very similar using either recommended index set. However, the proposed algorithm executes in less time and generates indexes which are more representative of the query patterns in the workload and allow for greater subspace pruning. For example, the most common query in the clinical workload is to query over both attributes 11 and 44 together, and our first index recommendation is a 2-dimension index built over attributes 11 and 44.

For the mlb dataset, SQL Server quickly recommended no indexes. Our index selection takes longer in this instance, but finds indexes that improve 87% of the queries. These are all the queries that have selectivities low enough that an index can be beneficial. It should be noted that the indexes selected by AutoAdmin are appropriate based on the cost models for the underlying index structure used in SQL Server.

Since the underlying index type for SQL Server is B+-trees, which do not index multiple dimensions concurrently, the single-dimension indexes that are recommended are the least costly possible indexes to use for the query set. For this workload, a single-dimensional index is not able to prune the subspace enough to make the application of that index worthwhile compared to sequential scan.

Effect of Support and Confidence. Table 3.5 presents results on analysis complexity and expected query improvement as support and confidence are varied. The results are shown for the stock data set over the clinical query workload. Results show the total number of query-index pairs analyzed over the query set in the index selection loop, and the total estimated improvement in query performance, in terms of data objects accessed over the query set, as the index selection parameters vary. As confidence decreases, we maintain fewer potential indexes in P and need to analyze fewer attribute sets per index. This decreases analysis time but shows very little effect on overall performance. Using a confidence level of 0% is equivalent to using the maximal itemsets of attributes that meet support criteria as recommended indexes. For this particular experiment, that strategy yields nearly identical estimated cost although only 34-44% of the query/index pairs need to be evaluated.

Baseline Online Indexing Results. A number of experiments were performed to demonstrate the effectiveness and characteristics of the adaptive system. In each of the experiments that show the performance of the adaptive system over a sequence of queries, an initial set of indexes is recommended based on the first 100 queries given the stated analysis parameters.
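The role of the support and confidence parameters can be illustrated with a small frequent-attribute-set sketch. This is an assumed Apriori-style simplification with invented names, not the thesis's algorithm; it shows how a support threshold filters attribute combinations and how confidence 0 reduces to keeping only maximal itemsets.

```python
from collections import Counter
from itertools import combinations

def frequent_attribute_sets(workload, support):
    """Count how often each combination of queried attributes occurs and
    keep those meeting the support threshold (fraction of queries).
    Every non-empty subset of a query's attributes is covered by it."""
    counts = Counter()
    for attrs in workload:
        attrs = tuple(sorted(attrs))
        for r in range(1, len(attrs) + 1):
            for combo in combinations(attrs, r):
                counts[combo] += 1
    n = len(workload)
    return {s for s, c in counts.items() if c / n >= support}

def maximal_sets(sets_):
    """Confidence = 0 corresponds to recommending only the maximal frequent
    itemsets: drop any set strictly contained in another frequent set."""
    return {s for s in sets_
            if not any(set(s) < set(t) for t in sets_)}
```

Lowering the support threshold admits more candidate attribute sets (more analysis work), while lowering confidence toward 0 trims the candidates down toward the maximal itemsets, mirroring the trade-off reported in Table 3.5.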

Support (%)                          2      4      6      8     10
Query/Index Pairs   Confidence 100  229    198    185    178    170
                    Confidence 50   229    198    185    178    170
                    Confidence 0     90     85     73     66     58
Total Improvement   Confidence 100  57710  55018  46924  42533  37527
                    Confidence 50   57710  55018  46924  42533  37527
                    Confidence 0    57603  55001  46907  42516  37510

Table 3.5: Comparison of Analysis Complexity and Query Performance as support and confidence vary, stock dataset, clinical workload

For the online system, new queries are evaluated, and depending on the parameters of the adaptive system, changes are made to the index set. These index set changes are considered to take place before the next query. The estimated costs of the last 100 queries, given the indexes available at the time of each query, are accumulated and presented. The baseline parameters used are 5% support, a window size of 100, an indexing constraint of 10 indexes, 128 bits for each multi-dimension histogram entry, a major change threshold of 0.9, and a minor change threshold of 0.95.

Figures 3.4 through 3.6 show a baseline comparison between using the query pattern change detection and modifying the indexes, and making no change to the initial index set. A number of data set and query history combinations are examined. Figure 3.4 shows the comparison for the random data set using the synthetic query workload. At query 200, this workload drastically changes, and this is evident in the query cost for the static system. The performance gets much worse and does not get better for the static system. This demonstrates the evolving suitability of the current index set to incoming queries.

The adaptive system is better able to absorb the change: performance degrades a little when the query patterns are changing and then improves again once the indexes have changed to match the new query patterns. The synthetic query workload was generated specifically to show the effect of a changing query pattern.

Figure 3.4: Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, synthetic Query Workload

Figure 3.5 shows a comparison of performance for the random data set using the clinical query workload. This real query workload also changes substantially before reverting back to query patterns similar to those on which the static index selection was performed. Performance for the static system degrades substantially when the patterns change, until the point that they change back.

Figure 3.5: Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, clinical Query Workload


Figure 3.6: Baseline Comparative Cost of Adaptive versus Static Indexing, stock Dataset, hr Query Workload

Figure 3.6 shows the comparison for the stock data set using the first 2000 queries of the hr query workload. The adaptive system shows consistently lower cost than the static system, and in places significantly lower cost, for this real query workload. The effect of range queries was also explored. The random data set was examined using the synthetic workload with ranges instead of points. As the ranges for a given attribute increased from a point value to 30% selectivity, the overall cost saved by the top 3 proposed indexes (which were the index sets [1,2,3,4], [5,6,7], and [8,9] for all ranges) decreased. This is due to the increase in the size of the answer set.

When the range for each attribute reached 30%, no index was recommended because the result set became large enough that sequential scan was a more effective solution.

Online Index Selection versus Parametric Changes

Changing online index selection parameters changes the adaptive index selection in the following ways:

Support - decreasing the support level has the effect of increasing the potential number of times the index set will be recalculated. It also makes the recalculation of the recommended indexes themselves more costly and less appropriate for an online setting. Figure 3.7 shows the effect of changing support on the online indexing results for the random data set and synthetic query workload, while Figure 3.8 shows the effect on the stock data set using the first 600 queries of the hr query workload. These graphs were generated using the baseline parameters and varying support levels. As expected, lower support levels translate to lower initial costs because more of the initial query set is covered by some beneficial index. For the synthetic query workload, when the patterns changed but did not change again (e.g. queries 100-200 in Figure 3.7), lower support levels translated to better performance. However, for the hr query workload, the 10% support threshold yields better performance than the 6% and 8% thresholds for some query windows. The frequent itemsets change more frequently for lower support levels, and these changes dictate when decisions occur. These runs made decisions at points that turned out to be poor decisions, such as eliminating an index that ended up being valuable. Little improvement in cost performance is achieved between 4% support and 2% support. Over-sensitive control feedback can degrade actual performance, independent of the extra overhead that oversensitive control causes.


Figure 3.7: Comparative Cost of Online Indexing as Support Changes, random Dataset, synthetic Query Workload

Online Indexing Control Feedback Decision Thresholds - increasing the thresholds decreases the response time of effecting change when query pattern change does occur. It also has the effect of increasing the number of times the costly reanalysis occurs. In a real system, one would need to balance the cost of continued poor performance against the cost of making an index set change. Figure 3.9 shows the effect of varying the major change threshold in the coarse control loop for the random data set and synthetic query workload.

Figure 3.8: Comparative Cost of Online Indexing as Support Changes, stock Dataset, hr Query Workload

Figure 3.10 shows the effect of changing the major change threshold for the stock data set and hr query workload. These graphs were generated using the baseline parameters (except that the random graph uses a support of 10%), varying only the major change threshold in the coarse control loop. The major change threshold is varied between 0 and 1. Here, a value of 0 translates to never making a change, and a value of 1 means making a change whenever improvement is possible. Figure 3.9 shows that a value of 1 or 0.9 is best in terms of response time when a change occurs, but a value of 0.3 shows the best performance once the query patterns have stabilized. The graphs show no real benefit once the major change threshold (and therefore the frequency of major changes) increases beyond 0.6. This indicates that this value should be carefully tuned based on the expected frequency of query pattern changes.

Multi-dimensional Histogram Size - increasing the number of bits used to represent a bucket in the multi-dimensional histogram improves analysis accuracy at the cost of histogram generation time and space. Figure 3.11 shows an example of the effect of varying the multi-dimensional histogram size. Using a 32-bit size for each unique histogram bucket yields higher costs than using greater histogram resolution. One reason for this is that artificially higher costs are calculated because of the lower histogram resolution. As well, some beneficial index changes are not recommended because of these overly conservative cost estimates. Experiments demonstrated that if the representation quality of the histogram was sufficient, very little benefit was achieved through greater resolution.
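The major change threshold semantics described above (0 = never change, 1 = change whenever improvement is possible) suggest a decision rule of the following form. This is an illustrative guess at the comparison, with invented names, not the thesis's exact control loop.

```python
def should_make_major_change(current_cost, candidate_cost, threshold):
    """Coarse control-loop decision sketch: with threshold t in [0, 1],
    trigger a major index set change when the candidate index set's
    estimated cost over the query window falls below t times the current
    set's cost. t = 0 never changes (cost cannot be negative); t = 1
    changes whenever any improvement is possible."""
    return candidate_cost < threshold * current_cost
```

Under this reading, a high threshold reacts quickly to pattern changes (any improvement triggers reanalysis) at the price of frequent, costly reanalysis, which matches the reported behavior of values 0.9-1.0 versus 0.3.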

Figure 3.9: Comparative Cost of Online Indexing as Major Change Threshold Changes, random Dataset, synthetic Query Workload

Figure 3.10: Comparative Cost of Online Indexing as Major Change Threshold Changes, stock Dataset, hr Query Workload

Figure 3.11: Comparative Cost of Online Indexing as Multi-Dimensional Histogram Size Changes, random Dataset, synthetic Query Workload

3.5 Conclusions

A flexible technique for index selection is introduced that can be tuned to achieve different levels of constraints and analysis complexity. The technique uses a generated multi-dimension histogram to estimate cost, and as a result is not coupled to the idiosyncrasies of a query optimizer, which may not be able to take advantage of knowledge about correlations between attributes. Indexes are recommended in order to take advantage of multi-dimensional subspace pruning when it is beneficial to do so. A low constraint, more complex analysis can lead to more accurate index selection over stable query patterns. A more constrained, less complex analysis is more appropriate to adapt index selection to account for evolving query patterns.

A control feedback technique is introduced for measuring performance and indicating when the database system could benefit from an index change. The proposed technique affords the opportunity to adjust indexes to new query patterns. By changing the threshold parameters in the control feedback loop, the system can be tuned to favor analysis time or pattern change recognition. A limitation of the proposed approach is that if index set changes are not responsive enough to query pattern changes, then the control feedback may not effect positive system changes. However, this can be addressed by adjusting control sensitivity, or by changing control sensitivity over time as more knowledge is gathered about the query patterns. These experiments have shown great opportunity for improved performance using adaptive indexing over real query patterns. The foundation provided here will be used to explore this tradeoff and to develop an improved utility for real-world applications.

From initial experimental results, it seems that the best application for this approach is to apply the more time consuming no-constraint analysis in order to determine an initial index set, and then apply a lightweight and low control sensitivity analysis for the online query pattern change detection, in order to avoid, or make the user aware of, situations where the index set is not at all effective for the new incoming queries. In this thesis, the frequency of index set change has been affected through the use of the online indexing control feedback thresholds. Alternatively, the frequency of performance monitoring could be adjusted to achieve similar results and to appropriately tune the sensitivity of the system. This monitoring frequency could be in terms of either a number of incoming queries or elapsed time.

The proposed online change detection system utilized two control feedback loops in order to differentiate between inexpensive and more time consuming system changes. In practice, the fine grain control threshold was not triggered unless we contrived a situation such that it would be triggered. The kinds of low cost changes that this threshold would trigger are not the kinds of changes that make enough impact to be that much better than the existing index set. This would change if the indexing constraints were very low and one potential index became more valuable than a suggested index.

Index creation is quite time-consuming. It is not feasible to perform real-time analysis of incoming queries and generate new indexes when the patterns change. Potential indexes could be generated prior to receiving new queries and, when indicated by online analysis, moved to active status. This could mean moving an index from local storage to main memory, or from remote storage to local storage, depending on the size of the index.

Alternatively, the analysis could prompt a server to create a potential index as the analysis becomes aware that such an index is useful, and once it is created, it could be called on by the local machine.

CHAPTER 4

Multi-Attribute Bitmap Indexes

The significant problems we have cannot be solved at the same level of thinking with which we created them.
- Albert Einstein (attributed)

An access structure built over multiple attributes affords the opportunity to eliminate more of a potential result set within a single traversal of the structure than a single attribute structure. However, this added pruning power comes with a cost. Query processing becomes more complex as the number of potential ways in which a query can be executed increases. The multi-attribute access structure may also not be effective for general queries, as a multi-attribute index can only prune results for those attributes that match the selection criteria of the query. This chapter presents multi-attribute bitmap indexes. It describes the changes required when modifying a traditionally single attribute access structure to handle multiple attributes, in terms of enhancing the query processing model and adding cost-based index selection.

4.1 Introduction

Bitmap indexes are widely used in data warehouses and scientific applications to efficiently query large-scale data repositories. They are successfully implemented in commercial Database Management Systems, e.g. Oracle [2], IBM [46], Informix [33], and Sybase [20], and have found many other applications such as visualization of scientific data [56]. Bitmap indexes can provide very efficient performance for point and range queries thanks to fast bit-wise operations supported by hardware. In their simplest and most popular form, equality encoded bitmaps, also known as projection indexes, are constructed by generating a set of bit vectors where a '1' in the ith position of a bit vector indicates that the ith tuple maps to the attribute-value associated with that particular bit vector. Each attribute is encoded independently. In general, the mapping can be used to encode a specific value, category, or range of values for a given attribute. Queries are then executed by performing the appropriate bit operations over the bit vectors needed to answer the query. A range query over a value-encoded bitmap can be answered by ORing together the bit vectors that correspond to the range for each attribute, and then ANDing together these intermediate results between attributes. The resultant bit vector has set bits for the tuples that fall within the range queried.

One disadvantage of bitmap indexes is the amount of space they require. Compression is used to reduce the size of the overall bitmap index. General compression techniques are not desirable, as they require bitmaps to be decompressed before executing the query. Run length encoders, on the contrary, allow query execution over the compressed bitmaps. The two most popular are Byte-Aligned Bitmap Code (BBC) [2] and Word Aligned Hybrid (WAH) [57]. The CPU time required to answer queries is related to the total length of the compressed bitmaps used in the query computation.

These traditional bitmap indexes are analogous to single dimension indexes in traditional indexing domains. As an example, consider a 2-dimensional query over a large dataset. In order to avoid sequential scan over the data, we build an index access structure that can direct the search by pruning the space. If single dimensional indexes, such as B+-Trees or projection indexes, are available over the two attributes queried, the query would find candidates for each attribute and then combine the candidate sets to obtain the final answer. In the case of B+-Trees, the set intersection operation is used to combine the sets. In the case of projection indexes, the AND operation is used to combine the bitmaps. However, if a 2-dimensional index such as an R-tree or Grid File exists over the two queried attributes, then the data space can be pruned simultaneously over both dimensions. Similarly, just as traditional multi-attribute indexes can be more effective in pruning the data search space and reducing query time, multi-attribute bitmap indexes could be applied to more efficiently answer queries.

This cost savings comes from two sources. First, the bit vectors built over two or more attributes are sparser, which allows for greater compression; this results in a smaller result set if the two dimensions indexed are not highly correlated. The resulting smaller bitmaps make the processing of the bitmap faster, which translates into faster query execution time. Second, the number of bitwise operations is potentially reduced, as more than one attribute is evaluated simultaneously: a multi-dimensional bitmap over a combination of attributes and values identifies those records that match the query combination and avoids a further bit operation in order to answer the query criteria. Also, a set intersection operation is not required.
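The projection-index query model described above (OR the bit vectors within each attribute's range, then AND across attributes) can be sketched as follows. This is an illustrative Python sketch using integers as bit vectors, with invented names, not the thesis's implementation.

```python
def build_projection_index(column, cardinality):
    """Equality-encoded (projection) bitmaps: one bit vector per attribute
    value; bit i is set iff tuple i has that value. Python ints serve as
    arbitrary-length bit vectors here."""
    bitmaps = [0] * cardinality
    for i, v in enumerate(column):
        bitmaps[v] |= 1 << i
    return bitmaps

def range_query(index_per_attr, ranges):
    """Answer conjunctive range predicates: OR the bit vectors inside each
    attribute's [lo, hi] range, then AND the per-attribute results."""
    result = ~0  # all tuples start as candidates
    for bitmaps, (lo, hi) in zip(index_per_attr, ranges):
        ored = 0
        for v in range(lo, hi + 1):
            ored |= bitmaps[v]
        result &= ored
    return result
```

For example, with columns a = [0, 1, 2, 1] and b = [1, 0, 1, 1], the query "a in [1, 2] AND b = 1" keeps exactly tuples 2 and 3.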

In this thesis, the use of Multi-Attribute Bitmaps is proposed to improve the performance of queries in comparison with traditional single attribute indexes. We propose an expected cost-improvement based technique to evaluate potential attribute set combinations. The proposed approaches to attribute combination and query processing are based on measuring the expected query cost savings of options. Although the techniques incorporate heuristics in order to control analysis search space and query execution overhead, experiments show significant opportunity for query execution performance improvement. The goals of this research are to:

• Evaluate the query performance of a multi-attribute bitmap enhanced data management system compared to the current bitmap index query resolution model.

• Given a selection query over a set of attributes and a set of single and multiple attribute bitmaps available for those attributes, determine the bitmap query execution steps that result in the best query performance with minimal overhead.

• Support ad hoc queries where information about expected queries is not available, and support the experimental analysis provided herein.

4.2 Background

Bitmap Compression. Bitmap tables need to be compressed to be effective on large databases. The two most popular compression techniques for bitmaps are the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) code [59].

BBC stores the compressed data in Bytes, while WAH stores it in Words. Unlike traditional run length encoding, these schemes mix run length encoding and direct storage. WAH is simpler because it only has two types of words: literal words and fill words. The most significant bit indicates the type of word we are dealing with. Let w denote the number of bits in a word. The lower (w − 1) bits of a literal word contain the bit values from the bitmap. If the word is a fill, then the second most significant bit is the fill bit, and the remaining (w − 2) bits store the fill length. WAH imposes the word-alignment requirement on the fills. This requirement is key to ensure that logical operations only access words. In this example, we assume 32 bit words. Under this assumption, each literal word stores 31 bits from the bitmap, and each fill word represents a multiple of 31 bits.

Figure 4.1 shows a WAH bit vector representing 133 bits. The second line in Figure 4.1 shows how the bitmap is divided into 31-bit groups, and the third line shows the hexadecimal representation of the groups. The last line shows the values of the WAH words.

133 bits        1*1, 20*0, 4*1, 78*0, 30*1
31-bit groups   1*1,20*0,4*1,6*0   62*0                10*0,21*1   9*1
groups in hex   400003C0           00000000 00000000   001FFFFF    000001FF
WAH (hex)       400003C0           80000002            001FFFFF    000001FF

Figure 4.1: A WAH bit vector.

Since the first and third groups do not contain greater than a multiple of 31 of a single bit value, they are represented as literal words (a 0 followed by the actual bit representation of the group). The second group contains a multiple of 31 0's and is therefore represented as a fill word (a 1, followed by the 0 fill bit, followed by the number of fills in the last 30 bits).
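The WAH layout just described can be sketched as a small encoder. This is a hedged illustration assuming 32-bit words and the literal/fill/active-word conventions above, not the reference implementation; it reproduces the words of Figure 4.1.

```python
def wah_encode(bits, w=32):
    """Encode a list of 0/1 bits into 32-bit WAH words. Homogeneous runs of
    (w-1)-bit groups become fill words (MSB = 1, next bit = fill value,
    low 30 bits = run length in groups); mixed groups become literal words
    (MSB = 0, low 31 bits = the group). Returns (words, active_word,
    active_len), where the active word holds the trailing partial group."""
    g = w - 1                              # 31 bits per group for w = 32
    words, i, n = [], 0, len(bits)
    while n - i >= g:
        group = bits[i:i + g]
        i += g
        if len(set(group)) == 1:           # homogeneous group: start a fill
            fill, count = group[0], 1
            while n - i >= g and bits[i] == fill and len(set(bits[i:i + g])) == 1:
                count += 1
                i += g
            words.append((1 << (w - 1)) | (fill << (w - 2)) | count)
        else:                              # mixed group: literal word
            words.append(int("".join(map(str, group)), 2))
    tail = bits[i:]
    active = int("".join(map(str, tail)), 2) if tail else 0
    return words, active, len(tail)
```

Encoding the 133-bit vector of Figure 4.1 yields the words 400003C0, 80000002, 001FFFFF, with active word 000001FF holding the final 9 bits.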

The fill word 80000002 indicates a 0-fill two words long (containing 62 consecutive zero bits). The first three words are regular words: two literal words and one fill word. The fourth word is the active word; it stores the last few bits that could not be stored in a regular word. Another word with the value nine, not shown, is needed to store the number of useful bits in the active word.

Bit operations over the compressed WAH bitmap file have been shown to be faster than BBC (2-20 times) [57], while BBC gives a better compression ratio. In this thesis we use WAH to compress the bitmaps due to the impact on query execution times. Using WAH compression, queries are executed by performing logical operations over the compressed bitmaps, which result in another compressed bitmap.

B1         B2         B1 AND B2
400003C0   00000100   00000100
80000002   C0000003   80000002
001FFFFF   00001F08   001FFFFF
000001FF              00000108

Figure 4.2: Query Execution Example over WAH compressed bit vectors.

As an example, consider the two WAH compressed bit vectors in Figure 4.2. First, the first word of each bit vector is decoded, and it is determined that they are both literal words. Therefore, the logical AND operation is performed over the two words and the resultant word is stored as the first word in the result bitmap. When the next word is read from both bitmaps, we have two fill words. The fill word from B1 is a zero-fill word compressing 2 words, and the fill word from B2 is a one-fill word compressing 3 words.
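As an assumption-laden cross-check of the AND walk-through, the sketch below decodes WAH words back into plain bits and performs the AND on the decompressed form; a compressed-domain algorithm must reproduce exactly these bits. Names are illustrative, not from the thesis.

```python
def wah_decode(words, active, active_len, w=32):
    """Expand 32-bit WAH words (plus the active word) back into a bit list.
    Fill words contribute count * (w-1) copies of the fill bit; literal
    words contribute their low 31 bits; the active word contributes its
    active_len trailing bits."""
    g = w - 1
    bits = []
    for word in words:
        if word >> (w - 1):                        # fill word
            fill = (word >> (w - 2)) & 1
            count = word & ((1 << (w - 2)) - 1)
            bits.extend([fill] * (count * g))
        else:                                      # literal word
            bits.extend((word >> (g - 1 - k)) & 1 for k in range(g))
    bits.extend((active >> (active_len - 1 - k)) & 1 for k in range(active_len))
    return bits

def bitwise_and(x, y):
    """AND two equal-length decompressed bit lists (the semantics the
    compressed-domain operation must preserve)."""
    return [a & b for a, b in zip(x, y)]
```

Decoding the Figure 4.1 vector recovers its original 133 bits, and ANDing a vector with itself leaves it unchanged, as expected.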

In this case, we operate over 2 words simultaneously by ANDing the fills and appending a fill word into the result bitmap that compresses two words (the minimum between the two word counts). Since we still have one word left from B2, we only read a word from B1 this time, and AND it with the all-1 word from B2. Finally, the last two words are read from the two bitmaps and are ANDed together, as they are literal words. A full description of the query execution process using WAH can be found in [59].

Relationship Between Bitmap Index Size and Bitmap Query Execution Time. Query execution time is proportional to the size of the bitmaps involved in answering the query [56]. To illustrate the effect of bit density, and consequently bitmap size, on query times over compressed bitmaps, we perform the following experiment. We generated 6 uniformly random bitmaps representing 2 million entries for a number of different bit densities and ANDed the bitmaps together. Figure 4.3 shows average query time against the total bitmap length of the bitmaps used as operands in the series of operations. The expected size (in words) of a random bitmap with N bits and bit density d, compressed using WAH, is given by the formula [59]:

    m_R(d) ≈ (N / (w − 1)) · [1 − (1 − d)^(2w−2) − d^(2w−2)]

where w is the word size, and d is the bit density. The lower the bit density, the more quickly and more effectively intermediate results can be compressed as the number of AND operations increases. Figure 4.3 shows a clear relationship between query times and bitmap length. The different slopes of the curve are due to changes in frequency of the type of bit operations performed over the word size chunks of the bitmap operands. At low bit densities, there is a greater portion of fill word to fill word operations. As the bit density increases, more fill word to literal word operations occur. At high bit densities, literal word to literal
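The expected-size formula can be transcribed directly; the helper name below is illustrative.

```python
def expected_wah_words(n_bits, density, w=32):
    """Expected WAH-compressed size (in words) of a uniformly random bitmap,
    per the formula above: N/(w-1) * (1 - (1-d)^(2w-2) - d^(2w-2))."""
    d = density
    return n_bits / (w - 1) * (1 - (1 - d) ** (2 * w - 2) - d ** (2 * w - 2))
```

At d = 0.5 the two run terms are negligible (0.5^62), so a random bitmap is essentially incompressible at about N/31 words, while at very low or very high densities long runs make the expected size far smaller.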

Figure 4.3: Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps, 2M entries

word operations become more frequent. As these inter-word operations have different costs, the overall query time to bitmap length relationship is not linear. However, it is clear that query execution time is related to overall bitmap length.

Relationship Between Number of Bitmap Operations and Bitmap Query Execution Time. Figure 4.4 shows AND query times over WAH-compressed bitmaps as the number of bitmaps ANDed varies. Again the bitmaps are randomly generated with the reported bit densities for 2 million entries, and the number of bitmaps involved in an AND query is varied. There is a clear relationship between query times and the number of bitmap operations.


Figure 4.4: Query Execution Times vs Number of Attributes queried, Point Queries, 2M entries

Since query performance is related to the total bitmap length and the number of bitmap operations, a logical question is "can we modify the representation of the data in such a way that we can decrease these factors?" One could combine attributes and generate bitmaps over multiple attributes in order to accomplish this. This is because a combination over multiple attributes has a lower bit density and is potentially more compressible, and it already has an operation that may be necessary for query execution built into its construction. However, several research questions arise in this scenario, such as which attributes should be combined, how the additional indexes can be utilized, and what additional burden is added by the richer index sets.


Technical Motivation for Multi-Attribute Bitmaps. One could think of multi-attribute bitmaps as the AND bitmap of two or more bitmaps from different attributes. Such a representation can have the effect of reducing the size of the bitmaps that are used to answer a query and can also reduce the number of operations. As attributes are combined, bitcolumns become sparser, and the opportunity for compressing each column increases, potentially decreasing the size of a bitcolumn. Additionally, if a bit operation is already incorporated in the multi-attribute bitmap construction, then it does not need to be evaluated as part of the query execution. One could argue that multi-attribute bitmaps are reducing the dimensionality of the dataset but increasing the cardinality of the attributes involved in the new dataset. Although bitmap indexes have historically been discouraged for high-cardinality attributes, the advances in compression techniques and query processing over compressed bitmaps have made bitmaps an effective solution for attributes with cardinalities on the order of several hundreds [58].

As an illustrative example of enhancing compression by combining attributes, consider a bitmap index built over two attributes, each with cardinality 10, where the data is uniformly randomly distributed independently for each attribute for 5000 tuples. Using WAH compression, there are very few instances where there are long enough runs of 0's to achieve any compression (since out of 10 random tuples, we would expect 1 set bit for a value).

Figure 4.5 shows a simple example combining two single attributes with cardinality 2 into one 2-attribute bitmap with cardinality 4. In the table, the combined column labeled b1.b2 means that the value of attribute 1 maps to b1 and the value of attribute 2 maps to b2.

little compression was achieved, and bitmaps representing each of the 20 attribute-value pairs were between 640 and 648 bytes long. As an alternative, we can generate a bitmap index for the combination of the two attributes. Now the cardinality of the combined attributes is 100, and the expected number of set bits in a run of 100 tuples for a single value is 1. We can expect much greater compression for this situation. In such a test, the generated size of the 100 2-D bitmaps was between 220 and 400 bytes.

Tuple | Attribute 1 | Attribute 2 | Combined
      | b1  b2      | b1  b2      | b1.b1  b1.b2  b2.b1  b2.b2
t1    | 1   0       | 0   1       | 0      1      0      0
t2    | 0   1       | 1   0       | 0      0      1      0
t3    | 1   0       | 1   0       | 1      0      0      0
t4    | 1   0       | 0   1       | 0      1      0      0
t5    | 0   1       | 0   1       | 0      0      0      1
t6    | 0   1       | 1   0       | 0      0      1      0

Figure 4.5: Simple multiple attribute bitmap example showing the combination of two single attribute bitmaps.

Now consider a range query such that attribute 1 must be between values 1 and 2, and attribute 2 must be equal to 3. The operations using the traditional bitmap processing to answer the query are ((ATT1=1) OR (ATT1=2)) AND (ATT2=3). The time required is dependent on the total length of the bitmaps involved. Each of these bitmaps, as well as the resultant intermediate bitmap from the OR operation, is 648 bytes, for a total of 2592 bytes. For the multi-attribute case, the query can be answered by (ATT1=1,ATT2=3) OR (ATT1=2,ATT2=3). These bitmaps are 268 and 260 bytes respectively, for a total of 528 bytes. In this scenario, we can reduce

both the size of the bitmaps involved in the query execution and also the number of bitmaps involved. This example demonstrates the opportunity to improve query execution by representing multiple attribute values from separate attributes in a bitmap. In general, point queries would always benefit from the multi-attribute bitmaps as long as all the combined attributes are queried. However, if only some of the attributes were queried, or a large range from one or more attributes were queried, then the single attribute bitmaps would perform better than the multi-attribute bitmaps. This is the reason why we propose to supplement the single attribute bitmaps with the multi-attribute bitmaps, and not to replace the single attribute bitmaps. Multi-attribute bitmaps would only be used in the queries that result in an improved query execution time. Supplementing single attribute bitmaps with additional bitmaps that improve query execution time has also been done in [50], where lower resolution bitmaps were used in addition to the single attribute bitmaps or high resolution bitmaps. One could think of that approach as storing the OR of several bitmaps from the same attribute together.

4.3 Proposed Approach

In this section we present our proposed solution for the problems presented in the previous section. The primary overall tasks can be summarized as multi-attribute bitmap index selection and query execution. In index selection, we determine which attributes are appropriate, or the most appropriate, to combine together. This can be performed with varying degrees of knowledge about a potential query set. Once candidate attribute combinations are determined, the combined bitmaps are generated.
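Generating a combined bitmap by ANDing the constituent single attribute bitcolumns, and the equivalence of the two query paths, can be illustrated with a minimal Python sketch over the six tuples of Figure 4.5 (uncompressed bit lists; all names are illustrative and not taken from the actual implementation):

```python
from itertools import product

# Toy data matching Figure 4.5: two attributes with cardinality 2, six tuples.
att1 = [1, 2, 1, 1, 2, 2]
att2 = [2, 1, 1, 2, 2, 1]

def value_bitmaps(column, cardinality):
    """One bitcolumn (list of 0/1) per attribute value."""
    return {v: [1 if x == v else 0 for x in column]
            for v in range(1, cardinality + 1)}

def AND(a, b): return [x & y for x, y in zip(a, b)]
def OR(a, b):  return [x | y for x, y in zip(a, b)]

b1 = value_bitmaps(att1, 2)
b2 = value_bitmaps(att2, 2)

# Each 2-attribute bitcolumn is the AND of two single attribute bitcolumns.
combined = {(v1, v2): AND(b1[v1], b2[v2]) for v1, v2 in product((1, 2), (1, 2))}

# Query ATT1 in {1, 2} AND ATT2 = 2, answered both ways.
single_path = AND(OR(b1[1], b1[2]), b2[2])
multi_path = OR(combined[(1, 2)], combined[(2, 2)])
assert single_path == multi_path == [1, 0, 0, 1, 1, 0]
```

With run-length compression applied to such bitcolumns, the combined columns are the sparser and more compressible ones, which is the effect measured in the 5000-tuple experiment above.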

In query execution, the candidate pool of bitmaps is explored to determine a plan for query execution.

4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries

In many cases the attributes that should be combined together as multi-attribute bitmaps are obvious given the context of the data and queries. As an example, consider a database system behind a car search web form. If the form has categorical entries for color and model, then a multi-attribute bitmap built over these two attributes can directly filter those data objects that meet both search criteria without further processing. Whenever multiple attributes are frequently queried together over a single bitmap value, that combination of attributes is a candidate for multi-attribute indexing.

An effective index selection technique for bitmap indexes should be cognizant of the relationships between attributes. The goal behind bitmap index selection is to find sets of attributes that, if combined, can yield a large benefit in terms of query execution time. The analogy within the traditional multi-attribute index selection domain would be finding attributes that are not individually effective in pruning the data space, but can effectively prune the data space when combined together. As well, the selectivity estimated for a given attribute combination should accurately reflect the true data characteristics. In order to support situations where no contextual information is available about queries, and in order to allow for comparative performance evaluation, we provide a cost-improvement-based candidate evaluation system. Since query execution time is dependent both on the total size of the operands involved in a bitmap computation and also on the number of bitmap operations, we use

these measures to determine the potential benefit for a given attribute combination. Algorithm 3 shows how candidate multi-attribute indexes are evaluated and selected in terms of potential benefit.

Determining Index Selection Candidates

SelectCombinationCandidates (Ordered set of attributes A, dLimit, cardLimit)
notes:
attSets - a list of attribute sets and corresponding goodness measures
selected - a set of recommended attribute sets to build indexes over
1: attSets = {[{ai}, 0] | ai ∈ A}
2: currentDim = 1
3: while (currentDim < dLimit AND newSetsAdded)
4:   unset newSetsAdded
5:   for each asi ∈ attSets, s.t. |asi| = currentDim
6:     for each attribute aj, j = (maxAtt(asi)+1) to |A|
7:       if (asi.card * aj.card < cardLimit)
8:         asj = [asi ∪ {aj}, detGoodness(asi, aj)]
9:         if goodness(asj) > goodness(asi)
10:          add asj to attSets, set newSetsAdded
11:  currentDim++
12: sort attSets by goodness
13: ca = ∅
14: while ca ≠ A
15:   attribute set as = next set from attSets
16:   if as ∩ ca = ∅
17:     ca = ca ∪ as.attributes
18:     include as in selected

Algorithm 3: Selecting Attribute Combination Candidates.
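The candidate generation and greedy selection of Algorithm 3 might be rendered in Python as follows (an illustrative sketch, not the dissertation's Java implementation; per-attribute cardinalities and a caller-supplied goodness function, standing in for detGoodness, are assumptions of the sketch):

```python
def select_candidates(cards, d_limit, card_limit, goodness):
    """cards: attribute number -> cardinality.
    goodness(attr_set): estimated benefit of indexing attr_set together.
    Returns recommended, mutually disjoint attribute sets (greedy)."""
    att_sets = {frozenset([a]): 0.0 for a in cards}          # line 1
    frontier = list(att_sets)
    dim = 1
    while dim < d_limit and frontier:                        # line 3
        new_sets = []
        for s in frontier:                                   # line 5
            card_s = 1
            for a in s:
                card_s *= cards[a]
            for a in range(max(s) + 1, max(cards) + 1):      # line 6
                if a in cards and card_s * cards[a] < card_limit:  # line 7
                    cand = frozenset(s | {a})
                    g = goodness(cand)                       # line 8
                    if g > att_sets[s]:                      # line 9
                        att_sets[cand] = g
                        new_sets.append(cand)                # line 10
        frontier = new_sets
        dim += 1                                             # line 11
    covered, selected = set(), []                            # lines 12-18
    for s in sorted(att_sets, key=att_sets.get, reverse=True):
        if not (s & covered):
            covered |= s
            selected.append(set(s))
        if covered == set(cards):
            break
    return selected
```

As in the pseudocode, a larger set is only explored when it improves on the goodness of the subset it extends, and the final pass greedily picks disjoint sets in order of goodness.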

The algorithm is somewhat analogous to the Apriori method for frequent itemset mining, in that a given candidate set is only evaluated if its subset is considered beneficial. The algorithm takes several parameters as input. These include an array, A, of the attributes under analysis. The parameter dLimit provides a maximum attribute set combination size. The maximum cardinality allowed for a given prospective combination can be controlled by the cardLimit parameter. The card value associated with an attribute set is the number of value combinations possible for the attribute set, and reflects the number of bitmaps that would have to be generated to index based on the set.

The algorithm evaluates increasingly larger attribute sets (loop starting at line 3). The term newSetsAdded in line 3 is a boolean condition indicating if any attribute set was added during the last loop iteration. Initially, all size 1 attribute sets are combined with each other size 1 attribute set, such that each unique size 2 attribute set is evaluated. In the next iteration of the outer loop, each potentially beneficial size 2 set is evaluated with relevant size 1 attributes (note that by evaluating only those singleton sets of attributes numbered larger than the largest in the current set, we can evaluate all unique set instances). Each unique combination is evaluated (lines 5-6). If the number of bitmaps generated by this combination is excessive (line 7), the set will not be evaluated. A goodness measure is evaluated for the proposed combination (line 8). If the goodness of the combination exceeds the goodness of its ancestor, then the combination set is added to the list of attribute sets for further evaluation during the next loop iteration. Once the loop terminates, the sets in attSets are sorted by their goodness measures (line 12). Then sets of attributes that are disjoint from the covered attributes,

ca, are greedily selected from the sorted sets and added to the recommended bitmap indexes.

The goodness of a particular attribute set is computed by Algorithm 4.

Estimating Performance Gain for a Potential Index

detGoodness (attribute set asi, attribute aj)
1: set probQT and probSR
2: goodness = 0
3: for each bitcolumn bck in asi
4:   for each bitcolumn bcj in aj
5:     bitcolumn cb = AND(bck, bcj)
6:     savings = bck.length + bcj.length + bck.creationLength − cb.length
7:     if savings > 0
8:       goodness += savings ∗ cb.matches
9: goodness ∗= (probQT ∗ probSR)^(|asi|+1)
10: return goodness

Algorithm 4: Determine goodness of including an attribute in a multi-attribute bitmap.

The variables probQT and probSR describe global query characteristics, and are used to estimate diminishing benefits as index dimensionality increases. The probQT parameter represents the probability that an attribute is queried. The probSR parameter is the probability that the query range for an attribute will be small enough such that a multi-attribute index over the query criteria will still be beneficial. As with traditional multi-attribute indexes, a 2-attribute index is not as useful for a single attribute query.
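A sketch of Algorithm 4 in the same spirit (Python; the run-count helper is a crude stand-in for WAH-compressed length, and all names are assumptions for illustration only):

```python
def compressed_len(bitmap):
    """Stand-in for a compressed bitmap length: count runs of equal bits."""
    return 1 + sum(1 for i in range(1, len(bitmap)) if bitmap[i] != bitmap[i - 1])

def det_goodness(cols_asi, cols_aj, creation_len, prob_qt, prob_sr, asi_size):
    """Sketch of Algorithm 4. cols_asi / cols_aj: lists of bitcolumns (0/1 lists)
    for the existing set asi and the candidate attribute aj; creation_len is the
    total length of the bitmaps used to create each asi column (0 for size-1 sets)."""
    goodness = 0.0
    for bck in cols_asi:
        for bcj in cols_aj:
            cb = [x & y for x, y in zip(bck, bcj)]            # combined bitcolumn
            savings = (compressed_len(bck) + compressed_len(bcj)
                       + creation_len - compressed_len(cb))   # line 6
            if savings > 0:                                   # line 7
                goodness += savings * sum(cb)                 # weight by matches
    return goodness * (prob_qt * prob_sr) ** (asi_size + 1)   # line 9
```

Weighting each value combination by its number of matching tuples implements the frequency weighting described below, and the final factor discounts higher-dimensional sets by the probability that all of their attributes are usefully queried.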

Each unique value combination of the entire attribute set (combinations generated in lines 4-5) is evaluated in terms of bitmap length savings over a corresponding single attribute bitmap solution. The savings of a particular combination is measured as the sum of the compressed lengths of the bitmap operands and the compressed length of the bitmaps needed to create the first operand, minus the compressed length of the combined bitmap. For size 1 attribute sets, the bitcolumn creation length is 0. However, for a multiple attribute bitmap, this value is the length of all the constituent bitmaps used to create it. The total savings reflects the difference in total bitmap operand length to answer the combination under analysis. If the savings for a particular combination is negative, then its negative savings is not accumulated (line 7). This is because this option will not be selected during the query execution for this combination, and will not incur an additional penalty. The goodness of an attribute set is measured as the weighted average of the bit-length savings of each value combination (line 8). Each combination is weighted by its frequency of occurrence. This strategy assumes that the query space is similar in distribution to the data space. The final goodness measure for an attribute set is also modified based on the probability that benefit will actually be realized for the query (line 9).

This technique finds attributes for which their combinations will lead to a reduction in the number of operations as well as a reduction in the constituent bitmap lengths. Both of these factors can provide a performance gain depending on the query criteria. Our approach takes into account the characteristics of the data by weighting and measuring the effectiveness of specific inter-attribute combinations. The index selection solution space is controlled by limiting parameters and through

greedy selection of potential solutions. Although the indexes recommended are not guaranteed to be optimal, they can be determined efficiently. The behavior of the algorithm can be affected with slight alteration. For example, if index space is critical, we can order the sets in line 12 of the Select Combination Candidates algorithm (Algorithm 3) by goodness over combined attribute size. If the query is equally likely to occur anywhere in the data space, then the weighting in line 8 of the determine goodness algorithm (Algorithm 4) can be removed (i.e., seemingly less than ideal attribute combinations can still provide performance gains). Later experiments show that the reduction in the number of total operations has a more substantial effect on query execution time than the reduction of bitmap length. Once bitmap combinations are recommended, each combined bitmap can be generated by assigning an appropriate identifier to the bitmap and ANDing together the applicable single attribute bitmaps.

4.3.2 Query Execution with Multi-Attribute Bitmaps

One issue associated with using multiple representations for data indexes is that the query execution becomes more complex. To be effective, the complexity added by the additional representations can not be more costly than the benefits attained by incorporating them. Once multi-attribute bitmaps are available in the system, we need a mechanism to decide when single attribute and/or multi-attribute bitmaps would be used. In order to take advantage of the most useful representation, we need to decide on a query execution plan for each query. The potential execution plans must be evaluated, and the option that is deemed better should be performed. We take a number of measures in order to limit the

overhead associated with selecting the best query execution option. We explore potential query execution options using inexpensive operations.

Global Index Identification. In order to identify the potential index search space, we maintain a data structure of those attributes and combined attributes for which we have bitmap indexes constructed. We keep track of those attribute sets that have bitmap representations (either single attribute or multi-attribute). If we have available a single attribute bitmap for the first attribute, we will include a set identified as [1] in this structure. If we have available a multi-attribute bitmap built over attributes 1, 2, and 3, then we include the set [1,2,3] in the structure. Upon receiving a query, we compare the set representation of the attributes in the query to the sets in the index identification structure. Those sets that have non-empty intersections with the query set are candidates for use in query execution.

Bitmap Naming. We need to be able to easily translate a query to the set of bitmaps that match the criteria of a query. In order to do this, we devise a one-to-one mapping that directly translates query criteria to bitmap coverage. We use a unique mapping for bitmap identification. The key is made up of the attributes covered by the bitmap and the values for those attributes. The attributes are always stored in numerical order, so that there is only one possible mapping for various set orders, and the values are given in the same order as the attributes. For example, a multi-attribute bitmap over attributes 9 and 10, where the value for attribute 9 is 5 and the value for attribute 10 is 3, would get "[9,10]5,3" as its key. Finding relevant bitmaps is then a matter of translating the query into its key to access the bitmaps.
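Assuming the comma-separated rendering of keys used in these examples, the mapping can be sketched as:

```python
def bitmap_key(attributes, values):
    """Build the unique key for a (multi-)attribute bitmap, e.g.
    attributes (9, 10) with values (5, 3) -> '[9,10]5,3'."""
    pairs = sorted(zip(attributes, values))   # attributes in numerical order
    atts = ",".join(str(a) for a, _ in pairs)
    vals = ",".join(str(v) for _, v in pairs)
    return "[%s]%s" % (atts, vals)

# The mapping is order-insensitive on input, so there is exactly one key
# per attribute/value combination.
assert bitmap_key((10, 9), (3, 5)) == "[9,10]5,3"
```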

Query Execution

During query execution we need a lightweight method to check which potential query execution will provide the fastest answer to the query. The logic applied to determine the query execution plan needs to be lightweight, so as to not add too much overhead to handle the increased complexity of bitmap representation. The first step is to compare the set of attributes in the query to the sets of attributes for which we have constructed bitmap indexes. Those indexes which cover an attribute involved in the query are candidates to be used as part of the solution for the query. Once candidate indexes are determined, we compute the total length of the bitmaps required to answer the query. We just need to sum the lengths of bitmaps that match the query criteria, which we can compute using a recursive function. A recursive function is used to compute the total length of all matching bitmaps in order to handle arbitrarily sized attribute sets. For the recursive case, the unique key is built up and passed to the recursive call, and the callee length results are tabulated. The complete unique hash key is generated in the base case of the recursive method, and the length of the matching bitmaps is obtained and returned to the caller.

For example, if the total length of matching bitmaps for the multi-attribute bitmap over attributes [1,2] is less than the total length for both attributes [1] and [2], then the multi-attribute bitmap should be used to answer the query. However, if the multi-attribute bitmaps have greater total length, then the single attribute bitmaps should be used. So rather than incorporating logic that guarantees an optimal minimal total bitmap length, we approximate this behavior using heuristics. We order the

potential attribute sets in descending order based on the amount of length saved by using the index. Then we process the query by greedily selecting non-covered attribute sets from this order until the query is answered.

Algorithm 5 provides the algorithm for executing selection queries when the entire bitmap index has both single attribute and multi-attribute bitmap indexes available to process the query. The algorithm assumes that the bitmap indexes available are identified, and that we can access any bitcolumn through a unique hashkey. It involves two major steps. In the first step, the potential indexes are prioritized based on how promising they are with respect to saving total bitmap length during query execution. The second step performs the bit operations required to answer the query.

As a baseline, all potential single attribute indexes get a value of 0 for the amount of length saved. Multi-attribute indexes get a value which is the difference between the length of the single attribute indexes that would be required to answer the query and the length of the multi-attribute index to answer the query. Those multi-attribute indexes that have positive benefit compared to single attribute indexes will have a positive value, and will be processed before any single attribute index. Those that have negative values would actually slow down the query, and will not be processed before their constituent single attribute indexes.

For example, consider a query over values 1 and 2 for attribute 1, and over value 3 for attribute 2, where we have bitmaps available for the attribute sets [1], [2], and [1,2]. Table 4.1 shows the query matching keys for each of the potential indexes. In lines 1-5, we compute the total size of the bitmaps that match the query, considering those attributes that intersect between the query and the index attribute set currently under analysis. At the end of this step, we will have a total

SelectionQueryExecution (Available Indexes AI, BitColumn Hashtable BI, query q)
notes:
AI - contains identification of sets of indexed attributes, with length and saved fields and query matching keys list
BI - provides access to bitcolumns using a unique key
B1 - a bitcolumn of all 1's

// Prioritize Query Plan
1: for each attribute set as in AI
2:   if as.attributes ∩ q.attributes ≠ ∅
3:     generate query matching keys for as
4:     attach query matching keys to as
5:     sum query matching bitmap lengths
6: for each attribute set as in AI
7:   if as.numAtts == 1
8:     as.saved = 0
9:   else
10:    for each size 1 attribute subset a in as
11:      as.saved += a.length
12:    as.saved -= as.length
13: sort AI by saved field
// Execute Query
14: attribute set ca = null
15: queryResult = B1
16: while ca ≠ q.attributes
17:   as = next attribute set from AI
18:   if as.attributes ∩ ca = ∅
19:     ca = ca ∪ as.attributes
20:     subresult = BI.get(as.keys[0])
21:     for i = 1 to |as.keys| − 1
22:       subresult = OR(subresult, BI.get(as.keys[i]))
23:     queryResult = AND(queryResult, subresult)
24: return queryResult

Algorithm 5: Query Execution for Selection Queries when Combined Attribute Bitmaps are Available.
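The prioritization step (lines 6-13) can be sketched with bitmap lengths taken from the earlier range-query example (Python; the length values and names are illustrative):

```python
def prioritize(indexes, lengths):
    """indexes: attribute tuples with available bitmaps, e.g. [(1,), (2,), (1, 2)].
    lengths: attribute tuple -> total length of its query-matching bitmaps.
    Returns indexes sorted by bitmap length saved (descending)."""
    saved = {}
    for idx in indexes:
        if len(idx) == 1:
            saved[idx] = 0   # single attribute baseline (line 8)
        else:                # lines 10-12
            saved[idx] = sum(lengths[(a,)] for a in idx) - lengths[idx]
    return sorted(indexes, key=lambda idx: saved[idx], reverse=True)

# Query over values 1 and 2 of attribute 1 and value 3 of attribute 2:
# [1] matches two 648-byte bitmaps, [2] one, [1,2] matches 268 + 260 bytes.
lengths = {(1,): 1296, (2,): 648, (1, 2): 528}
plan = prioritize([(1,), (2,), (1, 2)], lengths)
assert plan[0] == (1, 2)   # saves 1296 + 648 - 528 = 1416 > 0
```

The execution step would then walk this order, OR-ing the matching bitcolumns of each not-yet-covered set and AND-ing the subresults together, as in lines 14-24.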

query matching bitmap length for each available index.

Available Index | Matching Keys
[1]             | [1]1, [1]2
[2]             | [2]3
[1,2]           | [1,2]1,3, [1,2]2,3

Table 4.1: Sample Query Matching Keys for Query Attributes 1 and 2.

Note that the computation of bitmap length in line 5 is not actually a simple sum. The bit density of each matching bitmap is estimated based on its length. If the bit density is within the range of about 0.1 to 0.9, then its value can not be determined, and we conservatively estimate it as 0.5. This does not adversely affect query operation, since in these cases the multi-attribute option would not be selected anyway. Since the matching bitmaps will not overlap, these bit densities are summed to arrive at an effective bit density of the resultant OR bitmap. The length of the resultant bitmap is then derived using the effective bit density.

In lines 6-12, we use this initial information in order to compute the total bitmap length saved by using a multi-attribute bitmap in relation to using the relevant single attribute alternatives. All single attribute indexes are assigned a saved value of 0. For a multiple attribute index, we sum up the query matching bitmap length associated with each single attribute subset of the multi-attribute attribute set. The difference between this length and the total length of the query matching bitmaps for the multi-attribute set is the amount of bitmap length saved by using the multi-attribute set. Those multi-attribute indexes that reflect a cost savings will get a positive value

for saved, while those that are more costly will get a negative value. In line 13, we sort the potential indexes in terms of this saved value. In lines 14 and 15, we initialize a set that indicates the attributes covered by the query so far, and initialize the query result to all 1's. Then we start processing the query in order of how promising the prospective indexes were, and continue until all query attributes are covered (line 16). We get the next best index and check if its attributes have already been covered (lines 17-18). If not, we will process a subquery using the index. We update the covered attribute set and OR together all bitcolumns that match the query criteria for the index. We AND this subresult into the overall query result. When we have covered all query attributes, the query is complete and we return the result.

4.3.3 Index Updates

Bitmap indexes are typically used in write-once read-mostly databases such as data warehouses. Appends of new data are often done in batches, and the whole index is rebuilt from scratch once the update is completed. Bitmap updates are expensive, as all columns in the bitmap index need to be accessed. In [12], we propose the use of a pad word to avoid accessing all bitmaps when inserting a new tuple, updating only the bitmaps corresponding to the inserted values. In other words, the number of bitmaps that need to be updated when a new tuple is appended in a table with A attributes is A, as opposed to c_1 + c_2 + ... + c_A, where c_i is the number of bitmap values (or cardinality) for attribute A_i. This approach, which clearly outperforms the alternative, is suitable when the update rate is not very high and we need to maintain the index up-to-date in an online manner.
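The update-cost comparison reduces to counting the bitmaps touched per appended tuple (a sketch of the counting argument, assuming pad-word-style appends that touch only the bitmaps for the inserted values; this also covers the pair-wise combined case discussed next):

```python
def bitmaps_touched_per_insert(cardinalities, pair_indexes=False, padded=True):
    """Count bitmaps accessed when appending one tuple.
    cardinalities: list of per-attribute cardinalities."""
    A = len(cardinalities)
    if not padded:
        return sum(cardinalities)   # every bitcolumn must be extended
    if pair_indexes:
        return A // 2               # one combined bitmap per attribute pair
    return A                        # one bitmap per attribute

# 12 attributes of cardinality 10, as in the uniform data set.
cards = [10] * 12
assert bitmaps_touched_per_insert(cards, padded=False) == 120
assert bitmaps_touched_per_insert(cards) == 12
assert bitmaps_touched_per_insert(cards, pair_indexes=True) == 6
```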

Updating multi-attribute bitmaps is even more efficient than updating traditional bitmaps. If multi-attribute bitmaps are also padded, we only need to access the bitmaps corresponding to the combined values (attributes) that are being inserted. For example, consider the previous table with A attributes, where multi-attribute bitmaps are built over pairs of attributes. The cost of updating the multi-attribute bitmaps is half of the cost of updating the traditional bitmaps, as only A/2 bitmaps need to be accessed.

4.4 Experiments

4.4.1 Experimental Setup

Experiments were executed using a 1.87GHz HP Pavilion PC with 2GB RAM and the Windows Vista operating system. The bitmap index selection, bitmap creation, and query execution algorithms were implemented in Java.

Data

Several data sets are used during the experiments. They are characterized as follows:

• synthetic - 12 attributes, 1M records. Attributes 1, 4, 7, and 10 follow a uniform random distribution; attributes 2, 5, 8, and 11 follow a Zipf distribution with s parameter set to 1; and attributes 3, 6, 9, and 12 follow a Zipf distribution with s parameter set to 2. Attributes 1-3 have cardinality 2, 4-6 have cardinality 5, 7-9 have cardinality 10, and 10-12 have cardinality 25.

• uniform - a uniformly distributed random data set, 1M records. It has 12 attributes, each with cardinality 10.

• census - a discretized version of the USCensus1990raw data set. The discretized dataset comes from the UCI Machine Learning Repository. The raw dataset is from the (U.S. Department of Commerce) Census Bureau website. This data set is made up of more than 2 million records. We used the first 1 million instances and 44 attributes. The cardinality of these attributes varies from 2 to 16. The data set can be obtained from [47].

• HEP - a bin representation of High-Energy Physics data. It has 12 attributes, and the cardinality per dimension varies between 2 and 12. There are a total of 122 bitcolumns for this data set. It contains over 275000 records. This data set is not uniformly distributed.

• Landsat - a bin representation of satellite imagery data. The data set is 60 attributes, each with cardinality 10.

Queries

Query sets were generated by selecting random points from the data and constructing point queries from those points. These points remain in the data set, so that it is guaranteed that there will be at least one match for each query. For those cases where range queries are used, the range is expanded around a selected point so that the range is guaranteed to contain at least one match.

Metrics

Our goal is to compare the performance of a system reflecting the current state of the art, where only single attribute bitmaps are available, against our proposed approach, in which both single and multi-attribute bitmaps are available. We present comparisons in terms of query execution time, number of bitmaps involved in answering a

query, and total bitmap size accessed to answer a query.

4.4.2 Experimental Results

Index Selection

We executed the index selection algorithm over the synthetic data set. We set the cardLimit parameter to 100 and set the query probability parameters to 1 (i.e., no degradation at higher number of attributes). The attribute sets that were selected to combine together were [1,5,7], [2,4,8], and [3,6,9]. Attributes 10, 11, and 12 do not combine with any others in this scenario, because there are no free attributes that would be within cardinality constraints. This solution makes sense in the context of the performance metric of weighted benefit. We would expect more benefit for combinations for which the number of value combinations approaches the cardinality limit, because a sparser bitmap can be more effectively compressed. We would also expect more benefit from combining attributes that are independent. Therefore set [1,5,7] is selected first, because its value combination cardinality is 100 and the values are independent. The other combinations have greater inter-attribute dependency because they follow Zipf distributions.

The average query execution time of 100 point queries using only single attribute bitmaps was 41.47ms, while it was 17.39ms using the recommended indexes. The time for using the recommended index set includes an average overhead of 0.25ms to explore the richer set of options and select the best query execution plan. Query execution assumes the bitmap indexes are in main memory. Further performance enhancement would be possible during query execution if the required bitmaps need to be read from disk.

After executing index selection with parameters probQT and probSR set to 0.5, we instead get 2-attribute recommendations. The proposed set of indexes is

[5,7], [4,8], [6,9], [1,10], [2,11], and [3,12]. Each of these has a value combination cardinality greater than or equal to 50, and the selection order is dependent on cardinality and then inter-attribute independence.

Query Execution

Controlled Query Types. We ran 6 query types over bitmaps built over 2 attributes with uniform random distribution and cardinality 10. The queries are detailed in Table 4.2, and varied in the length of the range. A range of 1, for example, means that the query for that attribute is a point query. Q1 is a point query on both attributes, and Q3 selects one value from the first attribute and five values from the second attribute. Note that in this case, 5 is the worst case scenario; any range bigger than five could take the complement of the non-queried range to answer the query.

       Range Length
Query  A1  A2
Q1     1   1
Q2     1   2
Q3     1   5
Q4     2   2
Q5     2   5
Q6     5   5

Table 4.2: Query Definitions for 2 Attributes with cardinality 10 and uniform random distribution.

Figures 4.6, 4.7, and 4.8 show the query execution time, number of bitmaps involved in answering each query type, and total size of the bitmaps involved in answering the query, respectively. In these figures, the ‘Only 1-Dim’ label refers to the case where only Single Attribute Bitmaps are available, the ‘Only M-Dim’ label refers to the

case where the queries are forced to use Multi-Attribute Bitmaps, and the ‘Proposed’ label refers to the case where the query execution plan is selected considering both indexes available. For queries Q1-Q4, the multi-attribute bitmaps are always selected, and they are significantly faster than the single-attribute options. For queries Q5-Q6, if single and multi-attribute bitmaps are available, the single attribute option is always selected. However, the best option is always slightly faster than the proposed technique, due to the overhead of evaluating and selecting the best option. For these queries, the overhead was on the order of 0.2ms. The proposed technique usually selects the option that involves the shortest original bitmaps required to answer the query. The exception here is Q5. The single dimension index is used because it involves fewer bitmap operations and less total bitmap operand length over the set of operations that need to take place.

Figure 4.6: Query Execution Time for each query type.


Figure 4.7: Number of bitmaps involved in answering each query type.

Query Execution for Point Queries. Figure 4.9 demonstrates the potential query speedup associated with our proposed query processing technique when multi-attribute bitmaps are available to answer queries. The single-D bar shows the average query execution time required to answer 100 point queries using traditional bitmap index query execution, where points are randomly selected from the data set (there is at least one query match). For the multi-attribute test, indexes were selected and generated with a maximum set size of 2. So a suite of bitcolumns representing 2-attribute combinations is available for query execution, as well as the single attribute bitmaps. Each single attribute is represented in exactly one attribute pair. The results show significant speedup in query execution for point queries. Point queries over the uniform, landsat, and hep datasets experience speedups of factors of 5.01, 3.23, and 3.17, respectively.


Figure 4.8: Total size of the bitmaps involved in answering each query type.

Similar results were achieved for the census data set. Indexes were selected without a dimensionality limit and with a cardinality limit of 100. The query execution time using only single attribute bitmaps was 83.01 ms, while the query execution time was only 24.92 ms when both single attribute and multi-attribute bitmaps were available.

Query Execution Over Range Queries. Figure 4.10 shows the impact of using the single-attribute bitmaps instead of the proposed model as the queries trend from being pure point queries to gradually more range-like queries. Average query execution times over 100 queries are shown. The queries are constructed by making the matching range of an attribute either a single value or half of the attribute cardinality. Whether an attribute is a point or a range is randomly determined. The x-axis shows the probability that a given pair of attributes in the query represents a range (e.g., 0.1 means that there is a 90% probability that 2 query attributes are both over single values). The graph shows significant speedup for point queries. Although the

Even though the multi-attribute 129 . The upper portion of the bar shows the size of the suite of 2-attribute bitmaps (each attribute represented exactly once in the suite). This is because there are still some cases where using the multiattribute bitmap is useful.9: Query Execution Times. the multi-attribute alternative maintains a performance advantage until nearly all the query attributes are over ranges. Traditional versus MultiAttribute Query Processing speedup ratio decreases as the queries become more range-like.11 shows bitmap index sizes for each data set. The entire bar represents the space required for both the single and 2-attribute bitmaps. Point Queries. The lower portion of the bar shows the total size of the bitmaps for all of the single attribute indexes.Figure 4. Index Size Figure 4.

the single dimension landsat are covered by 600 bitcolumns. while its 2-attribute counterpart is represented by 3000 bitmaps (30 attribute pairs.10: Query Execution Times as Query Ranges Vary. 100 bitmaps per pair)). the compression associated with the combined bitmaps means that the space overhead for the multi-attribute representations is not as severe as may be expected.g. Landsat Data bitmaps represent many more columns than the single attribute bitmaps(e.Figure 4. 130 .
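The trend in which multi-attribute bitmaps win for point-like queries but lose ground as ranges widen can be captured with a simple bitcolumn-counting cost model. The model below is an assumption introduced for illustration, not the thesis's actual cost formula: it counts how many bitcolumns each plan must read and OR together for one attribute pair.

```python
# Illustrative (assumed) cost model for one query attribute pair.
# width_* = number of matching values for the attribute (1 = point predicate).

def plan_pair(width_a, width_b):
    """Combined plan ORs width_a * width_b pair bitcolumns; the
    single-attribute plan ORs width_a + width_b bitcolumns and then
    ANDs the two partial results. Return the cheaper plan and its cost."""
    combined_cost = width_a * width_b      # one pair bitcolumn per value combo
    single_cost = width_a + width_b        # one bitcolumn per matching value
    if combined_cost <= single_cost:
        return ("pair", combined_cost)
    return ("single", single_cost)
```

Point predicates (width 1) always favor the pair bitmaps, while two wide ranges quickly favor the single-attribute bitmaps, mirroring the crossover seen in Figure 4.10.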

Figure 4.11: Single and Multi-Attribute Bitmap Index Size Comparison

Bitmap Operations versus Bitmap Lengths. Since query execution enhancement is derived from two sources, reduction in bitmap operations and reduction in bitmap length, it is valuable to know the relative impacts of each one. To this end, we tested query execution after performing index selection using different strategies. Table 4.3 shows the query speedup for point queries using different strategies to determine the attributes that should be combined together. The speedup is measured over 100 random point queries taken from HEP data. For each listed strategy, a suite of size-2 attribute sets is determined differently. Randomly selected attributes simply combines random pairs of attributes. The other 2 strategies use Algorithm 3 to generate goodness factors for each combination based on bitmap length saved and weighted by result density. The goodness-using-best strategy greedily selects non-intersecting pairs starting from the top of the list, while the other technique greedily selects non-overlapping pairs from the bottom.
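The greedy pairing step can be sketched as follows. This is a hedged reconstruction from the text: the goodness scores are taken as given (the output the text attributes to Algorithm 3), and the function name is illustrative.

```python
# Greedy selection of a disjoint suite of attribute pairs by goodness score.

def select_pairs(goodness, best_first=True):
    """goodness: {(attr_a, attr_b): score}. Walk the ranking from the best
    end (or the worst end, for the control strategy) and keep each pair
    whose attributes have not yet been used, so every attribute appears
    in at most one combined index."""
    ranked = sorted(goodness, key=goodness.get, reverse=best_first)
    used, suite = set(), []
    for a, b in ranked:
        if a not in used and b not in used:
            suite.append((a, b))
            used.update((a, b))
    return suite
```

Running the same routine with `best_first=False` yields the "selecting worst combinations" control row of Table 4.3, making the two strategies directly comparable.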

   Attribute Combination Strategy                          Query Speedup
 1 Algorithm 1, Selecting Best Beneficial Combinations     3.15
 2 Randomly Selected Attributes                            2.79
 3 Algorithm 1, Selecting Worst Combinations               2.67

Table 4.3: Query Speedup, HEP Data, Using Different Attribute Combination Strategies

5 Conclusion

In this paper, we propose the use of multi-attribute bitmap indexes to speed up the query execution time of point and range queries over traditional bitmap indexes. We introduce an index selection technique in order to determine which attributes are appropriate to combine as multi-attribute bitmaps. The technique is driven by a performance metric that is tied to actual query execution time. It takes into account data distribution and attribute correlations. It can also be tuned to avoid undesirable conditions such as an excessive number of bitmaps or excessive bitmap dimensionality. We introduce a query execution model for when both traditional and multi-attribute bitmaps are available to answer a query. The query execution takes advantage of the multi-attribute bitmaps when it is beneficial to do so, while introducing very little overhead for cases when such bitmaps are not useful. The results show that significant speedup can be achieved simply by combining reasonable attributes together, and that query execution can be further enhanced by careful selection of the indexes. Utilizing our index selection and query execution techniques, we were able to consistently achieve up to 5 times query execution speedup for point queries while introducing less than 1% overhead for those queries where multi-attribute indexing was not useful.

CHAPTER 5

Conclusions

Data retrieval for a given application is characterized by the query types, query patterns over time, and the distribution and characteristics of the data itself. In order to maximize query performance, the query processing and access structures should be tailored to match the characteristics of the application. While resolving queries, the possible query execution plans evaluated by a data management system are limited by the available indexes.

The optimization query framework presented herein describes an I/O-optimal query processing technique to process any query within the broad class of queries that can be stated as a convex optimization query. For such queries, the query processing follows an access structure traversal path that matches the function being optimized while respecting additional problem constraints.

The online high-dimensional index recommendation system provides a cost-based technique to identify access structures that will be beneficial for query processing over a query workload. It considers both the query patterns and the data distribution to determine the benefit of different alternatives. Since workloads can shift over time, a method to determine when the current index set is no longer effective is also provided.

The multi-attribute bitmap index work presents both query processing and index selection for the traditionally single-attribute bitmap index. Query processing selects a query execution plan to reduce the overall number of operations. The index selection task finds attribute combinations that lead to the greatest expected query execution improvement.

The primary end goal of the research presented in this thesis is to improve query execution time. Both the run-time query processing and design-time access methods are targeted as potential sources of query time improvement. By ensuring that the run-time traversal and processing of available access structures matches the characteristics of a given query, and that the access structures available match the overall query workload and data patterns, improved query execution time is achieved.

