This action might not be possible to undo. Are you sure you want to continue?

BooksAudiobooksComicsSheet Music### Categories

### Categories

### Categories

### Publishers

Scribd Selects Books

Hand-picked favorites from

our editors

our editors

Scribd Selects Audiobooks

Hand-picked favorites from

our editors

our editors

Scribd Selects Comics

Hand-picked favorites from

our editors

our editors

Scribd Selects Sheet Music

Hand-picked favorites from

our editors

our editors

Top Books

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Audiobooks

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Comics

What's trending, bestsellers,

award-winners & more

award-winners & more

Top Sheet Music

What's trending, bestsellers,

award-winners & more

award-winners & more

P. 1

Gibas%20Michael%20A|Views: 41|Likes: 0

Published by Vikram Cheri

See more

See less

https://www.scribd.com/doc/50527613/Gibas-20Michael-20A

03/11/2011

text

original

- 1.1 Overall Technical Motivation
- 1.2 Novel Database Management Applications
- Modeling and Processing Optimization Queries
- 2.1 Introduction
- 2.2 Related Work
- 2.3 Query Model Overview
- 2.3.1 Background Deﬁnitions
- 2.3.2 Common Query Types Cast into Optimization Query
- 2.3.3 Optimization Query Applications
- 2.3.4 Approach Overview
- 2.4 Query Processing Framework
- 2.4.1 Query Processing over Hierarchical Access Structures
- 2.4.2 Query Processing over Non-hierarchical Access Struc-
- 2.4.3 Non-Covered Access Structures
- 2.5 I/O Optimality
- contour
- 2.6 Performance Evaluation
- 8-D Color Histogram, 100k pts
- Random, 50k Points
- 100k Points
- 2.7 Conclusions
- 3.1 Introduction
- 3.2 Related Work
- 3.2.1 High-Dimensional Indexing
- 3.2.2 Feature Selection
- 3.2.3 Index Selection
- 3.2.4 Automatic Index Selection
- 3.3 Approach
- 3.3.1 Problem Statement
- 3.3.2 Approach Overview
- 3.3.3 Proposed Solution for Index Selection
- 3.3.4 Proposed Solution for Online Index Selection
- 3.3.5 System Enhancements
- 3.4 Empirical Analysis
- 3.4.1 Experimental Setup
- 3.4.2 Experimental Results
- Dataset, synthetic Query Workload
- Dataset, hr Query Workload
- 3.5 Conclusions
- Multi-Attribute Bitmap Indexes
- 4.1 Introduction
- 4.2 Background
- 2M entries versus
- 4.3 Proposed Approach
- 4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc
- 4.3.2 Query Execution with Multi-Attribute Bitmaps
- Conclusions
- BIBLIOGRAPHY

**APPLICATION-DRIVEN PROCESSING AND RETRIEVAL
**

DISSERTATION

Presented in Partial Fulﬁllment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Michael Albert Gibas, B.S., M.S.

* * * * *

The Ohio State University

2008

Dissertation Committee:

Hakan Ferhatosmanoglu, Adviser

Hui Fang

Atanas Rountev

Approved by

Adviser

Graduate Program in

Computer Science and

Engineering

ABSTRACT

The proliferation of massive data sets across many domains and the need to gain

meaningful insights from these data sets highlight the need for advanced data re-

trieval techniques. Because I/O cost dominates the time required to answer a query,

sequentially scanning the data and evaluating each data object against query crite-

ria is not an eﬀective option for large data sets. An eﬀective solution should require

reading as small a subset of the data as possible and should be able to address general

query types. Access structures built over single attributes also may not be eﬀective

because 1) they may not yield performance that is comparable to that achievable by

an access structure that prunes results over multiple attributes simultaneously and 2)

they may not be appropriate for queries with results dependent on functions involv-

ing multiple attributes. Indexing a large number of dimensions is also not eﬀective,

because either too many subspaces must be explored or the index structure becomes

too sparse at high dimensionalities. The key is to ﬁnd solutions that allow for much of

the search space to be pruned while avoiding this ‘curse of dimensionality’. This the-

sis pursues query performance enhancement using two primary means 1) processing

the query eﬀectively based on the characteristics of the query itself and 2) physically

organizing access to data based on query patterns and data characteristics. Query

performance enhancements are described in the context of several novel applications

ii

including 1) Optimization Queries, which presents an I/O-optimal technique to an-

swer queries when the objective is to maximize or minimize some function over the

data attributes, 2) High-Dimensional Index Selection, which oﬀers a cost-based ap-

proach to recommend a set of low dimensional indexes to eﬀectively address a set

of queries, and 3) Multi-Attribute Bitmap Indexes, which describes extensions to a

traditionally single-attribute query processing and access structure framework that

enables improved query performance.

iii

ACKNOWLEDGMENTS

First of all, I would like to thank my wife, Moni Wood, for her love and support.

She has made many sacriﬁces and expended much eﬀort to help make this dream a

reality. I would not have been to complete this dissertation without her.

I also would not have been able to accomplish this without the help of my family.

I thank my mother Marilyn, my sister Susan and my grandmother Margaret for their

support. I thank my late father Murray for providing an example to which I have

always aspired. Moni’s family was also understanding and supportive of the travails

of Ph.D research.

I want to express my appreciation to my adviser, Hakan Ferhatosmanoglu, for

his guidance and encouragement over the years. He has been generous with his

time, eﬀort, and advice during my graduate studies. I am thankful for the technical

discussions we have had and the opportunities he has given me to work on interesting

projects.

Many people in the department have helped me to get to this point. I speciﬁcally

want to thank my dissertation committee members, Dr. Atanas Rountev and Dr.

Hui Fang, for their eﬀorts and valuable comments on my research. I wish to thank

Dr. Bruce Weide for his help and support during my studies, Dr. Rountev for my

early exposure to research and paper writing, and Dr. Jeﬀ Daniels for giving me the

opportunity to work on an interdisciplinary geo-physics project.

iv

Many friends and members of the Database Research Group provided valuable

guidance, discussions, and critiques over the past several years. I want to especially

thank Guadalupe Canahuate for her help and friendship during this time. Also thanks

to her family for providing enough Dominican coﬀee to get through all the paper

deadlines.

My Ph.D. research was supported by NSF grant NSF IIS -0546713.

v

VITA

May 10, 1969 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Omaha, Nebraska

1993 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Electrical Engineering,

Ohio State University, Columbus Ohio

2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Computer Science and Engineer-

ing,

Ohio State University, Columbus Ohio

1989-1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Electrical Engineering Co-op,

British Petroleum,

Various Locations

1994-2002 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Project Manager,

Acquisition Logistics Engineering,

Worthington, Ohio

2003-present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Assistant,

Oﬃce Of Research,

Department of Computer Science and

Engineering,

Department of Geology

Ohio State University,

Columbus, Ohio

PUBLICATIONS

Research Publications

Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Online Index Recom-

mendations for High-Dimensional Databases Using Query Workloads. IEEE Trans-

actions on Knowledge and Data Engineering (TDKE) Journal, February 2008 (Vol.

20, no.2), pp. 246-260

vi

Michael Gibas, Hakan Ferhatosmanoglu, High-Dimensional Indexing, Encyclopedia of

Geographical Information Science, (E-GIS)

Michael Gibas, Ning Zheng, and Hakan Ferhatosmanoglu, A General Framework for

Modeling and Processing Optimization Queries, 33rd International Conference on

Very Large Data Bases (VLDB 07), Vienna, Austria, September, 2007, pp. 1069-

1080

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu, Update Conscious

Bitmap Indexes. International Conference on Scientiﬁc and Statistical Database

Management (SSDBM), Banﬀ, Canada, 2007/July

Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu, Indexing Incomplete

Databases, International Conference on Extending Database Technology (EDBT),

Munich, Germany, 2006/March, pp. 884-901

Michael Gibas, Developing Indices for High Dimensional Databases Using Query Ac-

cess Patterns, Masters Thesis, June 2004

Atanas Rountev, Scott Kagan, and Michael Gibas, Static and Dynamic Analysis of

Call Chains in Java, ACM SIGSOFT International Symposium on Software Testing

and Analysis (ISSTA’04), pages 1-11, July 2004

Fatih Altiparmak, Michael Gibas, Hakan Ferhatosmanoglu, Relationship Preserving

Feature Selection for Unlabelled Time-Series, Ohio State Technical Report, OSU-

CISRC-10/07-TR74, 14 pp.

Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Indexes for Databases

with Missing Data, Ohio State Technical Report, OSU-CISRC-4/06-TR45, 15 pp.

Guadalupe Canahuate, Tan Apaydin, Michael Gibas, Hakan Ferhatosmanoglu, Fast

Semantic-based Similarity Searches in Text Repositories, Conference on Using Meta-

data to Manage Unstructured Text, LexisNexis, Dayton, OH, 2005/October

Atanas Rountev, Scott Kagan, and Michael Gibas, Evaluating the Imprecision of

Static Analysis, ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Soft-

ware Tools and Engineering (PASTE’04), pages 14-16, June 2004

FIELDS OF STUDY

vii

Major Field: Computer Science and Engineering

viii

TABLE OF CONTENTS

Page

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Chapters:

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Overall Technical Motivation . . . . . . . . . . . . . . . . . . . . . 1

1.2 Novel Database Management Applications . . . . . . . . . . . . . . 2

2. Modeling and Processing Optimization Queries . . . . . . . . . . . . . . 6

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Query Model Overview . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Background Deﬁnitions . . . . . . . . . . . . . . . . . . . . 10

2.3.2 Common Query Types Cast into Optimization Query Model 13

2.3.3 Optimization Query Applications . . . . . . . . . . . . . . . 15

2.3.4 Approach Overview . . . . . . . . . . . . . . . . . . . . . . 16

2.4 Query Processing Framework . . . . . . . . . . . . . . . . . . . . . 19

2.4.1 Query Processing over Hierarchical Access Structures . . . . 21

2.4.2 Query Processing over Non-hierarchical Access Structures . 26

2.4.3 Non-Covered Access Structures . . . . . . . . . . . . . . . . 28

2.5 I/O Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 33

2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

ix

3. Online Index Recommendations for High-Dimensional Databases using

Query Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.2.1 High-Dimensional Indexing . . . . . . . . . . . . . . . . . . 54

3.2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2.3 Index Selection . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.2.4 Automatic Index Selection . . . . . . . . . . . . . . . . . . . 57

3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 57

3.3.2 Approach Overview . . . . . . . . . . . . . . . . . . . . . . 59

3.3.3 Proposed Solution for Index Selection . . . . . . . . . . . . 61

3.3.4 Proposed Solution for Online Index Selection . . . . . . . . 71

3.3.5 System Enhancements . . . . . . . . . . . . . . . . . . . . . 76

3.4 Empirical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 78

3.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 80

3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4. Multi-Attribute Bitmap Indexes . . . . . . . . . . . . . . . . . . . . . . . 98

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries 110

4.3.2 Query Execution with Multi-Attribute Bitmaps . . . . . . . 115

4.3.3 Index Updates . . . . . . . . . . . . . . . . . . . . . . . . . 120

4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 121

4.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 123

4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

x

LIST OF FIGURES

Figure Page

2.1 Sample Optimization Objective Function and Constraints F(

θ, A) :

(x −1)

2

+(y −2)

2

, o(min), g

1

( α, A) : y −6 ≤ 0, g

2

( α, A) : −y +3 ≤ 0,

h

1

(

β, A) : x −5 = 0, k = 1 . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2 Optimization Query System Functional Block Diagram . . . . . . . . 20

2.3 Example of Hierarchical Query Processing . . . . . . . . . . . . . . . 25

2.4 Cases of MBRs with respect to R and optimal objective function contour 30

2.5 Cases of partitions with respect to R and optimal objective function

contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.6 Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,

100k pts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.7 CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,

100k pts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.8 Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts 38

2.9 Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree,

8-D Color Histogram, 100k pts . . . . . . . . . . . . . . . . . . . . . . 40

2.10 Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform

Random, 50k Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.11 Pages Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram,

100k Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

xi

3.1 Index Selection Flowchart . . . . . . . . . . . . . . . . . . . . . . . . 61

3.2 Dynamic Index Analysis Framework . . . . . . . . . . . . . . . . . . . 73

3.3 Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin,

Naive, and the Proposed Index Selection Technique using Relaxed Con-

straints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

3.4 Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, synthetic Query Workload . . . . . . . . . . . . . . . . . . . 85

3.5 Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, clinical Query Workload . . . . . . . . . . . . . . . . . . . . 86

3.6 Baseline Comparative Cost of Adaptive versus Static Indexing, stock

Dataset, hr Query Workload . . . . . . . . . . . . . . . . . . . . . . . 87

3.7 Comparative Cost of Online Indexing as Support Changes, random

Dataset, synthetic Query Workload . . . . . . . . . . . . . . . . . . . 89

3.8 Comparative Cost of Online Indexing as Support Changes, stock Dataset,

hr Query Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

3.9 Comparative Cost of Online Indexing as Major Change Threshold

Changes, random Dataset, synthetic Query Workload . . . . . . . . . 92

3.10 Comparative Cost of Online Indexing as Major Change Threshold

Changes, stock Dataset, hr Query Workload . . . . . . . . . . . . . . 93

3.11 Comparative Cost of Online Indexing as Multi-Dimensional Histogram

Size Changes, random Dataset, synthetic Query Workload . . . . . . 94

4.1 A WAH bit vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

4.2 Query Execution Example over WAH compressed bit vectors. . . . . 103

4.3 Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps,

2M entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.4 Query Execution Times vs Number of Attributes queried, Point Queries,

2M entries versus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

xii

4.5 Simple multiple attribute bitmap example showing the combination

two single attribute bitmaps. . . . . . . . . . . . . . . . . . . . . . . . 108

4.6 Query Execution Time for each query type. . . . . . . . . . . . . . . 125

4.7 Number of bitmaps involved in answering each query type. . . . . . . 126

4.8 Total size of the bitmaps involved in answering each query type. . . . 127

4.9 Query Execution Times, Point Queries, Traditional versus Multi-Attribute

Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

4.10 Query Execution Times as Query Ranges Vary, Landsat Data . . . . 129

4.11 Single and Multi-Attribute Bitmap Index Size Comparison . . . . . . 130

xiii

CHAPTER 1

Introduction

If Edison had a needle to ﬁnd in a haystack, he would proceed at once

with the diligence of the bee to examine straw after straw until he found

the object of his search... I was a sorry witness of such doings, knowing

that a little theory and calculation would have saved him ninety per cent

of his labor. - Nikola Tesla, 1931

In order to make better use of the massive quantities of data being generated by

and for modern applications, it is necessary to ﬁnd information that is in some respect

interesting within the data. Two primary challenges of this task are specifying the

meaning of interesting within the context of an application and searching the data

set to ﬁnd the interesting data eﬃciently. In the case of our proverbial haystack, we

can think of each piece of straw as a data object and needles as a data objects that

are interesting. While we could guarantee that we ﬁnd the most interesting needle by

examining each and every piece of straw, we would be much better served by limiting

our search to a small region within the haystack.

1.1 Overall Technical Motivation

Analogous to our physical haystack example, the query response time during data

exploration tasks using a database system is aﬀected both by the physical organization

1

of the data and by the way in which the data is accessed. As such, query performance

can be enhanced by traversing the data more eﬀectively during query resolution or

by organizing access to the data so that it is more suitable for the query. Depending

on the application, queries can vary widely on a per query basis. Therefore, it is

highly desirable to process generic queries at run-time in a way that is eﬀective for

each particular query. Physically reorganizing data or data access structures is time

consuming and therefore should be performed for a set of queries rather than by

trying to optimize an organization for a single query.

This thesis targets query performance enhancement through two sources 1) run-

time query processing such that the search paths and intermediate operations during

query resolution are appropriate for a given generic query, 2) data organization such

that the access structures available during query resolution are appropriate for the

overall query patterns and data distribution and allow for signiﬁcant data space prun-

ing during query resolution. Query performance enhancement is explored using three

introduced applications, motivated and described in the next section.

1.2 Novel Database Management Applications

Optimization Queries. Exploration of image, scientiﬁc, and business data usu-

ally requires iterative analyses that involve sequences of queries with varying parame-

ters. Besides traditional queries, similarity-based analyses and complex mathematical

and statistical operators are used to query such data [41]. Many of these data ex-

ploration queries can be cast as optimization queries where a linear or non-linear

function over object attributes is minimized or maximized (our needle). Addition-

ally, the object attributes queried, function parametric values, and data constraints

2

can vary on a per-query basis. Because scientiﬁc data exploration and business deci-

sion support systems rely heavily on function optimization tasks, and because such

tasks encompass such disparate query types with varying parameters, these applica-

tion domains can beneﬁt from a general model-based optimization framework. Such

a framework should meet the following requirements: (i) have the expressive power

to support solution of each query type, (ii) take advantage of query constraints to

prune search space, and (iii) maintain eﬃcient performance when the scoring criteria

or optimization function changes.

A general optimization model and query processing framework are proposed for

optimization queries, an important and signiﬁcant subset of data exploration tasks.

The model covers those queries where some function is being minimized or maximized

over object attributes, possibly within a set of constraints. The query processing

framework can be used in those cases where the function being optimized and the

set of constraints is convex. For access structures where the underlying partitions

are convex, the query processing framework is used to access the most promising

partitions and data objects and is proven to be I/O-optimal.

High-Dimensional Index Selection. With respect to the physical organization

of the data, access structures or indexes that cover query attributes can be used

to prune much of the data space and greatly reduce the required number of page

accesses to answer the query. However, due to the oft-cited curse of dimensionality,

indexes built over many dimensions actually perform worse than sequential scan. This

problem can be addressed by using sets of lower dimensional indexes that eﬀectively

answer a large portion of the incoming queries. This approach can work well if the

characteristics of future queries are well known, but in general this is not the case,

3

and an index set built based on historical queries may be completely ineﬀective for

new queries.

In the case of physical organization, query performance is aﬀected by creating a

number of low dimension indexes using the historical and incoming query workloads,

and data statistics. Historical query patterns are data mined to ﬁnd those attribute

sets that are frequently queried together. Potential indexes are evaluated to derive

an initial index set recommendation that is eﬀective in reducing query cost for a large

portion of queries such that the set of indexes is within some indexing constraint.

A set of indexes is obtained that in eﬀect reduces the dimensionality of the query

search space, and is eﬀective in answering queries. A control feedback mechanism

is incorporated so that the performance of newly arriving queries can be monitored,

and if the current index set is no longer eﬀective for the current query patterns, the

index set can be changed. This ﬂexible framework that can be tuned to maximize

indexing accuracy or indexing speed which allows it to be used for the initial static

index recommendation and also online dynamic index recommendation.

Multi-Attribute Bitmap Indexes. The nature of the database management

system used and the way in which data is physically accessed also aﬀects query execu-

tion times. Bitmap indexes have been shown to be eﬀective as an data management

system for many domains such a data warehousing and scientiﬁc applications. This

owes to the fact that they are a column-based storage structure and only the data

needed to address a query is ever retrieved from disk and that their base operations

are performed using hardware supported fast bitwise operations. Static bitmap in-

dexes have traditionally built over each attribute individually. The query model for

selection queries is simple and involves ORing together bitcolumns that match query

4

criteria within an attribute and ANDing inter-attribute intermediate results. As a re-

sult, the bitmap index structure is limited to those domains that it is naturally suited

for without change. As well query resolution does not take advantage of potential

multi-attribute pruning and possible reductions in the number of binary operations.

Multi-attribute bitmap indexes are introduced to extend the functionality and

applicability of traditional bitmap indexes. For many queries, query execution times

are improved by reducing size of bitmaps used in query execution and by reducing the

number of binary operations required to resolve the query without excessive overhead

for queries that do not beneﬁt from multi-attribute pruning. A query processing

model where multi-attribute bitmaps are available for query resolution is presented.

Methods to determine attribute combination candidates are provided.

5

CHAPTER 2

Modeling and Processing Optimization Queries

The most exciting phrase to hear in science, the one that heralds new

discoveries, is not ‘Eureka!’ (I found it!) but ‘That’s funny ...’ - Isaac

Asimov

One challenge associated with data management is determining a cost-eﬀective

way to process a given query. Given the time expense of I/O, minimizing the I/O

required to answer a query is a primary strategy in query optimization. Scientiﬁc

discovery and business intelligence activities are frequently exploring for data objects

that are in some way special or unexpected. Such data exploration can be performed

by iteratively ﬁnding those data objects that minimize or maximize some function over

the data attributes. While the answer to a given query can be found by computing

the function for all data objects and ﬁnding those objects that yield the most ex-

treme values, this approach may not be feasible for large data sets. The optimization

query framework presented in this chapter provides a method to generically handle

an important broad class of queries, while ensuring that, given an access structure,

the query is I/O-optimally processed over that structure.

6

2.1 Introduction

Motivation and Goal Optimization queries form an important part of real-

world database system utilization, especially for information retrieval and data anal-

ysis. Exploration of image, scientiﬁc, and business data usually requires iterative

analyses that involve sequences of queries with varying parameters. Besides tradi-

tional queries, similar-ity-based analyses and complex mathematical and statistical

operators are used to query such data [41]. Many of these data exploration queries

can be cast as optimization queries where a linear or non-linear function over object

attributes is minimized or maximized. Additionally, the object attributes queried,

function parametric values, and data constraints can vary on a per-query basis.

There has been a rich set of literature on processing speciﬁc types of queries, such

as the query processing algorithms for Nearest Neighbors (NN) [48] and its variants

[37], top-k queries [21, 51], etc. These queries can all be considered to be a type of

optimization query with diﬀerent objective functions. However, they either only deal

with speciﬁc function forms or require strict properties for the query functions.

Because scientiﬁc data exploration and business decision support systems rely

heavily on function optimization tasks, and because such tasks encompass such dis-

parate query types with varying parameters, these application domains can beneﬁt

from a general model-based optimization framework. Such a framework should meet

the following requirements: (i) have the expressive power to support existing query

types and the power to formulate new query types, (ii) take advantage of query con-

straints to proactively prune search space, and (iii) maintain eﬃcient performance

when the scoring criteria or optimization function changes.

7

Despite an extensive list of techniques designed speciﬁcally for diﬀerent types

of optimization queries, a uniﬁed query technique for a general optimization model

has not been addressed in the literature. Furthermore, application of specialized

techniques requires that the user know of, possess, and apply the appropriate tools

for the speciﬁc query type. We propose a generic optimization model to deﬁne a wide

variety of query types that includes simple aggregates, range and similarity queries,

and complex user-deﬁned functional analysis. Since the queries we are considering

do not have a speciﬁc objective function but only have a “model” for the objective

(and constraints) with parameters that vary for diﬀerent users/queries/iterations, we

refer to them as model-based optimization queries. In conjunction with a technique to

model optimization query types, we also propose a query processing technique that

optimally accesses data and space partitioning structures to answer the query. By

applying this single optimization query model with I/O-optimal query processing,

we can provide the user with a ﬂexible and powerful method to deﬁne and execute

arbitrary optimization queries eﬃciently. Example optimization query applications

are provided in Section 2.3.3.

Contributions The primary contributions of this work are listed as follows.

• We propose a general framework to execute convex model-based optimization

queries (using linear and non-linear functions) and for which additional con-

straints over the attributes can be speciﬁed without compromising query accu-

racy or performance.

• We deﬁne query processing algorithms for convex model-based optimization

queries utilizing data/space partitioning index structures without changing the

8

index structure properties and the ability to eﬃciently address the query types

for which the structures were designed.

• We prove the I/O optimality for these types of queries for data and space

partitioning techniques when the objective function is convex and the feasible

region deﬁned by the constraints is convex (these queries are deﬁned as CP

queries in this paper).

• We introduce a generic model and query processing framework that optimally

addresses existing optimization query types studied in the research literature as

well as new types of queries with important applications.

2.2 Related Work

There are few published techniques that cover some instances of model-based

optimization queries. One is the “Onion technique” [14], which deals with an un-

constrained and linear query model. An example linear model query selects the best

school by ranking the schools according to some linear function of the attributes of

each school. The solution followed in the Onion technique is based on constructing

convex hulls over the data set. It is infeasible for high dimensional data because the

computation of convex hulls is exponential with respect to the number of dimensions

and is extremely slow. It is also not applicable to constrained linear queries because

convex hulls need to be recomputed for each query with a diﬀerent set of constraints.

The computation of convex hulls over the whole data set is expensive and in the case

of high-dimensional data, infeasible in design time.

The Prefer technique [32] is used for unconstrained and linear top-k query types.

The access structure is a sorted list of records for an arbitrary linear function. Query

9

processing is performed for some new linear preference function by scanning the sorted

list until a record is reached such that no record below it could match the new query.

This avoids the costly build time associated with the Onion technique, but does not

provide a guarantee on minimum I/O, and still does not incorporate constraints.

Boolean + Ranking [60] oﬀers a technique to query a database to optimize boolean

constrained ranking functions. However, it is limited to boolean rather than general

convex constraints, addresses only single dimension access structures (i.e. can not

optimize on multiple dimensions simultaneously), and therefore largely uses heuristics

to determine a cost eﬀective search strategy.

While the Onion and Prefer techniques organize data to eﬃciently answer a speciﬁc

type of query (LP), our technique organizes data retrieval to answer more general

queries. In contrast with the other techniques listed, our proposed solution is proven

to be I/O-optimal for both hierarchical and space partitioned access structures with

convex partition boundaries, covers a broader spectrum of optimization queries, and

requires no additional storage space.

2.3 Query Model Overview

2.3.1 Background Deﬁnitions

Let Re(a

1

, a

2

, . . . , a

n

) be a relation with attributes a

1

, a

2

, . . . , a

n

. Without loss

of generality, assume all attributes have the same domain. Denote by A a subset

of attributes of Re. Let g

i

( α, A), 1 ≤ i ≤ u, h

j

(

β, A), 1 ≤ j ≤ v and F(

θ, A)

be certain functions over attributes in A, where u and v are positive integers and

θ = (θ

1

, θ

2

, . . . , θ

m

) is a vector of parameters for function F, and α and

β are vector

parameters for the constraint functions.

10

Deﬁnition 1 (Model-Based Optimization Query) Given a relation Re(a

1

, a

2

, . . . , a

n

),

a model-based optimization query Q is given by (i) an objective function F(

θ, A) , with

optimization objective o (min or max), (ii) a set of constraints (possibly empty) spec-

iﬁed by inequality constraints g

i

( α, A) ≤ 0, 1 ≤ i ≤ u (if not empty) and equality

constraints h

j

(

**β, A) = 0, 1 ≤ j ≤ v (if not empty), and (iii) user/query adjustable
**

objective parameters

θ, constraint parameters α and

β, and answer set size integer k.

Figure 2.1 shows a sample objective function with both inequality and equality

constraints. The objective function ﬁnds the nearest neighbor to point a (1,2). The

constraints limit the feasible region to the line passing through x = 5, where 3 ≤ y ≤

6. The points b and d represent the best and worst point within the constraints for

the objective function respectively. Point c could be the actual point in the database

that meets the constraints and minimizes the objective function.

The answer of a model-based optimization query is a set of tuples with maxi-

mum cardinality k satisfying all the constraints such that there exists no other tuple

with smaller (if minimization) or larger (if maximization) objective function value

F satisfying all constraints as well. This deﬁnition is very general, and almost any

type of query can be considered as a special case of model-based optimization query.

For instance, NN queries over an attribute set A can be considered as model-based

optimization queries with F(

**θ, A) as the distance function (e.g., Euclidean) and the
**

optimization objective is minimization. Similarly, top-k queries, weighted NN queries

and linear/non-linear optimization queries can all be considered as speciﬁc model-

based optimization queries. Without loss of generality, we will consider minimization

as the objective optimization throughout the rest of this chapter.

11

Figure 2.1: Sample Optimization Objective Function and Constraints F(

θ, A) : (x −

1)

2

+(y −2)

2

, o(min), g

1

( α, A) : y −6 ≤ 0, g

2

( α, A) : −y +3 ≤ 0, h

1

(

β, A) : x−5 = 0,

k = 1

Deﬁnition 2 (Convex Optimization (CP) Query) A model-based optimization

query Q is a Convex Optimization (CP) query if (i) F(

θ, A) is convex, (ii) g

i

( α, A), 1 ≤

i ≤ u (if not empty) are convex, and (iii) h

j

(

**β, A), 1 ≤ j ≤ v (if not empty) are linear
**

or aﬃne.

Notice that the deﬁnition of a CP query does not have any assumptions on the

speciﬁc form of the objective function. The only assumptions are that the queries can

be formulated into convex functions and the feasible regions deﬁned by the constraints

of the queries are convex (e.g., polyhedron regions). Therefore, users can ask any form

of queries with any coeﬃcients as long as these assumptions hold. The conditions (i),

(ii) and (iii) form a well known type of problem in Operations Research literature -

Convex Optimization problems. A function is convex if it is second diﬀerentiable and

12

its Hessian matrix is positive deﬁnite. More intuitively, if one travels in a straight

line from inside a convex region to outside the region, it is not possible to re-enter

the convex region.

2.3.2 Common Query Types Cast into Optimization Query

Model

Although appearing to be restricted in functions F, g

i

and h

j

, the set of CP prob-

lems is a superset of all least square problems, linear programming problems (LP)

and convex quadratic programming problems (QP) [27]. Therefore, they cover a wide

variety of linear and non-linear queries, including NN queries, top-k queries, linear

optimization queries, and angular similarity queries. The formulation of some com-

mon query types as CP optimization queries are listed and discussed below.

Euclidean Weighted Nearest Neighbor Queries - The Weighted Nearest Neigh-

bor query asking NN for a point (a

0

1

, a

0

2

, . . . , a

0

n

) can be considered as an optimization

query asking for a point (a

1

, a

2

, . . . , a

n

) such that an objective function is minimized.

For Euclidean WNN the objective function is

WNN(a

1

, a

2

, . . . , a

n

) =

w

1

(a

1

−a

0

1

)

2

+ w

2

(a

2

−a

0

2

)

2

+ . . . + w

m

(a

n

−a

0

n

)

2

,

where w

1

, w

2

, . . . , w

n

> 0 are the weights and can be diﬀerent for each query. Tradi-

tional Nearest Neighbor queries are the subset of weighted nearest neighbor queries

where the weights are all equal to 1.

Linear Optimization Queries - The objective function of a Linear Optimization

query is

L(a

1

, a

2

, . . . , a

n

) = c

1

a

1

+ c

2

a

2

+ . . . + c

n

a

n

13

where (c

1

, c

2

, . . . , c

n

) ∈ R

n

are coeﬃcients and can be diﬀerent for diﬀerent queries,

and the constraints form a polyhedron. Since linear functions are convex and poly-

hedrons are convex, linear optimization queries are also special cases of CP queries

with a parametric function form.

Top-k Queries - The objective functions of top-k queries are score (or ranking) func-

tions over the set of attributes. The common score functions are sum and average,

but could be any arbitrary function over the attributes. Clearly, sum and average

are special cases of linear optimization queries. If the ranking function in question is

convex, the top-k query is a CP query.

Range Queries - A range query asks for all data (a

1

, a

2

, . . . , a

n

) s.t. l

i

≤ a

i

≤ u

i

, for

some (or all) dimensions i, where [l

i

, u

i

]

n

is the range for dimension i. Constructing an

objective function as f(a

1

, a

2

, . . . , a

n

) = C, where C is any constant, and constraints

g

i

= l

i

−a

i

<= 0, g

n+i

= a

i

−u

i

<= 0, i = 1, 2, · · · , n

Any points that are within the constraints will get the objective value C. Points

that are not within the constraints are pruned by those constraints. Those points

that remain all have an objective value of C and because they all have the same

maximum(minimum) objective value, all form the solution set of the range query.

Therefore, range queries can be formed as special cases of CP queries with constant

objective functions applying the range limits as constraints. Also note that our model

can not only deal with the ‘traditional’ hyper-rectangular range queries as described

above, but also ‘irregular’ range queries such as l ≤ a

1

+a

2

≤ u and l ≤ 3a

1

−2a

2

≤ u

14

2.3.3 Optimization Query Applications

Any problem that can be rewritten as a convex optimization problem can be

cast to our model, and solved using our framework. In cases where the optimiza-

tion problem or objective function parameters are not known in advance, this generic

framework can be used to solve the problem by eﬃciently accessing available access

structures. We provide a short list of potential optimization query applications that

could be relevant during scientiﬁc data exploration or business decision support where

the problem being optimized is continuously evolving.

Weighted Constrained Nearest Neighbors - Weighted nearest neighbor queries

give the ability to assign diﬀerent weights for diﬀerent attribute distances for nearest

neighbor queries. Weighted Constrained Nearest Neighbors (WCNN) ﬁnd the clos-

est weighted distance object that exists within some constrained space. This type

of query would be applicable in situations where attribute distances vary in impor-

tance and some constraints to the result set are known. A potential application is

clinical trial patient matching where an administrator is looking to ﬁnd the best can-

didate patients to ﬁll a clinical trial. Some hard exclusion criteria is known and some

attributes are more important than others with respect to estimating participation

probability and suitability.

kNN with Adaptive Scoring - In some situations we may have an objective func-

tion that changes based on the characteristics of a population set we are trying to

ﬁll. In such a case, we will change the nearest neighbor scoring weights dependent on

the current population set. As an example, again consider the patient clinical trial

matching application where we want to maintain a certain statistical distribution over

the trial participants. In a case where we want half of the participants to be male and

15

half female, we can adjust weights of the objective optimization function to increase

the likelihood that future trial candidates will match the currently underrepresented

gender.

Queries over Changing Attributes - The attributes involved in optimization

queries can vary based on the iteration of the query. For example, a series of opti-

mization queries may search a stock database for maximum stock gains over diﬀerent

time intervals. The attributes involved in each query will be diﬀerent.

Weights, constraints, functional attributes, and optimization functions themselves

can all change on a per-query basis. A database system that can eﬀectively handle

the potential variations in optimization queries will beneﬁt data exploration tasks.

In the examples listed above, each query or each member of a set of queries can

be rewritten as an optimization query in our model. This demonstrates the power

and ﬂexibility that the user has to deﬁne data exploration queries and the examples

represent a small subset of the query types that are possible.

2.3.4 Approach Overview

We propose the use of Convex Optimization (CP) in order to traverse access

structures to ﬁnd optimal data objects. The goal in CP is optimize (minimize or

maximize) some convex function over data attributes. The solution can optionally

be subject to some convex criteria over the data attributes. Additionally, the data

attributes may be subject to lower and upper bounds.

All of the discussed applications can be stated as an objective function with a

objective of maximization or minimization. We can solve a continuous CP problem

over a convex partition of space by optimizing the function of interest within the

16

constraints of that space. Most access structures are built over convex partitions.

Particularly common partitions are Minimum Bounding Rectangles (MBRs). Rect-

angular partitions can be represented by lower and upper bounds over data attributes

and can be directly addressed by the CP problem bounds. Other convex partitions

can be addressed with data attribute constraints (such as x +y < 5). If the problem

under analysis has constraints in addition to the constraints imposed by the partition

boundaries, they can also be added to the CP problem as constraints.

Given the function, partition constraints, and problem constraints, we can use

CP to ﬁnd the optimal answer for the function within the intersection of the problem

constraints and partition constraints. If no solution exists (problem constraints and

partition constraints do not intersect), the partition is found to be infeasible for the

problem. The partition and its children can be eliminated from further consideration.

Given a function, problem constraints, and a set of partitions, the partition the yields

the best CP result with respect to the function under analysis is the one that contains

some space that optimizes the problem under analysis better than the other partitions.

These partition functional values can be used to order partitions according to how

promising they are with respect to optimizing the function within problem constraints.

With our technique we endeavor to minimize the I/O operations and access struc-

ture search computations required to perform arbitrary model-based optimization

queries. The access structure partitions to search and evaluate during query process-

ing are determined by solving CP problems that incorporate the objective function,

model constraints, and partition constraints. The solution of these CP problems al-

lows us to prune partitions and search paths that can not contain an optimal answer

during query processing.

17

The CP literature discusses the eﬃciency of CP problem solution. In [42], it is

stated that a CP problem can be very eﬃciently solved with polynomial-time algo-

rithms. The readers can refer to [43] for a detailed review on interior-point polynomial

algorithms for CP problems. Current implementations of CP algorithms can solve

problems with hundreds of variables and constraints in very short times on a personal

computer [38]. CP problems can be so eﬃciently solved that Boyd and Vandenberghe

stated “if you formulate a practical problem as a convex optimization problem, then

you have solved the original problem.” ([10] page 8). We are in eﬀect replacing the

bound computations from existing access structure distance optimization algorithms

with a CP problem that covers the objective function and constraints. Although

such computations can be more complex, we found very little time diﬀerence in the

functions we examined.

We focus on query processing techniques by presenting a technique for general CP

objective functions, which can be applied to existing space or data partitioning-based

indices without losing the capabilities of those indexing techniques to answer other

types of queries. The basic idea of our technique is to divide the query processing into

two phases: First, solve the CP problem without considering the database, (i.e. in

continuous space). This “relaxed” optimization problem can be solved eﬃciently in

the main memory using many possible algorithms in the convex optimization litera-

ture. It can be solved in polynomial time in the number of dimensions in the objective

functions and the number of constraints ([43]). Second, search through the index by

pruning the impossible partitions and access the most promising pages. Details for

our technique for diﬀerent index types are provided in the following section.

18

2.4 Query Processing Framework

By solving CP problems associated with the problem constraints, partition con-

straints, and objective function, we ﬁnd the best possible objective function value

achievable by a point in the partition. Thus, there exists a lower bound for each

partition in the database that can be used to prune the data space before retrieving

pages from disk. The bounds can be deﬁned for any type of convex partition (such

as minimum bounding regions) used for indexing the data.

Figure 2.2 is a functional block diagram of the optimization query processing sys-

tem that shows interactions between components. Query processing proceeds ﬁrst by

the user providing an objective function and constraints to the system. A lightweight

query compiler validates the user input, converts the function and constraints into

appropriate format for the query processor, and executes the appropriate query pro-

cessing algorithm using the speciﬁed access structure. The query processor invokes

a CP solver to ﬁnd bounds for partitions that could potentially contain answers to

the query. By solving CP subproblems using index partition boundaries as additional

constraints, we can ensure that we access pages in the order of how promising they

are. The user gets back those data object(s) in the database that answer the query

within the constraints.

The generic query processing framework for a CP query is described below. This

framework can be applied to indexing structures in which the partitions on which

they are built are convex because the partition boundaries themselves are used to

form the new CP problems.

A popular taxonomy of access structures divides the structures into the categories

of hierarchical and non-hierarchical structures. Hierarchical structures are usually

19

Figure 2.2: Optimization Query System Functional Block Diagram

partitioned based on data objects while non-hierarchical structures are usually based

on partitioning space. These two families have important diﬀerences that aﬀect how

queries are processed over them. Therefore we provide algorithms for hierarchical

structures in Subsection 2.4.1 and non-hierarchical structures in 2.4.2.

The overall idea is to traverse the access structure and solve the problem in se-

quential order of how good partitions and search paths could be. We terminate

when we come to a partition that could not contain an optimal answer to the query.

The implementations described address a minimization objective function and prune

based on comparison to minimum values. Solutions to maximization problems can be

performed by using maximums (i.e. “worst” or upper bounds) in place of minimums.

20

The number of results that are returned by the query processing is dependent on

an input k. If the user desires only the single best result, k = 1. It can however, be

an arbitrary value in order to return the k best data objects. For range queries, if all

objects that meet the range constraints are desired, the value of k should be set to n,

the number of data points. Only the data points within the constrained region will be

returned, and only the points that are within the lowest-level partitions that intersect

the constrained region will be examined during query processing. Therefore, the query

processing will be at least as I/O-eﬃcient as standard range query processing over a

given access structure.

Now, we describe the implementations of this framework based on hierarchical

and non-hierarchical indexing techniques. In Section 2.5 we show that both of these

algorithms are in fact optimal, i.e., no unnecessary MBRs (or partitions) given an

access structure are accessed during the query processing.

2.4.1 Query Processing over Hierarchical Access Structures

Algorithm 1 shows the query processing over hierarchical access structures. This

algorithm is applicable to any hierarchical access structure where the partitions at

each tree level are convex. A partial list of structures covered by this algorithm

includes the R-tree, R*-tree, R+-tree, BSP-tree, Quad-tree, Oct-tree, k-d-B-tree,

B-tree, B+-tree, LSD-tree, Buddy-tree, P-tree, SKD-tree, GBD-tree, Extended k-D-

Tree, X-tree, SS-tree, and SR-tree [25, 9].

21

The algorithm is analogous to existing optimal NN processing [31], except that

partitions are traversed in order of promise with respect to an arbitrary convex func-

tion, rather than distance. Additionally, our algorithm also prunes out any partitions

and search paths that do not fall within problem constraints.

We solve a new CP problem for each partition that we traverse using the objective

function, original problem constraints, and the constraints imposed by the partition

boundaries. The algorithm takes as input the convex optimization function of interest

OF, the set of convex constraints C, and a desired result set size k. In lines 1-3, we

initialize the structures that we maintain throughout a query. We keep a vector of

the k best points found so far as well as a vector of the objective function values for

these points. We keep a worklist of promising partitions along with the minimum

objective value for the partition. We initialize the worklist to contain the tree root

partition.

We then process entries from the worklist until the worklist is empty. In line 5,

we remove the ﬁrst promising partition. If this partition is a leaf partition, then we

access the points in the partition and compute their objective function values. If a

point is not within the original problem constraints, then it will not have a feasible

solution and will be discarded from further consideration. Points with objective

function values lower than the objective function value of the k

th

best point so far

will cause the best point and best point objective function value vectors to be updated

accordingly.

If the partition under analysis is not a leaf, we ﬁnd its child partitions. For each

of these child partitions, we solve the CP problem corresponding to the objective

function, original problem constraints, and partition constraints (line 11). This yields

22

CPQueryHierarchical (Objective Function OF,

Constraints C,k)

notes:

b[i] - the i

th

best point found so far

F(b[i]) - the objective function value of the i

th

best point

L - worklist of (partition, minObjVal) pairs

1: b[1]...b[k] ←null.

2: F(b[1])...F(b[k]) ← +∞

3: L ← (root)

4: while L = ∅

5: remove ﬁrst partition P from L

6: if P is a leaf

7: access and compute functions for points in

P within constraints C

8: update vectors b and F if better points

discovered

9: else

10: for each child P

child

of P

11: minObjV al = CPSolve(OF,P

child

,C)

12: if minObjV al < F(b[k])

13: insert P

child

into L according to

minObjV al

14: end for

15: end if

16: prune partitions from L with minObjV al >

F(b[k])

17: end while

18: return vector b

Algorithm 1: I/O Optimal Query Processing for Hierarchical Index Structures.

the best possible objective function value achievable within the intersection of the

original problem constraints and partition constraints. If the minimum objective

function value is less than the k

th

best objective function value (i.e. it is possible for

a point within the intersection of the problem constraints and partition constraints to

23

beat the k

th

best point), then the partition is inserted into the worklist according to

its minimum objective value. If there is no feasible solution for the partition within

problem constraints, the partition will be dropped from consideration.

After a partition is processed, we may have updated our k

th

best objective function

value. We prune any partitions in the worklist that can not beat this value.

Query Processing Example As an example of query processing, consider Fig-

ure 2.3 with an optimization objective of minimizing the distance to the query point

within the constrained region (i.e. constrained 1-NN query). We initialize our work-

list with the root and start by processing it. Since it is not a leaf, we examine its child

partitions A and B. We solve for the minimum objective function value for A and do

not obtain a feasible solution because A and the constrained region do not intersect.

We do not process A further. We ﬁnd the minimum objective function for partition B

and get the distance from the query point to the point where the constrained region

and the partition B intersect, point c. We insert partition B into our worklist and

since it is the only partition in the worklist, we process it next. Since B is not a

leaf, we examine its child partitions, B.1, B.2, and B.3. We ﬁnd minimum objective

function values for each and obtain the distances from the query point to point d, e,

and f respectively. Since point e is closest, we insert partitions into the worklist in

the order B.2, B.1, and B.3.

We process B.2 and since it is a leaf, we compute objective function values for

the points in B.2. Point B.2.a is the closer of the two. This point is designated

as the best point thus far, and the best point objective function is updated. After

processing B.2, we know we can do no worse than the distance from the query point

to B.2.a. Since partition B.3 can not do better than this, we prune it from the

24

worklist. We examine points in B.1. Point B.1.b is not a feasible solution (it is not

in the constrained region) so it is dropped. Point B.1.a gets a value which is better

than the current best point. We update the best point to be B.1.a. Since the worklist

is now empty, we have completed the query and return the best point.

The optimal point for this optimization query this query is B.1.a. We examine

only points in partitions that could contain points as good as the best solution.

Figure 2.3: Example of Hierarchical Query Processing

25

2.4.2 Query Processing over Non-hierarchical Access Struc-

tures

In this section, we show that the general framework for CP queries can also be

implemented on top of non-hierarchical structures. Algorithm 2 can be applied to

non-hierarchical access structures where the partitions are convex. A partial list of

structures where the algorithm can be applied is Linear Hashing, Grid-File, EXCELL,

VA-File, VA+-File, and Pyramid Technique [25, 9, 53, 23].

We use a two stage approach detailed in Algorithm 2 to determine which data

objects optimize the query within the constraints. This technique is similar to the

technique identiﬁed in [53] for ﬁnding NN for a non-hierarchical structure, but we use

the objective function rather than a distance and also consider original problem con-

straints to ﬁnd lower bounds. We take the objective function, OF, original problem

constraints C, and result set size k as inputs. We initialize the structures to be used

during query processing (lines 1-3). In stage 1, we process each populated partition

vin the structure by solving the CP problem that corresponds to the original objec-

tive function and constraints combined with the additional constraints imposed by the

partition boundaries (line 5). Those partitions that do not intersect the constraints

will be bypassed. For those that do have a feasible solution, the lower bound of the

CP problem is recorded. At the end of stage 1, we place unique unpruned partitions

in increasing sorted order based on their lower bound. In stage 2, we process the

unpruned partitions by accessing objects contained by the ﬁrst partition and com-

puting actual objective values. We maintain a sorted list of the top-k objects we have

found so far, and continue accessing points contained by subsequent partitions until

a partition’s lower bound is greater than the k

th

best value. By accessing partitions

26

in order of how good they could be, according to the CP problem, we access only

those points in cell representations that could contain points as good as the actual

solution.

CPQueryNonHierarchical (Objective Function OF,

Constraints C,k)

notes:

b[i] - the i

th

best point found so far

F(b[i]) - the objective function value of the i

th

best point

L - worklist of (partition, minObjVal) pairs

Initialize:

1: b[1]...b[k] ←null.

2: F(b[1])...F(b[k]) ← +∞

3: L ← ∅

Stage 1:

4: for each populated partition P

5: minObjV al = CPSolve(OF, P, C)

6: if minObjV al < +∞

7: insert (P, minObjV al) into L according to

minObjV al

Stage 2:

8: for i = 1 to |L|

9: if minObjV al of partition L[i] > F(b[k])

10: break

11: Access objects that partition L[i] contains

12: for each object o

13: if OF(o) < F(b[k])

14: insert o into b

15: update F(b)

Algorithm 2: I/O Optimal Query Processing over Non-hierarchical Structures.

27

2.4.3 Non-Covered Access Structures

A review of access structure surveys yielded few structures that are not covered by

the algorithms. These include structures that either do not guarantee spatial locality

(i.e. space ﬁlling curves which are used for approximate ordering). They include

structures built over values computed for a speciﬁc function (such as distance in M-

trees. Since attribute information is lost, they can not be applied to general functions

over the original attributes). They include structures that contain non-convex regions

(BANG-File, hB-tree, BV-tree) [25, 9]. Note that M-trees would still work to answer

queries for the functions they are built over and the structures built with non-convex

leaves could still be optimally traversed down to the level of non-convexity.

2.5 I/O Optimality

In this section, we will show that the proposed query processing technique achieves

optimal I/O for each implementation in Section 2.4. We denote the objective func-

tion contour which goes through the actual optimal data point in the database as the

optimal contour. This contour also goes through all other points in continuous space

that yield the same objective function value as the optimal point. The k

th

optimal

contour would pass through all points in continuous space that yield the same objec-

tive function value as the k

th

best point in the database. In the following arguments,

we use the optimal contour, but could replace them with the k

th

optimal contour.

Following the traditional deﬁnition in [5], we deﬁne the optimality as follows.

Deﬁnition 3 (Optimality) An algorithm for optimization queries is optimal iﬀ it

retrieves only the pages that intersect the intersection of the constrained region R and

the optimal contour.

28

For hierarchical tree based query processing, the proof is given in the following.

Lemma 1 The proposed query processing algorithm is I/O optimal for CP queries

under convex hierarchical structures.

Proof. It is easy to see that any partition intersecting the intersection of optimal

contour and feasible region R will be accessed since they are not pruned in any phase

of the algorithm. We need to prove only these partitions are accessed, i.e., other

partitions will be pruned without accessing the pages.

Figure 2.4 shows diﬀerent cases of the various position relations between a parti-

tion, R and the optimal contour under 2 dimensions. a is an optimal data point in the

data base. We will use minObjV al(Partition, R) to denote the minimum objective

function value that is computed for the partition and constraints R.

A: intersects neither R nor the optimal contour.

B: intersects R but not the optimal contour.

C: intersects the optimal contour but not R.

D: intersects the optimal contour and R, but not the intersection of optimal con-

tour and R.

E: intersects the intersection of the optimal contour and R.

29

Figure 2.4: Cases of MBRs with respect to R and optimal objective function contour

A and C will be pruned since they are infeasible regions, when the CP for these

partitions are solved, they will be eliminated from consideration since they do not

intersect the constrained region and minObjV al(A, R) = minObjV al(C, R) = +∞.

B is pruned because minObjV al(B, R) > F(a) and E will be accessed earlier than

B in the algorithm because minObjV al(E, R) < minObjV al(B, R). We shall show

that a page in case D is pruned by the algorithm. Let CP

0

be the data partition that

contains an optimal data point, CP

1

be the partition that contains CP

0

, . . ., and CP

k

be the root partition that contains CP

0

, CP

1

, , CP

k−1

. Because the objective function

is convex, we have

F(a) ≥ minObjV al(CP

0

, R) ≥ minObjV al(CP

1

, R) ≥ . . .

≥ minObjV al(CP

k

, R)

30

and

minObjV al(D, R) > F(a) ≥ minObjV al(CP

0

, R) ≥

minObjV al(CP

1

, R) ≥ . . . ≥ minObjV al(CP

k

, R)

During the search process of our algorithm, CP

k

is replaced by CP

k−1

, and CP

k−1

by CP

k−2

, and so on, until CP

0

is accessed. If D will be accessed at some point

during the search, then D must be the ﬁrst in the MBR list at some time. This

only can occur after CP

0

has been accessed because minObjV al(D, R) is larger than

constrained function value of any partition containing the optimal data point. If CP

0

is accessed earlier than D, however, the algorithm prunes all partitions which have a

constrained function value larger than F(a), including D. This contradicts with the

assumption that D will be accessed.

The I/O-optimality relates to accessing data from leaf partitions. However, the

algorithm is also optimal in terms of the access structure traversal, that is, we will

never retrieve the objects within an MBR (leaf or non-leaf), unless that MBR could

contain an optimal answer. The proof applies to any partition, whether it contains

data or other partitions.

Similarly, we can prove the I/O optimality of proposed algorithm over non-hierarchical

structures.

Lemma 2 The proposed query processing algorithm is I/O optimal for CP queries

over non-hierarchical convex structures.

Proof. Figure 2.5 shows diﬀerent possible cases of the partition position. Partitions

A, B, C, D, and E are the same as deﬁned for Figure 2.4.

31

Figure 2.5: Cases of partitions with respect to R and optimal objective function

contour

Without loss of generality, assume that only these ﬁve partitions contain some

data points, and partition E contains an optimal data point, a, for a query F(x, y).

Then an optimal algorithm retrieves only partition E.

Partitions C and A will not be retrieved because they are infeasible and our

algorithm will never retrieve infeasible partitions. Partition B will not be retrieved

because partition E will be retrieved before partition B (because minObjV al(E, R) <

minObjV al(B, R)), and the best data point found in E will prune partition B. Thus,

we only need to show that partition D will not be retrieved.

32

Because partition D does not intersect the intersection of the optimal contour

and R but E does, minObjV al(D, R) > minObjV al(E, R). Therefore, partition

E must be retrieved before partition D since the algorithm always retrieves the

most promising partition ﬁrst, and the data point a is found at that time. Sup-

pose partition D is also retrieved later. This means partition D cannot be pruned by

data point a, i.e., F(a) ≥ minObjV al(D, R), which also means the virtual solution

corresponding to (minObjV al(D, R)) is contained by the optimal contour (because

of the convexity of function F ). Therefore, the virtual solution corresponding to

(minObjV al(D, R)) ∈ D ∩ R∩optimal contour. Hence, we can ﬁnd at least one vir-

tual point x

∗

(without considering the database) in partition D such that x

∗

is both

in R and the optimal contour. This contradicts the assumption that partition D does

not intersect the intersection of R and the optimal contour.

2.6 Performance Evaluation

The goals of the experiments are to show i) that the proposed framework is general

and can be used to process a variety of queries that optimize some objective ii) that

the generality of the framework does not impose signiﬁcant cpu burden over those

queries that already have an optimal solution, and iii) that non-relevant data subspace

can be eﬀectively pruned during query processing.

We performed experiments for a variety of optimization query types including

kNN, Weighted kNN, Weighted Constrained kNN, and Constrained Weighted Linear

Optimization. Where possible, we compare cpu times against a non-general optimiza-

tion technique. We index and perform que-ries over four real data sets. One of these

is Color Histogram data, a 64-dimensional color image histogram dataset of 100,000

33

data points. The vectors represent color histograms computed from a commercial

CD-ROM. The second is Satellite Image Texture (Landsat) data, which consists of

100,000 60-dimensional vectors representing texture feature vectors of Landsat images

[40]. These datasets are widely used for performance evaluation of index structures

and similarity search algorithms [39, 26]. We used a clinical trials patient data set

and a stock price data set to perform real world optimization queries.

Weighted kNN Queries Figure 2.6 shows the performance of our query pro-

cessing over R*-trees for the ﬁrst 8-dimensions of the histogram data for both kNN

and k weighted NN (k-WNN) queries. The index is built over the ﬁrst 8 dimensions

of the histogram data and the queries are generated to ﬁnd the kNN or k-WNN to a

random point in the data space over the same 8 dimensions. We execute 100 queries

for each trial, and randomly assign weights between 0 and 1 for the k-WNN queries.

We vary the value of k between 1 and 100. We assume the index structure is in

memory and we measure the number of additional page accesses required to access

points that could be kNN according to the query processing. Because the weights

are 1 for the kNN queries, our algorithm matches the I/O-optimal access oﬀered by

traditional R-tree nearest neighbor branch and bound searching.

The ﬁgure shows that we achieve nearly the same I/O performance for weighted

kNN queries using our query processing algorithm over the same index that is ap-

propriate for traditional non-weighted kNN queries. For unconstrained, unweighted

nearest neighbor queries, the optimal contour is a hyper-sphere around the query point

with a radius equal to the distance of the nearest point in the data set. If instead, the

queries are weighted nearest neighbor queries, the optimal contour is a hyper-ellipse.

Intuitively, as the weights of a weighted nearest neighbor query become more skewed,

34

Figure 2.6: Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram, 100k

pts

the hyper-volume enclosed by the optimal contour increases, and the opportunity to

intersect with access structure partitions increases. Results for the weighted queries

track very closely to the non-weighted queries. This indicates that these weighted

functions do not cause the optimal contour to cross signiﬁcantly more access struc-

ture partitions than their non-weighted counterparts. For a weighted nearest neighbor

query, we could generate an index based on attribute values transformed based on

the weights and yield an optimal contour that was hyper-spherical. We could then

traverse the access structure using traditional nearest neighbor techniques. However,

this would require generating an index for every potential set of query weights, which

is not practical for applications where weights are not typically known prior to the

35

query. For similar applications where query weights can vary, and these weights are

not known in advance, we would be better served to process the query over a single

reasonable access structure in an I/O-optimal way than by building speciﬁc access

structures for a speciﬁc set of weights.

Figure 2.7 shows the average cpu times required per query to traverse the access

structure using convex problem solutions for the weighted kNN queries compared to

the standard nearest neighbor traversal used for unweighted kNN queries.

Figure 2.7: CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,

100k pts

36

The ﬁgure shows that the generic CP-driven solution takes slightly longer to per-

form than a specialized solution for nearest neighbor queries alone. Part of this

diﬀerence is due to the slightly more complicated objective function (weighted ver-

sus unweighted Euclidean distance), part is due to the additional partitions that

the optimal contour crosses, while the remaining part is due to incorporating data

set constraints (unused in this example) into computing minimum partition function

values.

Hierarchical data partitioning access structures are not eﬀective for high-dimensional

data sets. As data set dimensionality increases, the partitions overlap with each other

more. As partitions overlap more, the probability that a point’s objective function

value can prune other partitions decreases. Therefore, the same characteristics that

make high-dimensional R-tree type structures ineﬀective for traditional queries also

make them ineﬀective for optimization queries. For this reason, we perform similar

experiments for high-dimensional data using VA-ﬁles.

Figure 2.8 shows page access results for kNN and k-WNN queries over 60-dimensional,

5-bit per attribute, VA-ﬁle built over the landsat dataset. The experiments assume

the VA-ﬁle is in memory and measure the number of objects that need to be accessed

to answer the query.

Similar to the R*-tree case, the results show that the performance of the algorithm

is not signiﬁcantly aﬀected by re-weighting the objective function. Sequential scan

would result in examining 100,000 objects. If the dataspace were more uniformly

populated, we would expect to see object accesses much closer to k. However, much

of the data in this data set is clustered, and some vector representations are heavily

37

Figure 2.8: Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts

populated. If a best answer comes from one of these vector representations, we need

to look at each object assigned the same vector representation.

It should be noted that while the framework results in optimal access of a given

structure, it can not overcome the inherent weaknesses of that structure. For example,

at high dimensionality, hierarchical data structures are ineﬀective, and the framework

can not overcome this. A consequence of VA-File access structures is that, without

some other data reorganization, a computation needs to occur for every approximated

object. As distance computations need to be performed for each vector in traditional

VA-ﬁle nearest neighbor algorithms, following the VA-ﬁle processing model so too

must a CP problem be solved for each vector for general optimization queries. Times

38

for this experiment reﬂect this fact. It takes longer to perform each query but the

ratio of the general optimization query times to the traditional kNN CPU times is

similar to the R*-tree case, about 1.006 (i.e. very little computational burden is

added to allow query generality).

Constrained Weighted Linear Optimization Queries We also explore Con-

strained Weighted-Linear Optimization queries. Figure 2.9 shows the number of page

accesses required to answer randomly weighted LP minimization queries for the his-

togram data set using the 8-D R*-Tree access structure. We vary k and we vary the

constrained area region ﬁxed at the origin as a 8-D hypercube with the indicated

value. We generated 100 queries for each data point in the form of a summation of

weighted attributes over the 8 dimensions. Weights are uniformly randomly generated

and are between the values of -10 and 10.

The graph shows interesting results. The number of page accesses is not directly

related to the constraint size. The rate of increase in page accesses to accommodate

diﬀerent values of k does vary with the constraint size. When weights can be negative,

the corner corresponding to the minimum constraint value for the positively weighted

attributes, and the maximum constrained value for the negatively weighted attributes

will yield the optimal result within the constrained region (e.g. corner [0,1,0] would

be an optimal minimization answer for a constrained unit cube and a linear function

with a weight vector [+,-,+]). For this data set, many points are clustered around

the origin, leading to denser packing of access structure partitions near the origin.

Because of the sparser density of partitions around the point that optimizes the query,

we access fewer partitions when this particular problem is less constrained. However,

the radius of the k

th

-optimal contour increases more in these less dense regions than

39

Figure 2.9: Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree,

8-D Color Histogram, 100k pts

more clustered regions, leading to a faster increase in the number of page accesses

required when k increases.

Note that the Onion technique is not feasible for this dataset dimensionality,

and it, as well as the other competing techniques are not appropriate for convex

constrained areas. For comparative purposes, the sequential scan of this data set is

834 pages.

We should also note what happens when there are less than k optimal answers in

the data set. This means that there are less than k objects in our constrained region.

For our query processing, we only need to examine those points that are in partitions

40

that intersect the constrained region. A technique that did not evaluate constraints

prior to or during query processing will need to examine every point.

Queries with Real Constraints The previous example showed results with

synthetically generated constraints. We also wanted to explore that constrained op-

timization problems for real applications. We performed Weighted Constrained kNN

experiments to emulate a patient matching data exploration task described in Sec-

tion 2.3.3. To test this, we performed a number of queries over a set of 17000 real

patient records. We established constraints on age and gender and queried to ﬁnd the

10 weighted nearest neighbors in the data set according to 7 measured blood analytes.

Using the VA-ﬁle structure with 5 bits assigned per attribute and 100 randomly se-

lected, randomly weighted queries, we achieved an average of 11.5 vector accesses to

ﬁnd the k = 10 results. This means that on average the number of vectors within the

intersection of the constrained space and 10

th

-optimal contour is 11.5. It corresponds

to a vector access ratio of 0.0007.

Arbitrary Functions over Diﬀerent Access Structures In order to show that

the framework can be used to ﬁnd optimal data points for arbitrary convex functions,

we generated a set of 100 random functions. Unlike nearest neighbor and linear

optimization type queries, the continuous optimal solution is not intuitively obvious

by examining the function. Functions contain both quadratic and linear terms and

were generated over 5 dimensions in the form

¸

5

i=1

(random() ∗ x

2

i

−random() ∗ x

i

),

where x

i

is the data attribute value for the i

th

dimension. A sample function, where

coeﬃcients are rounded to two decimal places, is to minimize 0.91x

2

1

+0.49x

2

2

+0.4x

2

3

+

0.55x

2

4

+ 0.76x

2

5

−0.09x

1

−0.49x

2

−0.28x

3

−0.91x

4

−0.51x

5

.

41

We explored the results of our technique over 5 index structures. These include

3 hierarchical structures, including the R-tree, the R*-tree, and the VAM-split bulk

loaded R*-tree. We explored 2 non-hierarchical structures, the grid ﬁle and the VA-

ﬁle. For each of these index types, we modiﬁed code to include an optimization

function that uses CP to traverse the index structure using the algorithm appropriate

for the structure. We added a new set of 50000 uniform random data points within

the 5-dimension hypercube and built each of the index structures over this data set.

We ran the set of random functions over each structure and measured the number

of objects in the partitions that we retrieve. For the hierarchical structures these are

leaf partitions and for the non-hierarchical structures they are grid cells. For each

structure we try to keep the number partitions relatively close to each other. The

number of partitions for the hierarchical structures is between 12000 and 13000, the

grid ﬁle 16000, and the VA-File 32000.

Figure 2.10 shows results over the 100 functions. It shows the minimum, maxi-

mum, and average object access ratio to guarantee discovery of the optimal answer

over the set of functions. Each index yielded the same correct optimal data point for

each function, and these optimal answers were scattered throughout the data space.

Each structure performed well for these optimization queries, the average ratio of the

data points accessed is below 0.0025 for all the access structures. Even the worst

performing structure over the worst case function still prunes over 99% of the search

space.

Results between hierarchical access structures are as expected. The R*-tree looks

to avoid some of the partition overlaps produced by the R-tree insertion heuristics

and it shows better performance. The VAM-split R*-tree bulk loads the data into the

42

Figure 2.10: Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform

Random, 50k Points

index and can avoid more partition overlaps than an index built using dynamic inser-

tion. As expected, this yields better performance than the dynamically constructed

R*-tree.

The VA-File has greater resolution than the grid-ﬁle in this particular test, and

therefore achieves better access ratios.

Incorporating Problem Constraints during Query Processing The pro-

posed framework processes queries in conjunction with any set of convex problem

constraints. We compare the proposed framework to alternative methods for han-

dling constraints in optimization query processing. For a constrained NN type query

we could process the query according to traditional NN techniques and when we

ﬁnd a result, check it against the constraints until we ﬁnd a point that meets the

43

constraints. Alternatively, we could perform a query on the constraints themselves,

and then perform distance computations on the result set and select the best one.

Figure 2.11 shows the number of pages accessed using our technique compared to

these two alternatives as the constrained area varies. Constrained kNN queries were

performed over the 8-D R*-tree built for the Color Histogram data set.

Figure 2.11: Pages Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram,

100k Points

In the ﬁgure, the R*-tree line shows the number of page accesses required when

we use traditional NN techniques, and then check if the result meets the constraints.

As the problem becomes less constrained, the more likely a discovered NN will fall

within the constraints, and we can terminate the search. The Selectivity line is a

44

lower bound on the number of pages that would be read to obtain the points within

the constraints. Clearly, as the problem becomes more constrained, the fewer pages

will need to be read to ﬁnd potential answers. The CP line shows the results for our

framework, which processes original problem constraints as part of the search process

and does not further process any partitions that do not intersect constraints.

We demonstrate better results with respect to page accesses except in cases where

the constraints are very small (and we can prune much of the space by performing a

query over the constraints ﬁrst) or when the NN problem is not constrained (where

our framework reduces to the traditional unconstrained NN problem). When there

are problem constraints, we prune search paths that can not lead to feasible solutions.

Our search drills down to the leaf partition that either contains the optimal answer

within the problem constraints, or yields the best potential answer within constraints.

2.7 Conclusions

In this chapter we present a general framework to model optimization queries.

We present a query processing framework to answer optimization queries in which

the optimization function is convex, query constraints are convex, and the underlying

access structure is made up of convex partitions. We provide speciﬁc implementations

of the processing framework for hierarchical data partitioning and non-hierarchical

space partitioning access structures. We prove the I/O optimality of these implemen-

tations. We experimentally show that the framework can be used to answer a number

of diﬀerent types of optimization queries including nearest neighbor queries, linear

function optimization, and non-linear function optimization queries.

45

The ability to handle general optimization queries within a single framework comes

at the price of increasing the computational complexity when traversing the access

structure. We show a slight increase in CPU times in order to perform convex pro-

gramming based traversal of access structures in comparison with equivalent non-

weighted and unconstrained versions of the same problem.

Since constraints are handled during the processing of partitions in our framework,

and partitions and search paths that do not intersect with problem constraints are

pruned during the access structure traversal, we do not need to examine points that

do not intersect with the constraints. Furthermore, we only access points in partitions

that could potentially have better answers than the actual optimal database answers.

This results in a signiﬁcant reduction in required accesses compared to alternative

methods in the cases where the inclusion of constraints makes these alternatives less

eﬀective.

The proposed framework oﬀers I/O-optimal access of whatever access structure is

used for the query. This means that given an access structure we will only read points

from partitions that have minimum objective function values lower than the k

th

ac-

tual best answer. Because the framework is built to best utilize the access structure,

it captures the beneﬁts as well as the ﬂaws of the underlying structure. Essentially,

it demonstrates how well the particular access structure isolates the optimal contour

within the problem constraints to access structure partitions. If the optimal contour

and problem constraints can be isolated to a few leaf partitions, the access structure

will yield nice results. However, if the optimal contour crosses many partitions, the

performance will not be as good. As such, the framework can be used to measure

page access performance associated with using diﬀerent indexes and index types to

46

answer certain classes of optimization queries, in order to determine which struc-

tures can most eﬀectively answer the optimization query type. Database researchers

and administrators can use this technique as a benchmarking tool to evaluate the

performance of a wide range of index structures available in the literature.

To use this framework, one does not need to know the objective function, weights,

or constraints in advance. The system does not need to compute a queried function

for all data objects in order to ﬁnd the optimal answers. The technique provides

optimization of arbitrary convex functions, and does not incur a signiﬁcant penalty

in order to provide this generality. This makes the framework appropriate for ap-

plications and domains where a number of diﬀerent functions are being optimized

or when optimization is being performed over diﬀerent constrained regions and the

exact query parameters are not known in advance.

47

CHAPTER 3

Online Index Recommendations for High-Dimensional

Databases using Query Workloads

When solving problems, dig at the roots instead of just hacking at the

leaves. - Anthony J. D’Angelo.

In order to eﬀectively prune search space during query resolution, access struc-

tures that are appropriate for the query must be available to the query processor

at query run-time. Appropriate access structures should match well with query se-

lection criteria and allow for a signiﬁcant reduction in the data search space. This

chapter describes a framework to determine a meaningful set of appropriate access

structures for a set of queries considering query cost savings estimated from both

query attributes and data characteristics. Since query patterns can change over time,

rendering statically determined access structures ineﬀective, a method for monitoring

performance and recommending access structure changes is provided.

3.1 Introduction

An increasing number of database applications, such as business data warehouses

and scientiﬁc data repositories, deal with high dimensional data sets. As the number

of dimensions/attributes and overall size of data sets increase, it becomes essential to

eﬃciently retrieve speciﬁc queried data from the database in order to eﬀectively utilize

48

the database. Indexing support is needed to eﬀectively prune out signiﬁcant portions

of the data set that are not relevant for the queries. Multi-dimensional indexing,

dimensionality reduction, and RDBMS index selection tools all could be applied to the

problem. However, for high-dimensional data sets, each of these potential solutions

has inherent problems.

To illustrate these problems, consider a uniformly distributed data set of 1,000,000

data objects with several hundred attributes. Range queries are consistently executed

over ﬁve of the attributes. The query selectivity over each attribute is 0.1, so the

overall query selectivity is 1/10

5

(i.e. the answer set contains about 10 results).

An ideal solution would allow us to read from disk only those pages that contain

matching answers to the query. We could build a multi-dimensional index over the

data set so that we can directly answer any query by only using the index. However,

performance of multi-dimensional index structures is subject to Bellman’s curse of

dimensionality [4] and degrades rapidly as the number of dimensions increases. For

the given example, such an index would perform much worse than sequential scan.

Another possibility would be to build an index over each single dimension. The

eﬀectiveness of this approach is limited to the amount of search space that can be

pruned by a single dimension (in the example the search space would only be pruned

to 100,000 objects).

For data-partitioning indexes such as the R-tree family of indexes, data is placed

in a partition that contains the data point and could overlap with other partitons.

To answer a query, all potentially matching search paths must be explored. As the

dimensionality of the index increases, the overlaps between partitions increase and

49

at high enough dimensions the entire data space needs to be explored. For space-

partitioning structures, where partitions do not overlap and data points are associated

with cells which contain them (e.g. grid ﬁles), the problem is the exponential explosion

of the number of cells. A 100 dimensional index with only a single split per attribute

results in 2

100

cells. The index structure can become intractably large just to identify

the cells. This phenomenon is thoroughly described in [54].

Another possible solution would be to use some dimensionality reduction tech-

nique, index the reduced dimension data space, and transform the query in the

same way that the data was transformed. However, the dimensionality reduction

approaches are mostly based on data statistics, and perform poorly especially when

the data is not highly correlated. They also introduce a signiﬁcant overhead in the

processing of queries.

Another possible solution is to apply feature selection to keep the most important

attributes of the data according to some criteria and index the reduced dimension-

ality space. However, traditional feature selection techniques are based on selecting

attributes that yield the best classiﬁcation capabilities. Therefore, they also select

attributes based on data statistics to support classiﬁcation accuracy rather than focus-

ing on query performance and workload in a database domain. As well, the selected

features may oﬀer little or no data pruning capability given query attributes.

Commercial RDBMS’s have developed index recommendation systems to identify

indexes that will work well for a given workload. These tools are optimized for the

domains for which these systems are primarily employed and the indexes that the

systems provide. They are targeted towards lower dimension transactional databases

and do not produce results optimized for single high-dimensional tables.

50

Our approach is based on the observation that in many high dimensional database

applications, only a small subset of the overall data dimensions are popular for a

majority of queries and recurring patterns of dimensions queried occur. For example,

Large Hadron Collider (LHC) experiments are expected to generate data with up to

500 attributes at the rate of 20 to 40 per second [45]. However, the search criterion

is expected to consist of 10 to 30 parameters. Another example is High Energy

Physics (HEP) experiments [49] where sub-atomic particles are accelerated to nearly

the speed of light, forcing their collision. Each such collision generates on the order

of 1-10MBs of raw data, which corresponds to 300TBs of data per year consisting

of 100-500 million objects. The queries are predominantly range queries and involve

mostly around 5 dimensions out of a total of 200.

We address the high dimensional database indexing problem by selecting a set of

lower dimensional indexes based on joint consideration of query patterns and data

statistics. This approach is also analogous to dimensionality reduction or feature

selection with the novelty that the reduction is speciﬁcally designed for reducing query

response times, rather than maintaining data energy as is the case for traditional

approaches. Our reduction considers both data and access patterns, and results in

multiple and potentially overlapping sets of dimensions, rather than a single set. The

new set of low dimensional indexes is designed to address a large portion of expected

queries and allow eﬀective pruning of the data space to answer those queries.

Query pattern evolution over time presents another challenging problem. Re-

searchers have proposed workload based index recommendation techniques. Their

long term eﬀectiveness is dependent on the stability of the query workload. However,

query access patterns may change over time becoming completely dissimilar from the

51

patterns on which the index set were originally determined. There are many common

reasons why query patterns change. Pattern change could be the result of periodic

time variation (e.g. diﬀerent database uses at diﬀerent times of the month or day), a

change in the focus of user knowledge discovery (e.g. a researcher discovery spawns

new query patterns), a change in the popularity of a search attribute (e.g. current

events cause an increase in queries for certain search attributes), or simply random

variation of query attributes. When the current query patterns are substantially dif-

ferent from the query patterns used to recommend the database indexes, the system

performance will degrade drastically since incoming queries do not beneﬁt from the

existing indexes. To make this approach practical in the presence of query pattern

change, the index set should evolve with the query patterns. For this reason, we

introduce a dynamic mechanism to detect when the access patterns have changed

enough that either the introduction of a new index, the replacement of an existing

index, or the construction of an entirely new index set is beneﬁcial.

Because of the need to proactively monitor query patterns and query performance

quickly, the index selection technique we have developed uses an abstract represen-

tation of the query workload and the data set that can be adjusted to yield faster

analysis. We generate this abstract representation of the query workload by mining

patterns in the workload. The query workload representation consists of a set of

attribute sets that occur frequently over the entire query set that have non-empty

intersections with the attributes of the query, for each query. To estimate the query

cost, the data set is represented by a multi-dimensional histogram where each unique

value represents an approximation of data and contains a count of the number of

52

records that match that approximation. For each possible index for each query, the

estimated cost of using that index for the query is computed.

Initial index selection occurs by traversing the query workload representation and

determining which frequently occurring attribute set results in the greatest beneﬁt

over the entire query set. This process is iterated until some indexing constraint is met

or no further improvement is achieved by adding additional indexes. Analysis speed

and granularity is aﬀected by tuning the resolution of the abstract representations.

The number of potential indexes considered is aﬀected by adjusting data mining

support level. The size of the multi-dimensional histogram aﬀects the accuracy of the

cost estimates associated with using an index for a query.

In order to facilitate online index selection, we propose a control feedback system

with two loops, a ﬁne grain control loop and a coarse control loop. As new queries

arrive, we monitor the ratio of potential performance to actual performance of the

system in terms of cost and based on the parameters set for the control feedback

loops, we make major or minor changes to the recommended index set.

The contributions of this work can be summarized as follows:

1. Introduction of a ﬂexible index selection technique designed for high-dimensional

data sets that uses an abstract representation of the data set and query work-

load. The resolution of the abstract representation can be tuned to achieve

either a high ratio of index-covered queries for static index selection or fast

index selection to facilitate online index selection.

2. Introduction of a technique using control feedback to monitor when online

query access patterns change and to recommend index set changes for high-

dimensional data sets.

53

3. Presentation of a novel data quantization technique optimized for query work-

loads.

4. Experimental analysis showing the eﬀects of varying abstract representation

parameters on static and online index selection performance and showing the

eﬀects of varying control feedback parameters on change detection response.

The rest of the chapter is organized as follows. Section 3.2 presents the related

work in this area. Section 3.3 explains our proposed index selection and control

feedback framework. Section 3.4 presents the empirical analysis. We conclude in

Section 3.5.

3.2 Related Work

High-dimensional indexing, feature selection, and DBMS index selection tools are

possible alternatives for addressing the problem of answering queries over a subspace

of high-dimensional data sets. As described below, each of these methods provides

less than ideal solutions for the problem of fast high-dimensional data access. Our

work diﬀers from the related index selection work in that we provide a index selection

framework that can be tuned for speed or accuracy. Our technique is optimized

to take advantage of multi-dimensional pruning oﬀered by multi-dimensional index

structures. It takes into consideration both data and query characteristics and can

be applied to perform real-time index recommendations for evolving query patterns.

3.2.1 High-Dimensional Indexing

A number of techniques have been introduced to address the high-dimensional

indexing problem, such as the X-tree [6], and the GC-tree [28]. While these index

54

structures have been shown to increase the range of eﬀective dimensionality, they still

suﬀer performance degradation at higher index dimensionality.

3.2.2 Feature Selection

Feature selection techniques [7, 36, 29] are a subset of dimensionality reduction

targeted at ﬁnding a set of untransformed attributes that best represent the overall

data set. These techniques are also focused on maximizing data energy or classiﬁca-

tion accuracy rather than query response. As a result, selected features may have no

overlap with queried attributes.

3.2.3 Index Selection

The index selection problem has been identiﬁed as a variation of the Knapsack

Problem and several papers proposed designs for index recommendations [34, 55, 3,

24, 17, 13] based on optimization rules. These earlier designs could not take advantage

of modern database systems’ query optimizer. Currently, almost every commercial

Relational Database Management System (RDBMS) provides the users with an index

recommendation tool based on a query workload and using the query optimizer to

obtain cost estimates. A query workload is a set of SQL data manipulation state-

ments. The query workload should be a good representative of the types of queries

an application supports.

Microsoft SQL Server’s AutoAdmin tool [15, 16, 1] selects a set of indexes for use

with a speciﬁc data set given a query workload. In the AutoAdmin algorithm, an

iterative process is utilized to ﬁnd an optimal conﬁguration. First, single dimension

candidate indexes are chosen. Then a candidate index selection step evaluates the

queries in a given query workload and eliminates from consideration those candidate

55

indexes which would provide no useful beneﬁt. Remaining candidate indexes are eval-

uated in terms of estimated performance improvement and index cost. The process

is iterated for increasingly wider multi-column indexes until a maximum index width

threshold is reached or an iteration yields no improvement in performance over the

last iteration.

Costs are estimated using the query optimizer which is limited to considering those

physical designs oﬀered by the DBMS. In the case of SQL Server, single and multi-level

B+-trees are evaluated. These index structures can not achieve the same level of result

pruning that can be oﬀered by an index technique that indexes multiple dimensions

simultaneously (such as R-tree or grid ﬁle). As a result the indexes suggested by the

tool often do not capture the query performance that could be achieved for multi-

dimensional queries.

MAESTRO (METU Automated indEx Selection Tool)[19] was developed on top

of Oracle’s DBMS to assist the database administrator in designing a complete set of

primary and secondary indexes by considering the index maintenance costs based on

the valid SQL statements and their usage statistics automatically derived using SQL

Trace Facility during a regular database session. The SQL statements are classiﬁed by

their execution plan and their weights are accumulated. The cost function computed

by the query optimizer is used to calculate the beneﬁt of using the index.

For DB2, IBM has developed the DB2Adviser [52] which recommends indexes

with a method similar to AutoAdmin with the diﬀerence that only one call to the

query optimizer is needed since the enumeration algorithm is inside the optimizer

itself.

56

These commercial index selection tools are coupled to physical design options

provided by their respective query optimizers and therefore, do not reﬂect the pruning

that could be achieved by indexing multiple dimensions together.

3.2.4 Automatic Index Selection

The idea of having a database that can tune itself by automatically creating new

indexes as the queries arrive has been proposed [35, 18]. In [35] a cost model is used

to identify beneﬁcial indexes and decide when to create or drop an index at runtime.

[18] proposes an agent-based database architecture to deal with automatic index

creation. Microsoft Research has proposed a physical design alerter [11] to identify

when a modiﬁcation to the physical design could result in improved performance.

3.3 Approach

3.3.1 Problem Statement

In this section we deﬁne the problem of index selection for a multi-dimensional

space using a query workload.

A query workload W consists of a set of queries that select objects within a

speciﬁed subspace in the data domain. More formally, we deﬁne a workload as follows:

Deﬁnition 4 A workload W is a tuple W = (D, DS, Q), where D is the domain,

DS ⊆ D is a ﬁnite subset (the data set), and Q (the query set) is a set of subsets of

DS.

In our case, the domain is

d

, where d is the dimensionality of the dataset, the

instance DS is a set of n tuples t = {(a

1

, a

2

, ..., a

d

)|a

i

∈ }, and the query set Q is a

set of range queries.

57

Finding the answers to a query, when no index is present, reduces to scanning all

the points in the dataset and testing whether the query conditions are met. In this

scenario we can deﬁne the cost of answering the query as the time it takes to scan the

dataset, i.e. the time to retrieve the data pages from disk. The assumption is that

the time spend doing I/O dominates the time required to perform the simple bound

comparisons. In the case that an index is present, the cost of answering the query

can be lower. The index can identify a smaller set potential matching objects and

only those data pages containing these objects need to be retrieved from disk. The

degree to which an index prunes the potential answer set for a query determines its

eﬀectiveness for the query.

Our problem can be deﬁned as ﬁnding a set of indexes I, given a multi-dimensional

dataset DS, a query workload W, an optional indexing constraint C, an optional

analysis time constraint t

a

, that provides the best estimated cost over W. In the

context of this problem an index is considered to be the set of attributes that can

be used to prune subspace simultaneously with respect to each attribute. Therefore,

attribute order has no impact on the amount of pruning possible.

The overall goal of this work is to develop a ﬂexible index selection framework

that can be tuned to achieve eﬀective static and online index selection for high-

dimensionsal data under diﬀerent analysis constraints.

For static index selection, when no constraints are speciﬁed, we are aiming to

recommend the set of indexes that yields the lowest estimated cost for every query

in a workload for any query that can beneﬁt from an index. In the case when a

constraint is speciﬁed, either as the minimum number of indexes or a time constraint,

we want to recommend a set of indexes within the constraint, from which the queries

58

can beneﬁt the most. When there is a time-constraint, we need to automatically

adjust the analysis parameters to increase the speed of analysis.

For online index selection, we are striving to develop a system that can recommend

an evolving set of indexes for incoming queries over time, such that the beneﬁt of

index set changes outweighs the cost of making those changes. Therefore, we want an

online index selection system that diﬀerentiates between low-cost index set changes

and higher cost index set changes and also can make decisions about index set changes

based on diﬀerent cost-beneﬁt thresholds.

3.3.2 Approach Overview

In order to measure the beneﬁt of using a potential index over a set of queries, we

need to be able to estimate the cost of executing the queries, with and without the

index. Typically, a cost model is embedded into the query optimizer to decide on the

query plan, whether the query should be answered using a sequential scan or using

an existing index. Instead of using the query optimizer to estimate query cost, we

conservatively estimate the number of matches associated with using a given index

by using a multi-dimensional histogram abstract representation of the dataset. The

histogram captures data correlations between only those attributes that could be rep-

resented in a selected index. The cost associated with an index is calculated based on

the number of estimated matches derived from the histogram and the dimensionality

of the index. Increasing the size of the multi-dimensional histogram enhances the

accuracy of the estimate at the cost of abstract representation size.

While maintaining the original query information for later use to determine esti-

mated query cost, we apply one abstraction to the query workload to convert each

59

query into the set of attributes referenced in the query. We perform frequent itemset

mining over this abstraction and only consider those sets of attributes that meet a

certain support to be potential indexes. By varying the support, we aﬀect the speed

of index selection and the ratio of queries that are covered by potential indexes. We

further prune the analysis space using association rule mining by eliminating those

subsets above a certain conﬁdence threshold. Lowering the conﬁdence threshold im-

proves analysis time by eliminating some lower dimensional indexes from considera-

tion but can result in recommending indexes that cover a strict superset of the queried

attributes.

Our technique diﬀers from existing tools in the method we use to determine the

potential set of indexes to evaluate and in the quantization-based technique we use

to estimate query costs. All of the commercial index wizards work in design time.

The Database Administrator (DBA) has to decide when to run this wizard and over

which workload. The assumption is that the workload is going to remain static over

time and in case it does change, the DBA would collect the new workload and run the

wizard again. The ﬂexibility aﬀorded by the abstract representation we use allows

it to be used for infrequent, index selection considering a broader analysis space or

frequent, online index selection.

In the following two sections we present our proposed solution for index selection

which is used for static index selection and as a building block for the online index

selection.

60

Figure 3.1: Index Selection Flowchart

3.3.3 Proposed Solution for Index Selection

The goal of the index selection is to minimize the cost of the queries in the workload

given certain constraints. Given a query workload, a dataset, the indexing constraints,

and several analysis parameters, our framework produces a set of suggested indexes

as an output. Figure 3.1 shows a ﬂow diagram of the index selection framework.

Table 3.1 provides a list of the notations used in the descriptions.

We identify three major components in the index selection framework: the ini-

tialization of the abstract representations, the query cost computation, and the index

selection loop. In the following subsections we describe these components and the

dataﬂow between them.

Initialize Abstract Representations

The initialization step uses a query workload and the dataset to produce a set of

Potential Indexes P, a Query Set Q, and a Multi-dimensional Histogram H, according

to the support, conﬁdence, and histogram size speciﬁed by the user. The description

of the outputs and how they are generated is given below.

• Potential Index Set P

61

Symbol Description

P Potential Set of Indexes, the set of attribute sets under consid-

eration as a suggested indexes

Q Query Set, a representation of the query workload, for each

query. It consists of the attribute sets in P that intersect with

the query attributes, query ranges, and estimated query costs

H Multi-Dimensional Histogram, used as a workload-optimized ab-

stract representation of the data set

S Suggested Indexes, the set of attribute sets currently selected as

recommended indexes

i attribute set currently under analysis

support the minimum ratio of occurrence of an attribute in a query work-

load to be included in P

confidence the maximum ratio of occurrence of an attribute subset to the

occurrence of a set before the subset is pruned from P

histogram size the number of bits used to represent a quantized data object

Table 3.1: Index Selection Notation List

The potential index set P is a collection of attribute sets that could be beneﬁcial

as an index for the queries in the input query workload. This set is computed using

traditional data mining techniques. Considering the attributes involved in each query

from the input query workload to be a single transaction, P consists of the sets of

attributes that occur together in a query at a ratio greater than the input support.

Formally, support of a set of attributes A is deﬁned as:

S

A

=

n

¸

i=1

1 if A ⊆ Q

i

0 otherwise

n

where Q

i

is the set of attributes in the i

th

query and n is the number of queries.

For instance, if the input support is 10%, and attributes 1 and 2 are queried

together in greater than 10 percent of the queries, then a representation of the set

62

of attributes {1,2} will be included as a potential index. Note that because a subset

of an attribute set that meets the support requirement will also necessarily meet the

support, all subsets of attribute sets meeting the support will also be included as a

potential index (in the example above both the sets {1} and {2} will be included). As

the input support is decreased, the number of potential indexes increases. Note that

our particular system is built independently from a query optimizer, but the sets of

attributes appearing in the predicates from a query optimizer log could just as easily

be substituted for the query workload in this step.

If a set occurs nearly as often as one of its subsets, an index built over the subset

will likely not provide much beneﬁt over the query workload if an index is built over

the attributes in the set. Such an index will only be more eﬀective in pruning data

space for those queries that involve only the subset’s attributes. In order to enhance

analysis speed with limited eﬀect on accuracy, the input conﬁdence is used to prune

analysis space. Conﬁdence is the ratio of a set’s occurrence to the occurrence of a

subset.

While data mining the frequent attribute sets in the query workload in determin-

ing P, we also maintain the association rules for disjoint subsets and compute the

conﬁdence of these association rules. The conﬁdence of an association rule is deﬁned

as the ratio that the antecedent (Left Hand Side of the rule) and consequent (Right

Hand Side of the rule) appear together in a query given that the antecedent appears

in the query. Formally, conﬁdence of an association rule {set of attributes A}→{set

of attributes B}, where A and B are disjoint, is deﬁned as:

63

C

A→B

=

n

¸

i=1

1 if (A ∪ B) ⊆ Q

i

0 otherwise

n

¸

i=1

1 if A ⊆ Q

i

0 otherwise

where Q

i

is the set of attributes in the i

th

query and n is the number of queries.

In our example, if every time attribute 1 appears, attribute 2 also appears then

the conﬁdence of {1}→{2} = 1.0. If attribute 2 appears without attribute 1 as many

times as it appears with attribute 1, then the conﬁdence {2}→{1} = 0.5. If we have

set the conﬁdence input to 0.6, then we will prune the attribute set {1} from P, but

we will keep attribute set {2}.

We can also set the conﬁdence level based on attribute set cardinality. Since

the cost of including extra attributes that are not useful for pruning increases with

increased indexed dimensionality, we want to be more conservative with respect to

pruning attribute subsets. The conﬁdence can contain more than one value depending

on set cardinality.

While the apriori algorithm was appropriate for the relatively low attribute query

sets in our domain, a more eﬃcient algorithm such as the FP-Tree [30] could be applied

if the attribute sets associated with queries are too large for the apriori technique to

be eﬃcient. While it is desirable to avoid examining a high-dimensional index set

as a potential index, another possible solution in the case where a large number of

attributes are frequent together would be to partition a large closed frequent itemset

into disjoint subsets for further examination. Techniques such as CLOSET [44] could

be used to arrive at the initial closed frequent itemsets.

• Query Set Q

64

The query set Q is the abstract representation of the query workload is initialized

by associating the potential indexes that could be beneﬁcial for each query with

that query. These are the indexes in the potential index set P that share at least

one common attribute with the query. At the end of this step, each query has an

identiﬁed set of possible indexes for that query.

• Multi-Dimensional Histogram H

An abstract representation of the data set is created in order to estimate the query

cost associated with using each query’s possible indexes to answer that query. This

representation is in the form of a multi-dimensional histogram H. A single bucket

represents a unique bit representation across all the attributes represented in the

histogram. The input histogram size dictates the number of bits used to represent

each unique bucket in the histogram. These bits are designated to represent only

the single attributes that met the input support in the input query workload. If a

single attribute does not meet the support, then it can not be part of an attribute

set appearing in P. There is no reason to sacriﬁce data representation resolution for

attributes that will not be evaluated. The number of bits that each of the represented

attributes gets is proportional to the log of that attribute’s support. This gives more

resolution to those attributes that occur more frequently in the query workload.

Data for an attribute that has been assigned b bits is divided into 2

b

buckets.

In order to handle data sets with uneven data distribution, we deﬁne the ranges of

each bucket so that each bucket contains roughly the same number of points. The

histogram is built by converting each record in the data set to its representation

in bucket numbers. As we process data rows, we only aggregate the count of rows

with each unique bucket representation because we are just interested in estimating

65

query cost. Note that the multi-dimensional histogram is based on a scalar quantizer

designed on data and access patterns, as opposed to just data in the traditional case.

A higher accuracy in representation is achieved by using more bits to quantize the

attributes that are more frequently queried.

For illustration, Table 3.2 shows a simple multi-dimensional histogram example.

This histogram covers 3 attributes and uses 1 bit to quantize attributes 2 and 3, and

2 bits to quantize attribute 1, assuming it is queried more frequently than the other

attributes. In this example, for attributes 2 and 3 values from 1-5 quantize to 0, and

values from 6-10 quantize to 1. For attribute 1, values 1 and 2 quantize to 00, 3 and

4 quantize to 01, 5-7 quantize to 10, and 8 and 9 quantize to 11. The .’s in the ’value’

column denote attribute boundaries (i.e. attribute 1 has 2 bits assigned to it).

Note that we do not maintain any entries in the histogram for bit representa-

tions that have no occurrences. So we can not have more histogram entries than

records and will not suﬀer from exponentially increasing the number of potential

multi-dimensional histogram buckets for high-dimensional histograms.

Sample Dataset Histogram

A

1

A

2

A

3

Encoding Value Count

2 5 5 0000 00.0.0 2

4 8 3 0110 00.0.1 1

1 4 3 0000 01.0.0 1

6 7 1 1010 01.1.0 1

3 2 2 0100 01.1.1 1

2 2 6 0001 10.1.0 2

5 6 5 1010 11.0.0 1

8 1 4 1100 11.0.1 1

3 8 7 0111

9 3 8 1101

Table 3.2: Histogram Example

66

Query Cost Calculation

Once generated, the abstract representations of the query set Q and the multi-

dimensional histogram H are used to estimate the cost of answering each query using

all possible indexes for the query. For a given query-index pair, we aggregate the

number of matches we ﬁnd in the multi-dimensional histogram looking only at the

attributes in the query that also occur in the index (bits associated with other at-

tributes are considered to be don’t cares in the query matching logic). To estimate

the query cost, we then apply a cost function based on the number of matches we

obtain using the index and the dimensionality of the index. At the end of this step,

our abstract query set representation has estimated costs for each index that could

improve the query cost. For each query in the query set representation, we also keep

a current cost ﬁeld, which we initialize to the cost of performing the query using

sequential scan. At this point, we also initialize an empty set of suggested indexes S.

• Cost Function

A cost function is used to estimate the cost associated with using a certain index

for a query. The cost function can be varied to accurately reﬂect a cost model for the

database system. For example, one could apply a cost function that amortized the

cost of loading an index over a certain number of queries or use a function tailored

to the type of index that is used. Many cost functions have been proposed over the

years. For an R-Tree, which is the index type used for this work, the expected number

of data page accesses is estimated in [22] by:

A

nn,mm,FBF

=

d

1

C

eff

+ 1

d

67

where d is the dimensionality of the dataset and C

eff

is the number of data objects

per disk page. However, this formula assumes the number of points N approaches to

inﬁnity and does not consider the eﬀects of high dimensionality or correlations.

A more recently proposed cost model is given in [8] where the expected number

of pages accesses is determined as:

A

r,em,ui

(r) =

2r ·

d

N

C

eff

+ 1 −

1

C

eff

d

where r is the radius of the range query, d is the dataset dimensionality, N is the

number of data objects, and C

eff

is the capacity of a data page.

While these published cost estimates can be eﬀective to estimate the number of

page accesses associated with using a multi-dimensional index structure under certain

conditions, they have certain characteristics that make them less than ideal for the

given situation. Each of the cost estimate formulas require a range radius. Therefore

the formulas break down when assessing the cost of a query that is an exact match

query in one or more of the query dimensions. These cost estimates also assume that

data distribution is independent between attributes, and that the data is uniformly

distributed throughout the data space.

In order to overcome these limitations, we apply a cost estimate that is based on

the actual matches that occur over the multi-dimension histogram over the attributes

that form a potential index. The cost model for R-trees we use in this work is given

by

(d

(d/2)

∗ m)

where d is the dimensionality of the index and m is the number of matches returned for

query matching attributes in the multi-dimensional histogram. Using actual matches

68

eliminates the need for a range radius. It also ties the cost estimate to the actual

data characteristics (i.e. incorporates both data correlation between attributes and

data distribution, while the published models will produce results that are dependent

only on the range radius for a given index structure). The cost estimate provided

is conservative in that it will provide a result that is at least as great as the actual

number of matches in the database.

By evaluating the number of matches over the set of attributes that match the

query, the multi-dimensional subspace pruning that can be achieved using diﬀerent

index possibilities is taken into account. There is additional cost associated with

higher dimensionality indexes due to the greater number of overlaps of the hyperspaces

within the index structure, and additional cost of traversing the higher dimension

structure. A penalty is imposed on a potential index by the dimensionality term.

Given equal ability to prune the space, a lower dimensional index will translate into

a lower cost.

The cost function could be more complicated in order to more accurately model

query costs. It could model query cost with greater accuracy, for example by crediting

complete attribute coverage for coverage queries. It could also reﬂect the appropriate

index structures used in the database system, such as B+-trees. We used this partic-

ular cost model because the index type was appropriate for our data and query sets,

and we assumed that we would retrieve data from disk for all query matches.

Index Selection Loop

After initializing the index selection data structures and updating estimated query

costs for each potentially useful index for a query, we use a greedy algorithm that takes

into account the indexes already selected to iteratively select indexes that would be

69

appropriate for the given query workload and data set. For each index in the potential

index set P, we traverse the queries in query set Q that could be improved by that

index and accumulate the improvement associated with using that index for that

query. The improvement for a given query-index pair is the diﬀerence between the

cost for using the index and the query’s current cost. If the index does not provide

any positive beneﬁt for the query, no improvement is accumulated. The potential

index i that yields the highest improvement over the query set Q is considered to

be the best index. Index i is removed from potential index set P and is added to

suggested index set S. For the queries that beneﬁt from i, the current query cost is

replaced by the improved cost.

After each i is selected, a check is made to determine if the index selection loop

should continue. The input indexing constraints provides one of the loop stop criteria.

The indexing constraint could be any constraint such as the number of indexes, total

index size, or total number of dimensions indexed. If no potential index yields further

improvement, or the indexing constraints have been met, then the loop exits. The

set of suggested indexes S contains the results of the index selection algorithm.

At the end of a loop iteration, when possible, we prune the complexity of the

abstract representations in order to make the analysis more eﬃcient. This includes

actions such as eliminating potential indexes that do not provide better cost estimates

than the current cost for any query and pruning from consideration those queries

whose best index is already a member of the set of suggested indexes. The overall

speed of this algorithm is coupled with the number of potential indexes analyzed,

so the analysis time can be reduced by increasing the support or decreasing the

conﬁdence.

70

Diﬀerent strategies can be used in selecting a best index. The strategy provided

assumes a indexing constraint based on the number of indexes and therefore uses

the total beneﬁt derived from the index as the measure of index ’goodness’. If the

indexing constraint is based on total index size, then beneﬁt per index size unit

may be a more appropriate measure. However, this may result in recommending a

lower-dimensioned index and later in the algorithm a higher-dimensioned index that

always performs better. The recommendation set can be pruned in order to avoid

recommending an index that is non-useful in the context of the complete solution.

3.3.4 Proposed Solution for Online Index Selection

The online index selection is motivated by the fact that query patterns can change

over time. By monitoring the query workload and detecting when there is a change on

the query pattern that generated the existing set of indexes, we are able to maintain

good performance as query patterns evolve. In our approach, we use control feedback

to monitor the performance of the current set of indexes for incoming queries and

determine when adjustments should be made to the index set. In a typical control

feedback system, the output of a system is monitored and based on some function

involving the input and output, the input to the system is readjusted through a

control feedback loop. Our situation is analogous but more complex than the typical

electrical circuit control feedback system in several ways:

1. Our system input is a set of indexes and a set of incoming queries rather than

a simple input, such as an electrical signal.

2. The system output must be some parameter that we can measure and use to

make decisions about changing the input. Query performance is the obvious

71

parameter to monitor. However, because lower query performance could be

related to other aspects rather than the index set, our decision making control

function must necessarily be more complex than a basic control system.

3. We do not have a predictable function to relate system input and output because

of the non-determinism associated with new incoming queries. For example, we

may have a set of attributes that appears in queries frequently enough that our

system indicates that it is beneﬁcial to create an index over those attributes,

but there is no guarantee that those attributes will ever be queried again.

Control feedback systems can fail to be eﬀective with respect to response time.

The control system can be too slow to respond to changes, or it can respond too

quickly. If the system is too slow, then it fails to cause the output to change based

on input changes in a timely manner. If it responds too quickly, then the output

overshoots the target and oscillates around the desired output before reaching it.

Both situations are undesirable and should be designed out of the system.

Figure 3.2 represents our implementation of dynamic index selection. Our system

input is a set of indexes and a set of incoming queries. Our system simulates and

estimates costs for the execution of incoming queries. System output is the ratio of

potential system performance to actual system performance in terms of database page

accesses to answer the most recent queries. We implement two control feedback loops.

One is for ﬁne grain control and is used to recommend minor, inexpensive changes

to the index set. The other loop is for coarse control and is used to avoid very

poor system performance by recommending major index set changes. Each control

feedback loop has decision logic associated with it.

72

Figure 3.2: Dynamic Index Analysis Framework

System Input

The system input is made up of new incoming queries and the current set of indexes

I, which is initialized to be the suggested indexes S from the output of the initial

index selection algorithm. For clarity, a notation list for the online index selection is

included as Table 3.3.

System

The system simulates query execution over a number of incoming queries. The

abstract representation of the last w queries stored as W , where w is an adjustable

window size parameter. W is used to estimate performance of a hypothetical set of

indexes I

new

against the current index set I. This representation is similar to the one

kept for query set Q in the static index selection. In this case, when a new query

q arrives, we determine which of the current indexes in I most eﬃciently answers

73

Symbol Description

I Current set of attribute sets used as indexes

I

new

Hypothetical set of attribute sets used as indexes

w window size

W abstract represention of the last w queries

q the current query under analysis

P Current Potential Indexes, the set of attribute sets in consider-

ation to be indexes before q arrives

P

new

New Potential Indexes, the set of attribute sets in consideration

to be indexes after q arrives

i

q

the attribute set estimated to be the best index for query q

Table 3.3: Notation List for Online Index Selection

this query and replace the oldest query in W with the abstract representation of q.

We also incrementally compute the attribute sets that meet the input support and

conﬁdence over the last w queries. This information is used in the control feedback

loop decision logic. The system also keeps track of the current potential indexes P,

and the current multi-dimensional histogram H.

System Output

In order to monitor the performance of the system, we compare the query per-

formance using the current set of indexes I to the performance using a hypothetical

set of indexes I

new

. The query performance using I is the summation of the costs of

queries using the best index from I for the given query. Consider the possible new

indexes P

new

to be the set of attribute sets that currently meet the input support

and conﬁdence over the last w queries. The hypothetical cost is calculated diﬀerently

based on the comparison of P and P

new

, and the identiﬁed best index i

q

from P or

P

new

for the new incoming query:

74

1. P = P

new

and i is in I. In this case we bypass the control loops since we could

do no better for the system by changing possible indexes.

2. P = P

new

and i is not in I. We recompute a new set of suggested indexes I

new

over the last w queries. The hypothetical cost is the cost over the last w queries

using I

new

.

3. P = P

new

and i is in I. In this case we bypass the control loops since we could

do no better for the system by changing possible indexes.

4. P = P

new

and i is not in I. We traverse the last w queries and determine those

queries that could beneﬁt from using a new index from P

new

. We compute the

hypothetical cost of these queries to be the real number of matches from the

database. Hypothetical cost for other queries is the same as the real cost.

The ratio of the hypothetical cost, which indicates potential performance, to the

actual performance is used in the control loop decision logic.

Fine Grain Control Loop

The ﬁne grain control loop is used to recommend low cost, minor changes to index

set. This loop is entered in case 2 as described above when the ratio of hypotheti-

cal performance to actual performance is below some input minor change threshold.

Then the indexes are changed to I

new

, and appropriate changes are made to update

the system data structures. Increasing the input minor change threshold causes the

frequency of minor changes to also increase.

75

Coarse Control Loop

The coarse control loop is used to recommend more costly, but changes with

greater impact on future performance to the index set. This loop is entered in case 4 as

described above when the ratio of hypothetical performance to actual performance is

below some input major change threshold. Then the static index selection is performed

over the last w queries, abstract representations are recomputed, and a new set of

suggested indexes I

new

is generated. Appropriate changes are made to update the

system data structures to the new situation. Increasing the input major change

threshold increases the frequency of major changes.

3.3.5 System Enhancements

In the following subsections, we present two system enhancements that provide

further robustness and scalability to our framework.

Self Stabilization

Control feedback systems can be either too slow or too responsive in reacting

to input changes. In our application, a system that is slow to respond results in

recommending useful indexes long after they ﬁrst could have a positive eﬀect. It

could also fail to recommend potentially useful indexes if thresholds are set so that

the system is insensitive to change. A system that is too responsive can result in

system instability, where the system continously adds and drops indexes.

The system performance and rate of index change can be monitored and used

in order to tune the control feedback system itself. If the actual number of query

results is much lower than the estimated cost using the recommended index set over

a window of queries, this indicates a non-responsive system. When this condition is

76

detected, the system can be made more responsive by cutting the window size. This

increases the probability that P

new

will be diﬀerent from P, and we can reperform to

analysis applying a reduced support. This gives us a more complete P with respect

to answering a greater portion of queries.

If the frequency of index change is too high with little or no improvement in query

performance, an oversensitive system or unstable query pattern is indicated. We can

reduce the sensitivity by increasing the window size or increasing the support level

during recomputation of new recommended indexes.

Partitioning the Algorithm

The system can also be applied to environments where the database is parti-

tioned horizontally or vertically at diﬀerent locations. A solution at one extreme

is to maintain a centralized index recommendation system. This would maintain

the abstract representation and collect global query patterns over the entire system.

Recommended indexes would be determined based on global trends. This approach

would allow for the creation of indexes that maximize pruning over the global set of

dimensions. However, it would not optimize query performance at the site level. A

single set of indexes would be recommended for the entire system.

At the other extreme would be to perform the index recommendation at each

site. In this case, a diﬀerent abstract representation would be built based on the

data at each speciﬁc site given requests to the data at that site. Indexes would

be recommended based on the query traﬃc to the data at that site. This allows

tight control of the indexes to reﬂect the data at each site. However, index-based

pruning based on attributes and data at multiple locations would not be possible.

77

This approach also requires traﬃc to each site that contains potentially matching

data.

A hybrid approach can provide the beneﬁts of index recommendation based on

global data and query patterns while optimizing the index set at diﬀerent requesting

locations. A global set of indexes can be centrally recommended based on a global

abstract representation of the data set with low support thresholds (the centralized

location will store a larger set of globally good indexes). The query patterns emanat-

ing from each site can be mined in order to ﬁnd which of those global indexes would

be appropriate to maintain at the local site, eliminating some traﬃc.

3.4 Empirical Analysis

3.4.1 Experimental Setup

Data Sets

Several data sets were used during the performance of experiments. The variation

in data sets is intended to show the applicability of our algorithm to a wide range of

data sets and to measure the eﬀect that data correlation has on results. Data sets

used include:

• random - a set of 100,000 records consisting of 100 dimensions of uniformly

distributed integers between 0 and 999. The data is not correlated.

• stocks - a set of 6500 records consisting of 360 dimensions of daily stock market

prices. This data is extremely correlated.

78

• mlb - a set of 33619 records of major league pitching statistics from between the

years of 1900 and 2004 consisting of 29 dimensions of data. Some dimensions

are correlated with each other, while others are not at all correlated.

Analysis Parameters

The eﬀect of varying several analysis input parameters including support, multi-

dimensional histogram size, and online indexing control feedback decision thresholds

was analyzed. Unless otherwise speciﬁed, the confidence parameter for the experi-

ments is 1.0.

Query Workloads

It is desirable to explore the general behavior of database interactions without

inadvertently capturing the coupling associated with using a speciﬁc query history

on the same database. Therefore, query workload ﬁles were generated by merging

synthetic query histories and query histories from real-world applications with dif-

ferent data sets. Histories and data were merged by taking a random record from

a data set and the numerical identiﬁer of the attributes involved in the synthetic or

historical query in order to generate a point query. So, if a historical query involved

the 3rd and 5th attribute, and the n

th

record was randomly selected from the data

set, a SELECT type query is generated from the data set where the 3rd attribute

is equal to the value of the 3rd attribute of the n

th

record and the 5th attribute is

equal to the value of 5th attribute of the n

th

record. This gives a query workload that

reﬂects the attribute correlations within queries, and has a variable query selectivity.

Unless otherwise stated each query in experiments is a point query with respect to

79

the attributes covered by the query. Attributes that are not covered by the query can

be any value and still match the query.

These query histories form the basis for generating the query workloads used in

our experiments:

1. synthetic - 500 randomly generated queries. The distribution of the queries over

the ﬁrst 200 queries is 20% involve attributes {1,2,3,4} together, 20% {5,6,7},

20%, {8,9}, and the remaining queries involve between 1 and 5 attributes that

could be any attribute. Over the last 300 queries, the distribution shifts to

20% covering attributes {11,12,13,14}, 20% {15,16,17}, 20% {18,19}, and the

remaining 40% are between 1 to 5 attributes that could be any attribute.

2. clinical - 659 queries executed from a clinical application. The query distribution

ﬁle has 64 distinct attributes.

3. hr - 35,860 queries executed from a human resources application. The query

distribution ﬁle has 54 attributes. Due to the size of this query set, some initial

portion of the queries are used for some experiments.

3.4.2 Experimental Results

Index Selection with Relaxed Constraints

Figure 3.3 shows how accurately the proposed static index selection technique can

perform with relaxed analysis constraints. For this scenario, a low support value,

0.01, is used. There are no constraints placed on the number of selected indexes.

And 256 bits are used for the multi-dimension histogram. Figure 3.3 shows the cost

of performing a sequential scan over 100 queries using the indicated data sets, the

estimated cost of using recommended indexes, and the true number of answers for

80

Figure 3.3: Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin,

Naive, and the Proposed Index Selection Technique using Relaxed Constraints

the set of queries. For comparative purposes, costs for indexes proposed by AutoAd-

min and a naive algorithm are also provided. The naive algorithm uses indexes for

the ﬁrst n most frequent itemsets in the query patterns, where n is the number of

indexes suggested by the proposed algorithm. Note that the cost for sequential scan

is discounted to account for the fact that sequential access is signiﬁcantly cheaper

than random access. A factor of 10 is applied as the ratio of random access cost

to sequential access cost. While the actual ratio of a given environment is variable,

regardless of any reasonable factor used, the graph will show the cost of using our

indexes much closer to the ideal number of page accesses than the cost of sequential

scan.

Table 3.4 compares the proposed index selection algorithm with relaxed con-

straints against SQL Server’s index selection tool, AutoAdmin [15], using the data

sets and query workloads indicated.

81

Data Set/ Analysis % Queries Number of

Workload Tool Time(s) Improved Indexes

stock/ AutoAdmin 450 100 23

clinical Proposed 110 100 18

stock/ AutoAdmin 338 100 20

hr Proposed 160 100 16

mlb/ AutoAdmin 15 0 0

clinical Proposed 522 87 16

Table 3.4: Comparison of proposed index selection algorithm with AutoAdmin in

terms of analysis time and % queries improved

For the stock dataset using both the clinical and hr workloads, both algorithms

suggest indexes which will improve all of the queries. Since the selectivity of these

queries is low (the queries return a low number of matches using any index that

contains a queried attribute), the amount of the query improvement will be very

similar using either recommended index set. The proposed algorithm executes in

less time and generates indexes which are more representative of the query patterns

in the workload and allow for greater subspace pruning. For example, the most

common query in the clinical workload is to query over both attributes 11 and 44

together. SQL Server’s index recommendations are all single-dimension indexes for

all attributes that appear in the workload. However, our ﬁrst index recommendation

is a 2-dimension index built over attributes 11 and 44.

For the mlb dataset, SQL Server quickly recommended no indexes. Our index

selection takes longer in this instance, but ﬁnds indexes that improve 87 % of the

queries. These are all the queries that have selectivities low enough that an index

can be beneﬁcial. It should be noted that the indexes selected by AutoAdmin are

appropriate based on the cost models for the underlying index structure used in

82

SQL Server. Since the underlying index type for SQL Server is B+-trees, which do

not index multiple dimensions concurrently, the single dimension indexes that are

recommended are the least costly possible indexes to use for the query set. For this

particular experiment, a single dimensional index is not able to prune the subspace

enough to make the application of that index worthwhile compared to sequential scan.

Eﬀect of Support and Conﬁdence

Table 3.5 presents results on analysis complexity and expected query improvement

as support and confidence are varied. The results are shown for the stock data

set over the clinical query workload. Results show the total number of query-index

pairs analyzed over the query set in the index selection loop and the total estimated

improvement in query performance in terms of data objects accessed over the query

set as the index selection parameters vary. As confidence decreases, we maintain

fewer potential indexes in P and need to analyze fewer attribute sets per index. This

decreases analysis time but shows very little eﬀect on overall performance.

Using a conﬁdence level of 0% is equivalent to using the maximal itemsets of at-

tributes that meet support criteria as recommended indexes. For this example, the

strategy yields nearly identical estimated cost although only 34-44% of the query/in-

dex pairs need to be evaluated.

Baseline Online Indexing Results

A number of experiments were performed to demonstrate the eﬀectiveness and

characteristics of the adaptive system. In each of the experiments that show the

performance of the adaptive system over a sequence of queries, an initial set of indexes

is recommended based on the ﬁrst 100 queries given the stated analysis parameters.

83

Support Query/Index Pairs Total Improvement

Conﬁdence 100 50 0 100 50 0

2 229 229 90 57710 57710 57603

4 198 198 85 55018 55018 55001

6 185 185 73 46924 46924 46907

8 178 178 66 42533 42533 42516

10 170 170 58 37527 37527 37510

Table 3.5: Comparison of Analysis Complexity and Query Performance as support

and confidence vary, stock dataset, clinical workload

New queries are evaluated, and depending on the parameters of the adaptive system,

changes are made to index set. These index set changes are considered to take place

before the next query. The estimated cost of the last 100 queries given the indexes

available at the time of the query are accumulated and presented. This demonstrates

the evolving suitability of the current index set to incoming queries.

Figures 3.4 through 3.6 show a baseline comparison between using the query

pattern change detection and modifying the indexes and making no change to the

initial index set. A number of data set and query history combinations are examined.

The baseline parameters used are 5% support, 128 bits for each multi-dimension

histogram entry, a window size of 100, an indexing constraint of 10 indexes, a major

change threshold of 0.9 and a minor change threshold of 0.95.

Figure 3.4 shows the comparison for the random data set using the synthetic query

workload. At query 200, this workload drastically changes, and this is evident in the

query cost for the static system. The performance gets much worse and does not get

better for the static system. For the online system, the performance degrades a little

84

Figure 3.4: Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, synthetic Query Workload

when the query patterns are changing and then improve again once the indexes have

changed to match the new query patterns.

The synthetic query workload was generated speciﬁcally to show the eﬀect of a

changing query pattern. Figure 3.5 shows a comparison of performance for the random

data set using the clinical query workload. This real query workload also changes

substantially before reverting back to query patterns similar to those on which the

static index selection was performed. Performance for the static system degrades

substantially when the patterns change until the point that they change back. The

adaptive system is better able to absorb the change.

85

Figure 3.5: Baseline Comparative Cost of Adaptive versus Static Indexing, random

Dataset, clinical Query Workload

86

Figure 3.6: Baseline Comparative Cost of Adaptive versus Static Indexing, stock

Dataset, hr Query Workload

Figure 3.6 shows the comparison for the stock data set using the ﬁrst 2000 queries

of the hr query workload. The adaptive system shows consistently lower cost than

the static system and in places signiﬁcantly lower cost for this real query workload.

The eﬀect of range queries was also explored. The random data set was examined

using the synthetic workload using ranges instead of points. As the ranges for a given

attribute increased from a point value to 30% selectivity, the cost saved for top 3

proposed indexes (which were the index sets [1,2,3,4], [5,6,7], and [8,9] for all ranges),

the overall cost saved decreased. This is due to the increase in the size of the answer

87

set. When the range for each attribute reached 30%, no index was recommended

because the result set became large enough that sequential scan was a more eﬀective

solution.

Online Index Selection versus Parametric Changes

Changing online index selection parameters changes the adaptive index selection

in the following ways:

Support - decreasing the support level has the eﬀect of increasing the potential

number of times the index set will be recalculated. It also makes the recalculation of

the recommended indexes themselves more costly and less appropriate for an online

setting. Figure 3.7 shows the eﬀect of changing support on the online indexing results

for the random data set and synthetic query workload, while Figure 3.8 shows the

eﬀect on the stock data set using the ﬁrst 600 queries of hr query workload. These

graphs were generated using the baseline parameters and varying support levels. As

expected, lower support levels translate to lower initial costs because more of the ini-

tial query set are covered by some beneﬁcial index. For the synthetic query workload,

when the patterns changed, but do not change again (e.g. queries 100-200 in Figure

3.7), lower support levels translated to better performance. However, for the hr query

workload, the 10% support threshold yields better performance than the 6% and 8%

thresholds for some query windows. The frequent itemsets change more frequently

for lower support levels, and these changes dictate when decisions occur. These runs

made decisions at points that turned out to be poor decisions, such as eliminating

an index that ended up being valuable. Little improvement in cost performance is

achieved between 4% support and 2% support. Over-sensitive control feedback can

88

Figure 3.7: Comparative Cost of Online Indexing as Support Changes, random

Dataset, synthetic Query Workload

degrade actual performance, independent of the extra overhead that oversensitive

control causes.

Online Indexing Control Feedback Decision Thresholds - increasing the

thresholds decreases the response time of aﬀecting change when query pattern change

does occur. It also has the eﬀect of increasing the number of times the costly reanalysis

occurs. In a real system, one would need to balance the cost of continued poor

performance against the cost of making an index set change. Figure 3.9 shows the

eﬀect of varying the major change threshold in the coarse control loop for the random

89

Figure 3.8: Comparative Cost of Online Indexing as Support Changes, stock Dataset,

hr Query Workload

90

data set and synthetic query workload. Figure 3.10 shows the eﬀect of changing

the major change threshold for the stock data set and hr query workload. These

graphs were generated using the baseline parameters (except that the random graph

uses a support of 10%), and only varying the major change threshold in the coarse

control loop. The major change threshold is varied between 0 and 1. Here, a value

of 0 translates to never making a change, and a value of 1 means making a change

whenever improvement is possible. The graphs show no real beneﬁt once the major

change threshold increases (and therefore frequency of major changes) beyond 0.6.

Figure 3.9 shows that a value of 1 or 0.9 are best in terms of response time when a

change occurs, but a value of 0.3 shows the best performance once the query patterns

have stabilized. This indicates that this value should be carefully tuned based on the

expected frequency of query pattern changes.

Multi −dimensional Histogram Size - increasing the number of bits used to

represent a bucket in the multi-dimensional histogram improves analysis accuracy at

the cost of histogram generation time and space. Experiments demonstrated that

if the representation quality of the histogram was suﬃcient, very little beneﬁt was

achieved through greater resolution. Figure 3.11 shows an example of the eﬀect of

varying the multi-dimensional histogram size. Using a 32 bit size for each unique

histogram bucket yields higher costs than by using greater histogram resolution. One

reason for this is that artiﬁcially higher costs are calculated because of the lower

histogram resolution. As well, some beneﬁcial index changes are not recommended

because of these overly conservative cost estimates.

91

Figure 3.9: Comparative Cost of Online Indexing as Major Change Threshold

Changes, random Dataset, synthetic Query Workload

92

Figure 3.10: Comparative Cost of Online Indexing as Major Change Threshold

Changes, stock Dataset, hr Query Workload

93

Figure 3.11: Comparative Cost of Online Indexing as Multi-Dimensional Histogram

Size Changes, random Dataset, synthetic Query Workload

94

3.5 Conclusions

A ﬂexible technique for index selection is introduced that can be tuned to achieve

diﬀerent levels of constraints and analysis complexity. A low constraint, more complex

analysis can lead to more accurate index selection over stable query patterns. A more

constrained, less complex analysis is more appropiate to adapt index selection to

account for evolving query patterns. The technique uses a generated multi-dimension

histogram to estimate cost, and as a result is not coupled to the idiosyncrasies of

a query optimizer, which may not be able to take advantage of knowledge about

correlations between attributes. Indexes are recommended in order to take advantage

of multi-dimensional subspace pruning when it is beneﬁcial to do so.

These experiments have shown great opportunity for improved performance using

adaptive indexing over real query patterns. A control feedback technique is introduced

for measuring performance and indicating when the database system could beneﬁt

from an index change. By changing the threshold parameters in the control feedback

loop, the system can be tuned to favor analysis time or pattern change recognition.

The foundation provided here will be used to explore this tradeoﬀ and to develop

an improved utility for real-world applications. The proposed technique aﬀords the

opportunity to adjust indexes to new query patterns. A limitation of the proposed

approach is that if index set changes are not responsive enough to query pattern

changes then the control feedback may not aﬀect positive system changes. However,

this can be addressed by adjusting control sensitivity or by changing control sensitivity

over time as more knowledge is gathered about the query patterns.

95

From initial experimental results, it seems that the best application for this ap-

proach is to apply the more time consuming no-constraint analysis in order to de-

termine an initial index set and then apply a lightweight and low control sensitivity

analysis for the online query pattern change detection in order to avoid or make the

user aware of situations where the index set is not at all eﬀective for the new incoming

queries.

In this thesis, the frequency of index set change has been aﬀected through the use

of the online indexing control feedback thresholds. Alternatively, the frequency of

performance monitoring could be adjusted to achieve similar results and to appropri-

ately tune the sensitivity of the system. This monitoring frequency could be in terms

of either a number of incoming queries or elapsed time.

The proposed online change detection system utilized two control feedback loops in

order to diﬀerentiate between inexpensive and more time consuming system changes.

In practice, the ﬁne grain control threshold was not triggered unless we contrived a

situation such that it would be triggered. The kinds of low cost changes that this

threshold would trigger are not the kinds of changes that make enough impact to

be that much better than the existing index set. This would change if the indexing

constraints were very low, and one potential index is now more valuable than a

suggested index.

Index creation is quite time-consuming. It is not feasible to perform real-time

analysis of incoming queries and generate new indexes when the patterns change.

Potential indexes could be generated prior to receiving new queries, and when indi-

cated by online analysis, moved to active status. This could mean moving an index

from local storage to main memory, or from remote storage to local storage depending

96

on the size of the index. Additionally, the analysis could prompt a server to create a

potential index as the analysis becomes aware that such an index is useful, and once

it is created, it could be called on by the local machine.

97

CHAPTER 4

Multi-Attribute Bitmap Indexes

The signiﬁcant problems we have cannot be solved at the same level of

thinking with which we created them. - Albert Einstein (attributed).

An access structure built over multiple attributes aﬀords the opportunity to elim-

inate more of a potential result set within a single traversal of the structure than a

single attribute structure. However, this added pruning power comes with a cost.

Query processing becomes more complex as the number of potential ways in which

a query can be executed increases. The multi-attribute access structure may not

be eﬀective for general queries as a multi-attribute index can only prune results for

those attributes that match selection criteria of the query. This chapter presents

multi-attribute bitmap indexes. It describes the changes required when modifying a

traditionally single attribute access structure to handle multiple attributes in terms

of enhancing the query processing model and adding cost-based index selection.

4.1 Introduction

Bitmap indexes are widely used in data warehouses and scientiﬁc applications to

eﬃciently query large-scale data repositories. They are successfully implemented in

commercial Database Management Systems, e.g., Oracle [2], IBM [46], Informix [33],

98

Sybase [20], and have found many other applications such as visualization of scientiﬁc

data [56]. Bitmap indexes can provide very eﬃcient performance for point and range

queries thanks to fast bit-wise operations supported by hardware.

In their simplest and most popular form, equality encoded bitmaps also known

as projection indexes, are constructed by generating a set of bit vectors where a ’1’

in the i

th

position of a bit vector indicates that the i

th

tuple maps to the attribute-

value associated with that particular bit vector. In general, the mapping can be used

to encode a speciﬁc value, category, or range of values for a given attribute. Each

attribute is encoded independently.

Queries are then executed by performing the appropriate bit operations over the

bit vectors needed to answer the query. A range query over a value-encoded bitmap

can be answered by ORing together the bit vectors that correspond to the range

for each attribute, and then ANDing together these intermediate results between

attributes. The resultant bit vector has set bits for the tuples that fall within the

range queried.

One disadvantage of bitmap indexes is the amount of space they require. Com-

pression is used to reduce the size of the overall bitmap index. General compression

techniques are not desirable as they require bitmaps to be decompressed before ex-

ecuting the query. Run length encoders, on the contrary, allow the query execution

over the compressed bitmaps. The two most popular are Byte-Aligned Bitmap Code

(BBC) [2] and Word Aligned Hybrid (WAH) [57]. The CPU time required to an-

swer queries is related the total length of the compressed bitmaps used in the query

computation.

99

These traditional bitmap indexes are analogous to single dimension indexes in

traditional indexing domains. However, just as traditional multi-attribute indexes

can be more eﬀective in pruning data search space and reducing query time, multi-

attribute bitmap indexes could be applied to more eﬃciently answer queries. This

cost savings comes from two sources. First, the number of bitwise operations is

potentially reduced as more than one attribute are evaluated simultaneously. Second,

the bitvectors built over two or more attributes are sparser, which allows for greater

compression. The resulting smaller bitmaps makes the processing of the bitmap faster

which translates into faster query execution time.

As an example, consider a 2-dimensional query over a large dataset. In order to

avoid sequential scan over the data we build a index access structure that can direct

the search by pruning the space. If single dimensional indexes, such as B+-Trees or

projection indexes, are available over the two attributes queried, the query would ﬁnd

candidates for each attribute and then combine the candidate sets to obtain the ﬁnal

answer. In the case of B+-Trees, the set intersection operation is used to combine

the sets. In the case of projection indexes, the AND operation is used to combine the

bitmaps. However, if a 2-dimensional index such as an R-tree or Grid File exists over

the two queried attributes, then the data space can be pruned simultaneously over

both dimensions. This results in a smaller result set if the two dimensions indexed are

not highly correlated. Also a set intersection operation is not required. Similarly, a

multi-dimensional bitmap over a combination of attributes and values identiﬁes those

records that match the query combination and avoids a further bit operation in order

to answer the query criteria.

100

In this thesis, the use of Multi-Attribute Bitmaps is proposed to improve the

performance of queries in comparison with traditional single attribute indexes. The

goals of this research are to:

• Evaluate of query performance of multi-attribute bitmap enhanced data man-

agement system compared to the current bitmap index query resolution model.

• Support ad hoc queries where information about expected queries is not avail-

able and support the experimental analysis provided herein. We propose an

expected cost-improvement based technique to evaluate potential attribute set

combinations.

• Given a selection query over a set of attributes and a set of single and multiple

attribute bitmaps available for those attributes, determine the bitmap query

execution steps that results in the best query performance with minimal over-

head.

The proposed approaches to attribute combination and query processing are based

on measuring the expected query cost savings of options. Although the techniques

incorporate heuristics in order to control analysis search space and query execution

overhead, experiments show signiﬁcant opportunity for query execution performance

improvement.

4.2 Background

Bitmap Compression. Bitmap tables need to be compressed to be eﬀective

on large databases. The two most popular compression techniques for bitmaps are

the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) code

101

128 bits 1,20*0,4*1,78*0,30*1

31-bit groups 1,20*0,4*1,6*0 62*0 10*0,21*1 9*1

groups in hex 400003C0 00000000 00000000 001FFFFF 000001FF

WAH (hex) 400003C0 80000002 001FFFFF 000001FF

Figure 4.1: A WAH bit vector.

[59]. Unlike traditional run length encoding, these schemes mix run length encoding

and direct storage. BBC stores the compressed data in Bytes while WAH stores it in

Words. WAH is simpler because it only has two types of words: literal words and ﬁll

words. The most signiﬁcant bit indicates the type of word we are dealing with. Let

w denote the number of bits in a word, the lower (w−1) bits of a literal word contain

the bit values from the bitmap. If the word is a ﬁll, then the second most signiﬁcant

bit is the ﬁll bit, and the remaining (w −2) bits store the ﬁll length. WAH imposes

the word-alignment requirement on the ﬁlls. This requirement is key to ensure that

logical operations only access words.

Figure 4.1 shows a WAH bit vector representing 128 bits. In this example, we

assume 32 bit words. Under this assumption, each literal word stores 31 bits from the

bitmap, and each ﬁll word represents a multiple of 31 bits. The second line in Figure

4.1 shows how the bitmap is divided into 31-bit groups, and the third line shows the

hexadecimal representation of the groups.

The last line shows the values of WAH words. Since the ﬁrst and third groups do

not contain greater than a multiple of 31 of a single bit value, they are represented as

literal words (a 0 followed by the actual bit representation of the group). The fourth

group is less than 31 bits and thus is stored as a literal. The second group contains a

multiple of 31 0’s and therefore is represented as a ﬁll word (a 1, followed by the 0 ﬁll

102

B1 400003C0 80000002 001FFFFF 000001FF

B2 00000100 C0000003 00001F08

B1 AND B2 00000100 80000002 001FFFFF 00000108

Figure 4.2: Query Execution Example over WAH compressed bit vectors.

value, followed by the number of ﬁlls in the last 30 bits). The ﬁrst three words are

regular words, two literal words and one ﬁll word. The ﬁll word 80000002 indicates a

0-ﬁll of two-word long (containing 62 consecutive zero bits). The fourth word is the

active word, it stores the last few bits that could not be stored in a regular word.

Another word with the value nine, not shown, is needed to stores the number of

useful bits in the active word. Logical operations are performed over the compressed

bitmaps resulting in another compressed bitmap.

Bit operations over the compressed WAH bitmap ﬁle have been shown to be faster

than BBC (2-20 times) [57] while BBC gives better compression ratio. In this paper

we use WAH to compress the bitmaps due to the impact on query execution times.

Relationship Between Bitmap Index Size and Bitmap Query Execution

Time. Using WAH compression, queries are executed by performing logical opera-

tions over the compressed bitmaps which result in another compressed bitmap. As

an example, consider the two WAH compressed bitvectors in Figure 4.2.

First, the ﬁrst word of each bitvector is decoded, and it is determined that they

are both literal words. Therefore, the logical AND operation is performed over the

two words and the resultant word is stored as the ﬁrst word in the result bitmap.

When the next word is read from both bitmaps, we have two ﬁll words. The ﬁll word

from B1 is a zero-ﬁll word compressing 2 words, and the ﬁll word from B2 is a one-ﬁll

103

word compressing 3 words. In this case we operate over 2 words simultaneously by

ANDing the ﬁlls and appending a ﬁll word into the result bitmap that compresses

two words (the minimum between the word counts). Since we still have one word

left from B2, we only read a word from B1 this time, and AND it with the all 1

word from B2. Finally, the last two words are read from the two bitmaps and are

ANDed together as they are literal words. A full description of the query execution

process using WAH can be found in [59]. Query execution time is proportional to

the size of the bitmap involved in answering the query [56]. To illustrate the eﬀect

of bit density and consequently bitmap size on query times over compressed bitmaps

we perform the following experiment. We generated 6 uniformly random bitmaps

representing 2 million entries for a number of diﬀerent bit densities and ANDed the

bitmaps together. Figure 4.3 shows average query time against the total bitmap

length of the bitmaps used as operands in the series of operations.

The expected size of a random bitmap with N bits compressed using WAH is

given by the formula [59]:

m

R

(d) ≈

N

w −1

1 −(1 −d

B

)

2w−2

−d

2w−2

**where w is the word size, and d is the bit density.
**

The lower the bit density, the more quickly and more eﬀectively intermediate

results can be compressed as the number of AND operations increases. Figure 3 shows

a clear relationship between query times and bitmap length. The diﬀerent slopes of

the curve are due to changes in frequency of the type of bit operations performed

over the word size chunks of the bitmap operands. At low bit densities, there is a

greater portion of ﬁll word to ﬁll word operations. As the bit density increases, more

ﬁll word to literal word operations occur. At high bit densities, literal word to literal

104

Figure 4.3: Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps, 2M

entries

word operations become more frequent. As these inter-word operations have diﬀerent

costs, the overall query time to bitmap length relationship is not linear. However it

is clear that query execution time is related to overall bitmap length.

Relationship Between Number of Bitmap Operations and Bitmap Query

Execution Time. Figure 4.4 shows AND query times over WAH-compressed bitmaps

as the number of bitmaps ANDed varies. Again the bitmaps are randomly generated

with the reported bit densities for 2 million entries and the number of bitmaps in-

volved in a AND query is varied. There is clear relationship between query times and

the number of bitmap operations.

105

Figure 4.4: Query Execution Times vs Number of Attributes queried, Point Queries,

2M entries versus

Since query performance is related to the total bitmap length and number of

bitmap operations, a logical question is ”can we modify the representation of the data

in such a way that we can decrease these factors?” One could combine attributes and

generate bitmaps over multiple attributes in order to accomplish this. This is because

a combination over multiple attributes has a lower bit density and is potentially more

compressible, and already has an operation that may be necessary for query execution

built into its construction. However, several research questions arise in this scenario

such as which attributes should be combined, how can the additional indexes be

utilized, and what is the additional burden added by the richer index sets.

106

Technical Motivation for Multi-Attribute Bitmaps. Figure 4.5 shows a

simple example combining two single attributes with cardinality 2, into one 2-attribute

bitmap with cardinality 4. In the table, the combined column labeled b1.b2 means

that the value of attribute 1 maps to b1 and the value for attribute 2 maps to b2.

Such a representation can have the eﬀect of reducing the size of the bitmaps that are

used to answer a query and can also reduce the number of operations. As attributes

are combined, bitcolumns become sparser, and the opportunity for compressing that

column increases, potentially decreasing the size of a bitcolumn. Additionally, if a bit

operation is already incorporated in the multi-attribute bitmap construction, then it

does not need to be evaluated as part of the query execution.

One could think of multi-attribute bitmaps as the AND bitmap of two or more

bitmaps from diﬀerent attributes. One could argue that multi-attribute bitmaps are

reducing the dimensionality of the dataset but increasing the cardinality of the at-

tributes involved in the new dataset. Although bitmap indexes have historically been

discouraged for high-cardinality attributes, the advances in compression techniques

and query processing over compressed bitmaps have made bitmaps an eﬀective solu-

tion for attributes with cardinalities on the order of several hundreds [58].

As an illustrative example of enhancing compression from combining attributes,

consider a bitmap index built over two attributes, each with cardinality 10, where

the data is uniformly randomly distributed independently for each attribute for 5000

tuples. Using WAH compression, there are very few instances where there are long

enough runs of 0’s to achieve any compression (since out of 10 random tuples, we

would expect 1 set bit for a value). In such a test which matched this scenario, very

107

Attribute 1 Attribute 2 Combined

Tuple b1 b2 b1 b2 b1.b1 b1.b2 b2.b1 b2.b2

t

1

1 0 0 1 0 1 0 0

t

2

0 1 1 0 0 0 1 0

t

3

1 0 1 0 1 0 0 0

t

4

1 0 0 1 0 1 0 0

t

5

0 1 0 1 0 0 0 1

t

6

0 1 1 0 0 0 1 0

Figure 4.5: Simple multiple attribute bitmap example showing the combination two

single attribute bitmaps.

little compression was achieved and bitmaps representing each of the 20 attribute-

value pairs were between 640 and 648 bytes long.

As an alternative, we can generate a bitmap index for the combination of the two

attributes. Now the cardinality of the combined attributes is 100, and the expected

number of set bits in a run of 100 tuples for a single value is 1. We can expect much

greater compression for this situation. In the given scenario, the generated size of the

100 2-D bitmaps was between 220 and 400 bytes.

Now consider a range query such that attribute 1 must be between values 1 and

2, and attribute 2 must be equal to 3. The operations using the traditional bitmap

processing to answer the query is ((ATT1=1) OR (ATT1=2)) AND (ATT2=3). The

time required is dependent on the total length of the bitmaps involved. Each of

these bitmaps as well as the resultant intermediate bitmap from the OR operation is

648 bytes, for a total of 2592 bytes. For the multi-attribute case, the query can be

answered by (ATT1=1,ATT2=3) OR (ATT1=2,ATT2=3). These bitmaps are 268

and 260 bytes respectively, for a total of 528 bytes. In this scenario, we can reduce

108

both the size of the bitmaps involved in the query execution and also the number of

bitmaps involved.

This example demonstrates the opportunity to improve query execution by rep-

resenting multiple attribute values from separate attributes in a bitmap. In general,

point queries would always beneﬁt from the multi-attribute bitmaps as long as all the

combined attributes are queried. However, if only some of the attributes were queried

or a large range from one or more attributes were queried, then the single attribute

bitmaps would perform better than the multi-attribute bitmaps. This is the reason

why we propose to supplement the single attribute bitmaps with the multi-attribute

bitmaps and not to replace the single attribute bitmaps. Multi-attribute bitmaps

would only be used in the queries that result in an improved query execution time.

Supplementing single attribute bitmaps with additional bitmaps that improve

query execution time has also been done in [50], where lower resolution bitmaps were

used in addition to the single attribute bitmaps or high resolution bitmaps. One could

think of this approach as storing the OR of several bitmaps from the same attribute

together.

4.3 Proposed Approach

In this section we present our proposed solution for the problems presented in the

previous section. The primary overall tasks can be summarized as multi-attribute

bitmap index selection and query execution. In index selection, we determine which

attributes are appropriate or the most appropriate to combine together. This can be

performed with varying degrees of knowledge about a potential query set. Once can-

didate attribute combinations are determined, the combined bitmaps are generated.

109

In query execution, the candidate pool of bitmaps are explored to determine a plan

for query execution.

4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc

Queries

In many cases the attributes that should be combined together as multi-attribute

bitmaps are obvious given the context of the data and queries. Whenever multiple

attributes are frequently queried together over a single bitmap value, that combination

of attributes is a candidate for multi-attribute indexing. As an example, consider a

database system behind a car search web form. If the form has categorical entries

for color and model, then a multi-attribute bitmap built over these two attributes

can directly ﬁlter those data objects that meet both search criteria without further

processing.

In order to support situations where no contextual information is available about

queries and in order to allow for comparative performance evaluation, we provide a

cost improvement based candidate evaluation system. The goal behind bitmap index

selection is to ﬁnd sets of attributes that, if combined, can yield a large beneﬁt in terms

of query execution time. The analogy within the traditional multi-attribute index

selection domain would be ﬁnding attributes that are not individually eﬀective in

pruning data space, but can eﬀectively prune the data space when combined together.

An eﬀective index selection technique for bitmap indexes should be cognizant of

the relationships between attributes. As well, the selectivity estimated for a given

attribute combination should accurately reﬂect the true data characteristics.

Since query execution time is dependent both on the total size of the operands

involved in a bitmap computation and also the number of bitmap operations, we use

110

these measures to determine the potential beneﬁt for a given attribute combination.

Algorithm 3 shows how candidate multi-attribute indexes are evaluated and selected

in terms of potential beneﬁt.

Determining Index Selection Candidates

SelectCombinationCandidates (Ordered set of attributes A,

dLimit, cardLimit)

notes:

attSets - is a list of Attribute Sets and corresponding goodness

measures

selected - are a set of recommended attribute sets to build indexes

over

1: attSets = {[{a

i

}, 0] |a

i

∈ A}

2: currentDim = 1

3: while (currentDim < dLimit AND newSetsAdded)

4: unset newSetsAdded

5: for each as

i

∈ attSets, s.t. |as

i

| = currentDim

6: for each attribute a

j

, j = (maxAtt(as

i

)+1) to |A|

7: if (as

i

.card*a

j

.card¡cardLimit)

8: as

j

= [as

i

∪ {a

j

}, detGoodness(as

i

, a

j

)]

9: if goodness(as

j

) > goodness(as

i

)

10: add as

j

to attSets; set newSetsAdded

11: currentDim + +

12: sort attSets by goodness

13: ca = ∅

14: while ca = A

15: attribute set as = next set from attSets

16: if as ∩ ca = ∅

17: ca = ca ∪ as.attributes

18: include as in selected

Algorithm 3: Selecting Attribute Combination Candidates.

111

The algorithm is somewhat analogous to the Apriori method for frequent itemset

mining in that a given candidate set is only evaluated if it’s subset is considered bene-

ﬁcial. The algorithm takes several parameters as input. These include an array, A, of

the attributes under analysis. The parameter dLimit provides a maximum attribute

set combination size. The maximum cardinality allowed for a given prospective com-

bination can be controlled by the cardLimit parameter.

The algorithm evaluates increasingly larger attribute sets (loop starting at line 3).

Each unique combination is evaluated (lines 5-6). Initially all size 1 attribute sets are

combined with each other size 1 attribute set such that each unique size 2 attribute

set is evaluated. In the next iteration of the outer loop, each potentially beneﬁcial

size 2 set is evaluated with relevant size 1 attributes (note that by evaluating only

those singleton sets of attributes numbered larger than the largest in the current set,

we can evaluate all unique set instances). If the number of bitmaps generated by

this combination is excessive (line 7), the set will not be evaluated. The card value

associated with an attribute set is the number of value combinations possible for the

attribute set, and reﬂects the number of bitmaps that would have to be generated to

index based on the set. A goodness measure is evaluated for the proposed combination

(line 8). If the goodness of the combination exceeds the goodness of its ancestor, then

the combination set is added to the list of attribute sets for further evaluation during

the next loop iteration. The term newSetsAdded in line 3 is a boolean condition

indicating if any attribute set was added during the last loop iteration.

Once the loop terminates, the sets in attSets are sorted by their goodness mea-

sures (line 12). Then sets of attributes that are disjoint from the covered attributes,

112

ca, are greedily selected from the sorted sets and added to the recommended bitmap

indexes.

Estimating Performance Gain for a Potential Index

detGoodness (attribute set as

i

, attribute a

j

)

1: set probQT and probSR

2: goodness=0

3: for each bitcolumn bc

k

in as

i

4: for each bitcolumn bc

j

in a

j

5: bitcolumn cb = AND(bc

k

, bc

j

)

6: savings = bc

k

.length + bc

j

.length+

bc

k

.creationLength −cb.length

7: if savings > 0

8: goodness+ = savings ∗ cb.matches

9: goodness∗ = (probQT ∗ probSR)

|as

i

|+1

10: return goodness

Algorithm 4: Determine goodness of including an attribute in a multi-attribute

bitmap.

The goodness of a particular attribute set is computed by Algorithm 4. The

variables, probQT and probSR describe global query characteristics and are used to

estimate diminishing beneﬁts as index dimensionality increases. The probQT pa-

rameter represents the probability that an attribute is queried. As with traditional

multi-attribute index, a 2-attribute index is not as useful for a single attribute query.

The probSR parameter is the probability that the query range for an attribute will

be small enough such that a multi-attribute index over the query criteria will still be

beneﬁcial.

113

Each unique value combination of the entire attribute set (combinations generated

in lines 4-5) is evaluated in terms of bitmap length savings over a corresponding

single attribute bitmap solution. The goodness of an attribute set is measured as the

weighted average of the bitlength savings of each value combination (line 8). The

savings of a particular combination is measured as the sum of the compressed lengths

of the bitmap operands and the compressed length of the bitmaps needed to create

the ﬁrst operand, minus the compressed length of the combined bitmap. For size 1

attribute sets, the bitcolumn creation length is 0. However, for a multiple attribute

bitmap this value is the length of all the constituent bitmaps used to create it. The

total savings reﬂects the diﬀerence in total bitmap operand length to answer the

combination under analysis.

If the savings for a particular combination is negative, then its negative savings

is not accumulated (line 7). This is because this option will not be selected during

the query execution for this combination and will not incur an additional penalty.

Each combination is weighted by its frequency of occurrence. This strategy assumes

that the query space is similar in distribution to the data space. The ﬁnal goodness

measure for an attribute set is also modiﬁed based on the probability that beneﬁt

will actually be realized for the query (line 9).

This technique ﬁnds attributes for which their combinations will lead to a reduc-

tion in the number of operations as well as a reduction in the constituent bitmap

lengths. Both of these factors can provide a performance gain depending on the

query criteria. Our approach takes into account the characteristics of the data by

weighting and measuring the eﬀectiveness of speciﬁc inter-attribute combinations.

The index selection solution space is controlled by limiting parameters and through

114

greedy selection of potential solutions. Although the indexes recommended are not

guaranteed to be optimal, they can be determined eﬃciently. Later experiments show

that the reduction in the number of total operations has a more substantial eﬀect on

query execution time than reduction of bitmap length, (i.e. seemingly less than ideal

attribute combinations can still provide performance gains).

The behavior of the algorithm can be aﬀected with slight alteration. For example,

if index space is critical, we can order the pairs in line 12 of the Select Combination

Candidates algorithm (Algorithm 3) by goodness over combined attribute size. If the

query is equally likely to occur in the data space, then the weighting in line 8 of the

determine Goodness algorithm (Algorithm 4) can be removed.

Once bitmap combinations are recommended, each combined bitmap can be gen-

erated by assigning an appropriate identiﬁer to the bitmap and ANDing together the

applicable single attribute bitmaps. Once multi-attribute bitmaps are available in the

system, we need a mechanism to decide when single attribute and/or multi-attribute

bitmaps would be used. In other words, we need to decide on a query execution plan

for each query.

4.3.2 Query Execution with Multi-Attribute Bitmaps

One issue associated with using multiple representations for data indexes, is that

the query execution becomes more complex. In order to take advantage of the most

useful representation, the potential execution plans must be evaluated and the op-

tion that is deemed better should be performed. To be eﬀective, the complexity

added by the additional representations can not be more costly than the beneﬁts

attained by incorporating them. We take a number of measures in order to limit the

115

overhead associated with selecting the best query execution option. We keep track

of those attribute sets that have bitmap representations (either single attribute or

multi-attribute). We use a unique mapping for bitmap identiﬁcation. We explore

potential query execution options using inexpensive operations.

Global Index Identiﬁcation

In order to identify the potential index search space, we maintain a data structure

of those attributes and combined attributes for which we have bitmap indexes con-

structed. If we have available a single attribute bitmap for the ﬁrst attribute, we will

include a set identiﬁed as [1] in this structure. If we have available a multi-attribute

bitmap built over attributes 1,2, and 3 then we include the set [1,2,3] in the struc-

ture. Upon receiving a query, we compare the set representation of the attributes

in the query to the sets in the index identiﬁcation structure. Those sets that have

non-empty intersections with the query set are candidates for use in query execution.

Bitmap Naming

We need to be able to easily translate a query to the set of bitmaps that match the

criteria of a query. In order to do this we devise a one-to-one mapping that directly

translates query criteria to bitmap coverage. The key is made up of the attributes

covered by the bitmap and the values for those attributes. For example, a multi-

attribute bitmap over attributes 9 and 10 where the value for attribute 9 is 5 and the

value for attribute 10 is 3 would get ”[9,10]5.3” as its key. The attributes are always

stored in numerical order so that there is only one possible mapping for various set

orders, and the values are given in the same order as the attributes. Finding relevant

bitmaps is a matter of translating the query into its key to access the bitmaps.

116

Query Execution

During query execution we need a lightweight method to check which potential

query execution will provide the fastest answer to the query. The ﬁrst step is to

compare the set of attributes in the query to the sets of attributes for which we have

constructed bitmap indexes. Those indexes which cover an attribute involved in the

query are candidates to be used as part of the solution for the query. Once candidate

indexes are determined, we compute the total length of the bitmaps required to answer

the query.

A recursive function is used to compute the total length of all matching bitmaps

in order to handle arbitrary sized attribute sets. The complete unique hash key is

generated in the base case of the recursive method and the length of the matching

bitmaps is obtained and returned to the calling caller. For the recursive case, the

unique key is built up and passed to the recursive call, and the callee length results

are tabulated.

We just need to sum the lengths of bitmaps that match the query criteria, which

we can compute using the recursive function. For example, if the total length of

matching bitmaps for the multi-attribute bitmap over attributes [1,2] is less than the

total length for both attributes [1] and [2], then the multi-attribute bitmap should be

used to answer the query. However, if the multi-attribute bitmaps have greater total

length, then the single attribute bitmaps should be used.

The logic applied to determine the query execution plan needs to be lightweight

so as to not add too much overhead to handle the increased complexity of bitmap

representation. So rather than incorporating logic that guarantees an optimal minimal

total bitmap length, we approximate this behavior using heuristics. We order the

117

potential attribute sets in descending order based on the amount of length saved by

using the index. As a baseline, all potential single attribute indexes get a value of

0 for the amount of length saved. Multi-attribute indexes get a value which is the

diﬀerence between the length of the single attribute indexes that would be required

to answer the query and the length of the multi-attribute index to answer the query.

Then we process the query by greedily selecting non-covered attribute sets from this

order until the query is answered. Those multi-attribute indexes that have positive

beneﬁt compared to single attribute indexes will have positive value, and will be

processed before any single attribute index. Those that have negative values, would

actually slow down the query, and will not be processed before their constituent single

attribute indexes.

Algorithm 5 provides the algorithm for executing selection queries when the entire

bitmap index has both single attribute and multi-attribute bitmap indexes available to

process the query. It involves two major steps. In the ﬁrst step, the potential indexes

are prioritized based on how promising they are with respect to saving total bitmap

length during query execution. The second step performs the bit operations required

to answer the query. The algorithm assumes that the bitmap indexes available are

identiﬁed and that we can access any bitcolumn through a unique hashkey.

In lines 1-5, we compute the total size of the bitmaps that match the query

considering those attributes that intersect between the query and the index attribute

set currently under analysis. For example, consider a query over values 1 and 2 for

attribute 1, and and over value 3 for attribute 2 where we have bitmaps available for

the attribute sets [1], [2], and [1,2] combined. Table 4.1 shows the query matching

keys for each of the potential indexes. At the end of this section, we will have a total

118

SelectionQueryExecution (Available Indexes AI,

BitColumn Hashtable BI, query q)

notes:

AI - contains identiﬁcation of sets of indexed attributes,

with length and saved ﬁelds and query matching

keys list

BI - provides access to bitcolumn using unique key

B1 - a bitcolumn of all 1’s

// Prioritize Query Plan

1: for each attribute set as in AI

2: if as.attributes ∩ q.attributes = ∅

3: generate query matching keys for as

4: attach query matching keys to as

5: sum query matching bitmap lengths

6: for each attribute set as in AI

7: if as.numAtts == 1

8: as.saved = 0

9: else

10: for each size 1 attribute subset a in as

11: as.saved += a.length

12: as.saved -= as.length

13: sort AI by saved ﬁeld

// Execute Query

14: attribute set ca = null

15: queryResult = B1

16: while ca = q.attributes

17: as = next attribute set from AI

18: if as.attributes ∩ ca = ∅

19: ca = ca ∪ as.attributes

20: subresult = BI.get(as.keys[0])

21: for i = 2 to |as.keys|

22: subresult = OR(subresult,BI.get(as.keys[i]))

23: queryResult = AND(queryResult, subresult)

24: return queryResult

Algorithm 5: Query Execution for Selection Queries when Combined Attribute

Bitmaps are Available.

119

query matching bitmap length for each available index. Note that the computation

of bitmap length in line 5 is not actually a simple sum. The bit density of each

matching bitmap is estimated based on its length. Since the matching bitmaps will

not overlap, these bit densities are summed to arrive at an eﬀective bit density of the

resultant OR bitmap. Length of the resultant bitmap is then derived using eﬀective

bit density. If the bit density is within the range of about 0.1 to 0.9, then its value

can not be determined. In these cases, we conservatively estimate its value as 0.5.

This does not adversely aﬀect query operation since the multi-attribute option would

not be selected anyway.

Available Index Matching Keys

[1] [1]1,[1]2

[2] [2]3

[1, 2] [1, 2]1.3 , [1, 2]2.3

Table 4.1: Sample Query Matching Keys for Query Attributes 1 and 2.

In lines 6-12, we use this initial information in order to compute the total bitmap

length saved by using a multi-attribute bitmap in relation to using the relevant single

attribute alternatives. For a multiple attribute index, we sum up the query matching

bitmap length associated with each single attribute subset of the multi-attribute

attribute set. The diﬀerence between this length and the total length of the query

matching bitmaps for the multi-attribute set is the amount of bitmap length saved by

using the multi-attribute set. All single attribute indexes are assigned a saved value

of 0. Those multi-attribute indexes that reﬂect a cost savings will get a positive value

120

for saved, while those that are more costly will get a negative value. In line 13, we

sort the potential indexes in terms of this saved value.

In lines 14 and 15, we initialize a set that indicates the attributes covered by the

query so far and initialize the query result to all 1’s. Then we start processing the

query in order of how promising the prospective indexes were an continue until all

query attributes are covered (line 15). We get the next best index, and check if it’s

attributes have already been covered (line 17,18). If not, we will process a subquery

using the index. We update the covered attribute set and OR together all bitcolumns

that match the queries relevant to the query and the index. We AND together this

result to build the subqueries into the overall query result. When we have covered all

query attributes, the query is complete and we return the result.

4.3.3 Index Updates

Bitmap indexes are typically used in write-once read-mostly databases such as

data warehouses. Appends of new data are often done in batches and the whole index

is rebuilt from scratch once the update is completed. Bitmap updates are expensive

as all columns in the bitmap index need to be accessed. In [12], we propose the use of

a padword to avoid accessing all bitmaps to insert a new tuple but only to update the

bitmaps corresponding to the inserted values. In other words, the number of bitmaps

that need to be updated when a new tuple is appended in a table with A attributes

is A as opposed to Σ

A

i=1

c

i

, where c

i

is the number of bitmap values (or cardinality)

for attribute A

i

. This approach, which clearly outperforms the alternative, is suitable

when the update rate is not very high and we need to maintain the index up-to-date

in an online manner.

121

Updating multi-attribute bitmaps is even more eﬃcient than updating traditional

bitmaps. If multi-attributes bitmaps are also padded we only need to access the

bitmaps corresponding to the combined values (attributes) that are being inserted.

For example, consider the previous table with A attributes where multi-attribute

bitmaps are built over pair of attributes, the cost of updating the multi-attribute

bitmaps is half of the cost of updating the traditional bitmaps, as only

A

2

bitmaps

need to be accessed.

4.4 Experiments

4.4.1 Experimental Setup

Experiments were executing using a 1.87GHz HP Pavilion PC with 2GB RAM

and Windows Vista operating system. The bitmap index selection, bitmap creation

and query execution algorithms were implemented in Java.

Data

Several data sets are used during experiments. They are characterized as follows:

• synthetic - 12 attributes, 1M records, attributes 1-3 have cardinality 2, 4-6 have

cardinality 5, 7-9 have cardinality 10, and 10-12 have cardinality 25. Attributes

1,4,7,10 follow uniform random distribution, attributes 2,5,8,and 11 follow Zipf

distribution with s parameter set to 1, attributes 3,6,9, and 12 follow Zipf

distribution with s parameter set to 2.

• uniform - uniformly distributed random data set. It has 12 attributes, each

with cardinality 10, 1M records.

122

• census - a discretized version of the USCensus1990raw data set. The raw dataset

is from the (U.S. Department of Commerce) Census Bureau website. The dis-

cretized dataset comes from the UCI Machine Learning Repository. We used the

ﬁrst 1 million instances and 44 attributes. The cardinality of these attributes

varies from 2 to 16. The data set can be obtained from [47]

• HEP - a bin representation of High-Energy Physics data. It has 12 attributes.

The cardinality per dimensional varies between 2 and 12. There are a total of

122 bitcolumns for this data set. This data set is not uniformly distributed.

This data set is made up of more than 2 million records.

• Landsat - a bin representation of satellite imagery data. The data set is 60

attributes, each cardinality 10. It contains over 275000 records.

Queries

Query sets were generated by selecting random points from the data and con-

structing point queries from those points. These points remain in the data set, so

that it is guaranteed that there will be at least one match for each query. For those

cases where range queries are used, the range is expanded around a selected point so

that the range is guaranteed to contain at least one match.

Metrics

Our goal is to compare performance of a system reﬂecting the current state of the

art, where only single attribute bitmaps are available, against our proposed approach

in which both single and multi-attribute bitmaps are available. We present compar-

isons in terms of query execution time, number of bitmaps involved in answering a

123

query, and total bitmap size accessed to answer a query. Query execution assumes

the bitmap indexes are in main memory. Further performance enhancement would

be possible during query execution if the required bitmaps need to be read from disk.

4.4.2 Experimental Results

Index Selection

We executed the index selection algorithm over the synthetic data set. We set the

cardLimit parameter to 100 and set the query probability parameters to 1 (i.e. no

degradation at higher number of attributes). The attribute sets that were selected

to combine together were [1,4,7], [2,5,8], and [3,6,9]. This solution makes sense in

the context of the performance metric of weighted beneﬁt. We would expect more

expected beneﬁt for combinations for which the number of value combinations ap-

proaches the cardinality limit, because a sparser bitmap can be more eﬀectively com-

pressed. We would also expect more expected beneﬁt from combining attributes that

are independent. Therefore set [1,4,7] is selected ﬁrst, because it’s value combination

cardinality is 100, and the values are independent. The other combinations have

greater inter-attribute dependency because they follow Zipf distributions. Attributes

10, 11, and 12 do not combine with any others in this scenario because there are

no free attributes that would be within cardinality constraints. The average query

execution of 100 point queries using only single attribute bitmaps was 41.47ms, while

it was 17.39ms using the recommended indexes. The time for using recommended

index set includes an average overhead of 0.25ms to explore the richer set of options

and select the best query execution plan.

After executing index selection with parameters probQT and probSR set to 0.5,

we instead get 2 attribute recommendations. The proposed set of indexes is [7,8],

124

Range Length

Query A1 A2

Q1 1 1

Q2 1 2

Q3 1 5

Q4 2 5

Q5 2 5

Q6 5 5

Table 4.2: Query Deﬁnitions for 2 Attributes with cardinality 10 and uniform random

distribution.

[1,10], [2,11], [4,9], [5,6], and [3,12]. Each of these has a value combination cardinality

greater than or equal to 50 and selection order is dependent on cardinality and then

inter-attribute independence.

Query Execution

Controlled Query Types. We ran 6 query types over bitmaps built over 2

attributes with uniform random distribution and cardinality 10. The queries are

detailed in Table 4.2 and varied in the length of the range. A range of 1, means that

the query for that attribute is a point query. Q1, for example, is a point query on

both attributes, and Q3 selects one value from the ﬁrst attribute and ﬁve values from

the second attribute. Note that in this case, 5 is the worst case scenario. Any range

bigger than ﬁve could take the complement of the non-queried range to answer the

query.

Figures ?? and 4.8 show the query execution time, number of bitmaps involved in

answering each query type, and total size of the bitmaps involved in answering the

query, respectively. In these ﬁgures, The ‘Only 1-Dim’ label refers to the case where

only Single Attribute Bitmaps are available, the ‘Only M-Dim’ label refers to the

125

case where the queries are forced to use Multi-Attribute Bitmaps, and the ‘Proposed’

label refers to the case where the query execution plan is selected considering both

indexes available.

For queries Q1-Q4, the multi-attribute bitmaps are always selected, and they are

signiﬁcantly faster than the single-attribute options. For queries Q5-Q6, if single and

multi-attribute bitmaps are available, single attribute is always selected. However,

the best is always slightly faster than the proposed technique due to the overhead of

evaluating and selecting the best option. For these queries, the overhead was on the

order of 0.2ms. The proposed technique usually selects the option that involves the

shortest original bitmaps required to answer the query. The exception here is Q5.

The single dimension index is used because it involves fewer bitmap operations and

involves less total bitmap operand length over the set of operations that need to take

place.

Figure 4.6: Query Execution Time for each query type.

126

Figure 4.7: Number of bitmaps involved in answering each query type.

Query Execution for Point Queries. Figure 4.9 demonstrates the potential

query speedup associated with our proposed query processing technique when multi-

attribute bitmaps are available to answer queries. The single-D bar shows the average

query execution time required to answer 100 point queries using traditional bitmap

index query execution, where points are randomly selected from the data set (there

is at least one query match). For the multi-attribute test, indexes were selected and

generated with a maximum set size of 2. So a suite of bitcolumns representing 2-

attribute combinations is available for query execution as well the single attribute

bitmaps. Each single attribute is represented in exactly one attribute pair. The

results show signiﬁcant speedup in query execution for point queries. Point queries

over the uniform, landsat, and hep datasets experience speedups of factors of 5.01,

3.23, and 3.17 respectively.

127

Figure 4.8: Total size of the bitmaps involved in answering each query type.

Similar results were achieved for the census data set. Indexes were selected without

a dimensionality limit and with a cardinality limit of 100. The query execution time

using only single attribute bitmaps was 83.01 ms while the query execution time was

only 24.92 ms when both, single attribute and multi-attribute bitmaps were available.

Query Execution Over Range Queries. Figure 4.10 shows the impact of

using the single-attribute bitmaps instead of the proposed model as the queries trend

from being pure point queries to gradually more range-like queries. Average query

execution times over 100 queries are shown. The queries are constructed by making

the matching range of an attribute either a single value, or half of the attribute

cardinality. Whether an attribute is a point or a range is randomly determined. The

x-axis shows the probability that a given pair of attributes in the query represents a

range (e.g .1 means that there is 90% probability that 2 query attributes are both over

single values). The graph shows signiﬁcant speedup for point queries. Although the

128

Figure 4.9: Query Execution Times, Point Queries, Traditional versus Multi-

Attribute Query Processing

speedup ratio decreases as the queries become more range-like, the multi-attribute

alternative maintains a performance advantage until nearly all the query attributes

are over ranges. This is because there are still some cases where using the multi-

attribute bitmap is useful.

Index Size

Figure 4.11 shows bitmap index sizes for each data set. The lower portion of the

bar shows the total size of the bitmaps for all of the single attribute indexes. The

upper portion of the bar shows the size of the suite of 2-attribute bitmaps (each at-

tribute represented exactly once in the suite). The entire bar represents the space

required for both the single and 2-attribute bitmaps. Even though the multi-attribute

129

Figure 4.10: Query Execution Times as Query Ranges Vary, Landsat Data

bitmaps represent many more columns than the single attribute bitmaps(e.g. the sin-

gle dimension landsat are covered by 600 bitcolumns, while its 2-attribute counterpart

is represented by 3000 bitmaps (30 attribute pairs, 100 bitmaps per pair)), the com-

pression associated with the combined bitmaps means that the space overhead for

the multi-attribute representations is not as severe as may be expected.

130

Figure 4.11: Single and Multi-Attribute Bitmap Index Size Comparison

Bitmap Operations versus Bitmap Lengths

Since query execution enhancement is derived from two sources, reduction in

bitmap operations and reduction in bitmap length, it is valuable to know the rel-

ative impacts of each one. To this end, we tested query execution after performing

index selection using diﬀerent strategies.

Table 4.3 shows the query speedup for point queries using diﬀerent strategies to

determine the attributes that should be combined together. The speedup is measured

over 100 random point queries taken from HEP data. For each listed strategy, a suite

of size 2 attribute sets is determined diﬀerently. Randomly selected attributes simply

combines random pairs of attributes. The other 2 strategies use Algorithm 3 to

generate goodness factors for each combination based on bitmap length saved and

weighted by result density. The goodness using best greedily selects non-intersecting

131

Attribute Combination Strategy Query Speedup

1 Algorithm 1, Selecting Best

Combinations

3.15

2 Randomly Selected At-

tributes

2.79

3 Algorithm 1 Selecting Worst

Beneﬁcial Combinations

2.67

Table 4.3: Query Speedup, HEP Data, Using Diﬀerent Attribute Combination Strate-

gies

pairs starting from the top of the list, while the other technique greedily selects

non-overlapping pairs from the bottom. The results show that signiﬁcant speedup

can be achieved simply by combining reasonable attributes together, and that query

execution can be further enhanced by careful selection of the indexes.

4.5 Conclusion

In this paper, we propose the use of multi-attribute bitmap indexes to speed up

the query execution time of point and range queries over traditional bitmap indexes.

We introduce an index selection technique in order to determine which attributes

are appropriate to combine as multi-attribute bitmaps. The technique is driven by a

performance metric that is tied to actual query execution time. It takes into account

data distribution and attribute correlations. It can also be tuned to avoid undesirable

conditions such as an excessive number of bitmaps or excessive bitmap dimensionality.

We introduce a query execution model when both traditional and multi-attribute

bitmaps are available to answer a query. The query execution takes advantage of the

multi-attribute bitmaps when it is beneﬁcial to do so while introducing very little

overhead for cases when such bitmaps are not useful. Utilizing our index selection

132

and query execution techniques, we were able to consistently achieve up to 5 times

query execution speedup for point queries while introducing less than 1% overhead

for those queries where multi-attribute indexing was not useful.

133

CHAPTER 5

Conclusions

Data retrieval for a given application is characterized by the query types, query

patterns over time, and the distribution and characteristics of the data itself. In order

to maximize query performance, the query processing and access structures should

be tailored to match the characteristics of the application.

The optimization query framework presented herein describes an I/O-optimal

query processing technique to process any query within the broad class of queries

that can be stated as a convex optimization query. For such applications, the query

processing follows a access structure traversal path that matches the function being

optimized while respecting additional problem constraints.

While resolving queries, the possible query execution plans evaluated by a data

management system are limited by available indexes. The online high-dimensional in-

dex recommendation system provides a cost-based technique to identify access struc-

tures that will be beneﬁcial for query processing over a query workload. It considers

both the query patterns and the data distribution to determine the beneﬁt of diﬀerent

alternatives. Since workloads can shift over time, a method to determine when the

current index set is no longer eﬀective is also provided.

134

The multi-attribute bitmap indexes presents both query processing and index

selection for the traditionally single attribute bitmap index. Query processing selects

a query execution plan to reduce the overall number of operations. The index selection

task ﬁnds attribute combinations that lead to the greatest expected query execution

improvement.

The primary end goal of the research presented in this thesis is to improve query

execution time. Both the run-time query processing and design-time access methods

are targeted as potential sources of query time improvement. By ensuring that the

run-time traversal and processing of available access structures matches the charac-

teristic of a given query and that the access structures available match the overall

query workload and data patterns, improved query execution time is achieved.

135

BIBLIOGRAPHY

[1] S. Agrawal, S. Chaudhuri, L. Koll´ar, A. P. Marathe, V. R. Narasayya, and

M. Syamala. Database tuning advisor for microsoft sql server 2005. In VLDB,

pages 1110–1121, 2004.

[2] G. Antoshenkov. Byte-aligned bitmap compression. In Data Compression Con-

ference, Nashua, NH, 1995. Oracle Corp.

[3] E. Barucci, R. Pinzani, and R. Sprugnoli. Optimal selection of secondary indexes.

IEEE Transactions on Software Engineering, 1990.

[4] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University

Press, 1961.

[5] S. Berchtold, C. B¨ohm, D. A. Keim, and H.-P. Kriegel. A cost model for nearest

neighbor search in high-dimensional data space. PODS, pages 78–86, 1997.

[6] S. Berchtold, D. Keim, and H. Kriegel. The x-tree: An index structure for high-

dimensional data. In Proceedings of the International Conference on Very Large

Data Bases, pages 28–39, 1996.

[7] A. Blum and P. Langley. Selection of relevant features and examples in machine

learning. AI, 1997.

[8] C. Bohm. A cost model for query processing in high dimensional data spaces.

ACM Trans. Database Syst., 25(2):129–178, 2000.

[9] C. Bohm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces:

Index structures for improving the performance of multimedia databases. ACM

Comput. Surv., 33(3):322–373, 2001.

[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University

Press, New York, NY. USA, 2004.

[11] N. Bruno and S. Chaudhuri. To tune or not to tune? a lightweight physical

design alerter. In VLDB, pages 499–510, 2006.

136

[12] G. Canahuate, M. Gibas, and H. Ferhatosmanoglu. Update conscious bitmap

indices. In SSDBM, 2007.

[13] A. Capara, M. Fischetti, and D. Maio. Exact and approximate algorithms for

the index selection problem in physical database design. IEEE Transactions on

Knowledge and Data Engineering, 1995.

[14] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and R. J. Smith. The

onion technique: indexing for linear optimization queries. ACM SIGMOD, 2000.

[15] S. Chaudhuri and V. Narasayya. Autoadmin ’what-if’ index analysis utility. In

Proceedings ACM SIG-MOD Conference, pages 367–378, 1998.

[16] S. Chaudhuri and V. R. Narasayya. An eﬃcient cost-driven index selection tool

for microsoft SQL server. In The VLDB Journal, pages 146–155, 1997.

[17] S. Choenni, H. Blanken, and T. Chang. On the selection of secondary indexes

in relational databases. Data and Knowledge Engineering, 1993.

[18] R. L. D. C. Costa and S. Lifschitz. Index self-tuning with agent-based databases.

In XXVIII Latin-American Conference on Informatics (CLIE), Montevideo,

Uruguay, 2002.

[19] A. Dogac, A. Y. Erisik, and A. Ikinci. An automated index selection tool for

oracle7: Maestro 7. Technical Report LBNL/PUB-3161, TUBITAK Software

Research and Development Center, 1994.

[20] H. Edelstein. Faster data warehouses. Information Week, December 1995.

[21] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middle-

ware. Journal of Computer and System Sciences, 2003.

[22] C. Faloutsos, T. Sellis, and N. Roussopoulos. Analysis of object oriented spatial

access methods. In SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD in-

ternational conference on Management of data, pages 426–439, New York, NY,

USA, 1987. ACM Press.

[23] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. E. Abbadi. Vector approx-

imation based indexing for non-uniform high dimensional data sets. In CIKM

’00.

[24] M. Frank, E. Omiecinski, and S. Navathe. Adaptive and automated index selec-

tion in rdbms. In International Conference on Extending Database Technologhy

(EDBT), Vienna, Austria, 1992.

137

[25] V. Gaede and O. Gunther. Multidimensional access methods. ACM Comput.

Surv., 30(2):170–231, 1998.

[26] A. Gionis, P. Indyk, and R. Motwani. Similarity searching in high dimensions

via hashing. In Proceedings of the Int. Conf. on Very Large Data Bases, pages

518–529, Edinburgh, Sootland, UK, September 1999.

[27] M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. Kluwer (Non-

convex Optimization and its Applications series), Dordrecht, 2005. In press.

[28] C.-W. C. Guang-Ho Cha. The gc-tree: a high-dimensional index structure

for similarity search in image databases. IEEE Transactions on MultiMedia,

4(2):235–247, June 2002.

[29] I. Guyon and A. Elissef. An introduction to variable and feature selection. Jour-

nal of Machine Learning Research, 2003.

[30] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate gen-

eration. In W. Chen, J. Naughton, and P. A. Bernstein, editors, 2000 ACM

SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, 05

2000.

[31] G. R. Hjaltason and H. Samet. Ranking in spatial databases. In Symposium on

Large Spatial Databases, pages 83–95, 1995.

[32] V. Hristidis, N. Koudas, and Y. Papakonstantinou. PREFER: A system for the

eﬃcient execution of multi-parametric ranked queries. In SIGMOD Conference,

2001.

[33] I. Inc. Informix decision support indexing for the enterprise data warehouse.

http://www.informix.com/informix/corpinfo/- zines/whiteidx.htm.

[34] M. Ip, L. Saxton, and V. Raghavan. On the selection of an optimal set of indexes.

IEEE Transactions on Software Engineering, 1983.

[35] S. Kai-Uwe, E. Schallehn, and I. Geist. Autonomous query-driven index tun-

ing. In International Database Engineering & Applications Symposium, Coimbra,

Portugal, 2004.

[36] R. Kohavi and G. John. Wrappers for feature subset selection. AI, 1997.

[37] F. Korn and S. Muthukrishnan. Inﬂuence sets based on reverse nearest neighbor

queries. In SIGMOD, 2000.

[38] M. S. Lobo. Robust and Convex Optimization with Applications in Finance. PhD

thesis, The Department of Electrical Engineering, Stanford University, March

2000.

138

[39] B. S. Manjunath. Airphoto dataset. http://vivaldi.ece.ucsb.edu/Manjunath/research.htm,

May 2000.

[40] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of

image data. IEEE Transactions on Pattern Analysis and Machine Intelligence,

18(8):837–842, August 1996.

[41] R. P. Mount. The Oﬃce of Science Data-Management Challenge. DOE Oﬃce

of Advanced Scientiﬁc Computing Research, March–May 2004.

[42] Y. Nesterov and A. Nemirovsky. A general approach to polynomial-time algo-

rithms design for convex programming. Technical report, Centr. Econ. & Math.

Inst., USSR Acad. Sci., Moscow, USSR, 1988.

[43] Y. Nesterov and A. Nemirovsky. Interior-Point Polynomial Algorithms in Convex

Programming. SIAM, Philadelphia, PA 19101, USA, 1994.

[44] J. Pei, J. Han, and R. Mao. CLOSET: An eﬃcient algorithm for mining frequent

closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining

and Knowledge Discovery, pages 21–30, 2000.

[45] S. Ponce, P. M. Vila, and R. Hersch. Indexing and selection of data items in

huge data sets by constructing and accessing tag collections. In Proceedings of

the 19th IEEE Symposium on Mass Storage Systems and Tenth Goddard Conf.

on Mass Storage Systems and Technologies, 2002.

[46] M. Ramakrishna. In Indexing Goes a New Direction., volume 2, page 70, 1999.

[47] U. M. L. Repository. The us census1990 data set.

http://kdd.ics.uci.edu/databases/census1990/USCensus1990-desc.html.

[48] N. Roussopoulos, S. Kelly, and F. Vincent. Nearest neighbor queries. In SIG-

MOD, 1995.

[49] A. Shoshani, L. Bernardo, H. Nordberg, D. Rotem, and A. Sim. Multi-

dimensional indexing and query coordination for tertiary storage management.

In Proceedings of the 11th International Conference on Scientiﬁc and Statistical

Data(SSDBM), 1999.

[50] R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes for scientiﬁc data.

ACM Trans. Database Syst., 32(3):16, 2007.

[51] M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with prob-

abilistic guarantees. VLDB, 2004.

139

[52] G. Valentin, M. Zuliani, D. Zilio, G. Lohman, and A. Skelley. Db2 advisor: An

optimizer smart enough to recommend its own indexes. In Proceedings Interna-

tional Conference Data Engineering, 2000.

[53] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance

study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.

[54] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance

study for similarity-search methods in high-dimensional spaces. In Proceedings

of the 24th International Conference on Very Large Databases, pages 194–205,

1998.

[55] K. Whang. Index selection in relational databases. In International Conference

on Foundations on Data Organization (FODO), Kyoto, Japan, 1985.

[56] K. Wu, W. Koegler, J. Chen, and A. Shoshani. Using bitmap index for interactive

exploration of large datasets. In Proceedings of SSDBM, 2003.

[57] K. Wu, E. Otoo, and A. Shoshani. Compressing bitmap indexes for faster search

operations. In SSDBM, 2002.

[58] K. Wu, E. Otoo, and A. Shoshani. On the performance of bitmap indices for high

cardinality attributes. Lawrence Berkeley National Laboratory. Paper LBNL-

54673., March 2004.

[59] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indexes with eﬃcient

compression. ACM Transactions on Database Systems, 2006.

[60] Z. Zhang, S. Hwang, K. C.-C. Chang, M. Wang, C. Lang, and Y. Chang. Boolean

+ ranking: Querying a database by k-constrained optimization. In SIGMOD,

2006.

140

ABSTRACT

The proliferation of massive data sets across many domains and the need to gain meaningful insights from these data sets highlight the need for advanced data retrieval techniques. Because I/O cost dominates the time required to answer a query, sequentially scanning the data and evaluating each data object against query criteria is not an eﬀective option for large data sets. An eﬀective solution should require reading as small a subset of the data as possible and should be able to address general query types. Access structures built over single attributes also may not be eﬀective because 1) they may not yield performance that is comparable to that achievable by an access structure that prunes results over multiple attributes simultaneously and 2) they may not be appropriate for queries with results dependent on functions involving multiple attributes. Indexing a large number of dimensions is also not eﬀective, because either too many subspaces must be explored or the index structure becomes too sparse at high dimensionalities. The key is to ﬁnd solutions that allow for much of the search space to be pruned while avoiding this ‘curse of dimensionality’. This thesis pursues query performance enhancement using two primary means 1) processing the query eﬀectively based on the characteristics of the query itself and 2) physically organizing access to data based on query patterns and data characteristics. Query performance enhancements are described in the context of several novel applications

ii

including 1) Optimization Queries, which presents an I/O-optimal technique to answer queries when the objective is to maximize or minimize some function over the data attributes, 2) High-Dimensional Index Selection, which oﬀers a cost-based approach to recommend a set of low dimensional indexes to eﬀectively address a set of queries, and 3) Multi-Attribute Bitmap Indexes, which describes extensions to a traditionally single-attribute query processing and access structure framework that enables improved query performance.

iii

ACKNOWLEDGMENTS

First of all, I would like to thank my wife, Moni Wood, for her love and support. She has made many sacriﬁces and expended much eﬀort to help make this dream a reality. I would not have been to complete this dissertation without her. I also would not have been able to accomplish this without the help of my family. I thank my mother Marilyn, my sister Susan and my grandmother Margaret for their support. I thank my late father Murray for providing an example to which I have always aspired. Moni’s family was also understanding and supportive of the travails of Ph.D research. I want to express my appreciation to my adviser, Hakan Ferhatosmanoglu, for his guidance and encouragement over the years. He has been generous with his time, eﬀort, and advice during my graduate studies. I am thankful for the technical discussions we have had and the opportunities he has given me to work on interesting projects. Many people in the department have helped me to get to this point. I speciﬁcally want to thank my dissertation committee members, Dr. Atanas Rountev and Dr. Hui Fang, for their eﬀorts and valuable comments on my research. I wish to thank Dr. Bruce Weide for his help and support during my studies, Dr. Rountev for my early exposure to research and paper writing, and Dr. Jeﬀ Daniels for giving me the opportunity to work on an interdisciplinary geo-physics project. iv

Many friends and members of the Database Research Group provided valuable guidance, discussions, and critiques over the past several years. I want to especially thank Guadalupe Canahuate for her help and friendship during this time. Also thanks to her family for providing enough Dominican coﬀee to get through all the paper deadlines. My Ph.D. research was supported by NSF grant NSF IIS -0546713.

v

VITA

May 10, 1969 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Omaha, Nebraska 1993 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Electrical Engineering, Ohio State University, Columbus Ohio 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Computer Science and Engineering, Ohio State University, Columbus Ohio 1989-1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Electrical Engineering Co-op, British Petroleum, Various Locations 1994-2002 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Project Manager, Acquisition Logistics Engineering, Worthington, Ohio 2003-present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Assistant, Oﬃce Of Research, Department of Computer Science and Engineering, Department of Geology Ohio State University, Columbus, Ohio

PUBLICATIONS

Research Publications Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Online Index Recommendations for High-Dimensional Databases Using Query Workloads. IEEE Transactions on Knowledge and Data Engineering (TDKE) Journal, February 2008 (Vol. 20, no.2), pp. 246-260

vi

Indexing Incomplete Databases. Developing Indices for High Dimensional Databases Using Query Access Patterns. International Conference on Scientiﬁc and Statistical Database Management (SSDBM). 14 pp. Germany. OSU-CISRC-4/06-TR45. ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’04). Hakan Ferhatosmanoglu. Encyclopedia of Geographical Information Science. LexisNexis. September. 2006/March. Michael Gibas. International Conference on Extending Database Technology (EDBT). Hakan Ferhatosmanoglu. 2005/October Atanas Rountev. Austria. Banﬀ. pp. Michael Gibas. Vienna. 15 pp. Guadalupe Canahuate. Dayton. Hakan Ferhatosmanoglu. Ohio State Technical Report. 33rd International Conference on Very Large Data Bases (VLDB 07). Fast Semantic-based Similarity Searches in Text Repositories. Ohio State Technical Report. (E-GIS) Michael Gibas. 2007. Indexes for Databases with Missing Data. Munich. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE’04). Canada. A General Framework for Modeling and Processing Optimization Queries. OSUCISRC-10/07-TR74. Masters Thesis. Ning Zheng. June 2004 Atanas Rountev. pages 14-16. Tan Apaydin. Michael Gibas. Update Conscious Bitmap Indexes. Hakan Ferhatosmanoglu. pages 1-11. Scott Kagan. Hakan Ferhatosmanoglu. and Michael Gibas. Static and Dynamic Analysis of Call Chains in Java. 2007/July Guadalupe Canahuate. pp. July 2004 Fatih Altiparmak. and Michael Gibas. and Hakan Ferhatosmanoglu. Michael Gibas. Hakan Ferhatosmanoglu. Conference on Using Metadata to Manage Unstructured Text. OH. Scott Kagan. Guadalupe Canahuate. 884-901 Michael Gibas. Michael Gibas. June 2004 FIELDS OF STUDY vii . Relationship Preserving Feature Selection for Unlabelled Time-Series.Michael Gibas. 10691080 Guadalupe Canahuate. High-Dimensional Indexing. Evaluating the Imprecision of Static Analysis.

Major Field: Computer Science and Engineering viii .

. . . . . . . . . . . 1. . .3 Non-Covered Access Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgments . . . . . . . . . . ix 2. . . . . . . . . . . .2 Common Query Types Cast into Optimization Query Model 2. . . . . . . . . . . . . . . . . . . . . . . . .2 Query Processing over Non-hierarchical Access Structures . . . . . . . . . Conclusions . .3 Optimization Query Applications . . . . . . . . . . . . . . . . . . . . . . . . . .3. . . . . . . . . . . . . . Related Work . . . . . I/O Optimality . . . . . . . . .4. . . . . . . . . . . . .7 . . . . .4 Approach Overview . . . . .4. . . . . . Query Model Overview . . . . . . . . .3. . . . . . . . 1 1 2 6 7 9 10 10 13 15 16 19 21 26 28 28 33 45 ii iv vi xi Modeling and Processing Optimization Queries . . . . . . . .5 2. . . . 2. . . . . . . . . . Novel Database Management Applications . . . . . . . . . . . . . . . . . . . .4 2. . . . . . . . . . . . . . . . . . . . . . . . . .1 2.1 1. . . . . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . . . . . . . . . . . 2. . . . . . .2 2. . . . . . . . .1 Background Deﬁnitions . . . Introduction .6 2. . . . . . . . . . . . . . . . . Query Processing Framework . . . . . .3. . .2 2. . . . . . . . . . 2. . . . . . . . . List of Figures . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . . .4. . . . . . .TABLE OF CONTENTS Page Abstract . . . . Performance Evaluation . . . . . . . . . . . . . . . . . .3 Introduction . . . . . . . . . . . Overall Technical Motivation . . . . . . . . . . . . . . . . . . . . . Vita . . . . .1 Query Processing over Hierarchical Access Structures . . . . . . . . . . . . . . . . 2. . . Chapters: 1. .3. . .

. . . .5 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . . . . .3 Index Selection . . . . . . . . . . . . . 136 x . . . . . . . . . . . . Proposed Approach . . . . . . . . . . . . . . . . . . . . 4. . . .1 Experimental Setup . . . . .3. . . 3. Empirical Analysis . . . . . . . . . . 4. . .1 High-Dimensional Indexing . . . .5 System Enhancements . . . . . . . . . . . .3. . . . . 3. . . . .2. . . . . . . . . Conclusion . . . . Approach . . . . . . . . . . .4. . 4. . . . . . . .3. . .4. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3. . . . . . . . . . . . . . . . . . .3. . . . . .4 Automatic Index Selection . . . . . . . . . . . . . . . . . . . . . . . . . . .1 3. . . . . . . . .3. . . . . . . . Multi-Attribute Bitmap Indexes . . . 4. . . . . . . . . . Conclusions . . . .3. . . . . . .4 4. . . .4 Proposed Solution for Online Index Selection 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 3. . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . .3 Proposed Solution for Index Selection . . . . .4 3. Background .3 Index Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4. .3 Introduction . . . . . . . . . . . . . . . . 3.1 Problem Statement . . . 3. . . . . 3.4. . . . . . . . . . . 3. .2. . . . . .3. . . . . . . . . . . . . . . . .2 Introduction . . 3. . . . .5 5. . . . . . . . . . . . . . . . . . . . . . .1 4. . . . . . . . . . . . . . . . . . . . 134 Bibliography . . . . . . . . . . .2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 4. . . . . . Related Work . . . . . .2. . . . . . . . . . . . 4. . . . . Experiments . . . . . . . . . . . . .2 Feature Selection . . . 3. . . . . . . . . . . . . . . . . . . .1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries 4. . . . . . . . . . .2 Experimental Results . . . . . . . . . .3. Online Index Recommendations for High-Dimensional Databases using Query Workloads . . . . . . . . . . .2 Experimental Results . . 4. . . . . . . . . . . . . . . . . . . . . . . . .3. . . . 3. 3. 48 48 54 54 55 55 57 57 57 59 61 71 76 78 78 80 95 98 98 101 109 110 115 120 121 121 123 131 3. . . . .1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . .2 Query Execution with Multi-Attribute Bitmaps . . . . . . . . . . . . . .

g1 (α. . . . . . . . . Accesses. . . . . . . 43 44 xi . . g2 (α.4 2. . . A) : y − 6 ≤ 0. . . . . . . . .7 36 38 2. . . . 8-D Color Histogram. . . A) : −y + 3 ≤ 0. . . R*-tree. . . . 8-D Histogram. . . . . . . . . . k = 1 . . .11 Pages Accesses vs. . . . . . . . . . . . . 100k pts Page Accesses. . . . . . . .10 Accesses vs. . . . . . . Random weighted kNN vs kNN. . . . . R*-tree. . . NN Query. . . . . . Constraints. Optimization Query System Functional Block Diagram . . . . . . . Example of Hierarchical Query Processing . . . . . R*-Tree. . . . . . R*-Tree. CPU Time. Optimizing Random Function. . . . . . Index Type. . 60-D Sat. . . . . . 5-D Uniform Random. . Page 12 20 25 2. . . . . .5 Cases of MBRs with respect to R and optimal objective function contour 30 Cases of partitions with respect to R and optimal objective function contour . . . . . . . . 100k Points . . . h1 (β. . . . . . . . . . 32 2. 8-D Histogram. Accesses. . . . . . . . . . . . . . . . A) : x − 5 = 0. . . .6 35 2. . . . .LIST OF FIGURES Figure 2. . . A) : (x − 1)2 + (y − 2)2 . . 100k pts . . . . . . . . .8 2. . .1 Sample Optimization Objective Function and Constraints F (θ. . . . . . . . . . . . . . . Random weighted kNN vs kNN. . . . . . . . VA-File. . . . 50k Points . . .2 2. . . . o(min). . . . . . . 2.9 40 2. . . . .3 2. . . Random weighted kNN vs kNN. . Random Weighted Constrained k-LP Queries. . . . . . . . 100k pts . 8-D Histogram. . 100k pts . . . . .

. . . . .3. . . . . . . . . . . . . . . . . synthetic Query Workload .5 86 3. . . 90 Comparative Cost of Online Indexing as Major Change Threshold Changes. . . . . . . . . Baseline Comparative Cost of Adaptive versus Static Indexing. . . . .3 93 94 A WAH bit vector. . . . . . 106 xii 4. . 2M entries . . . . . . . . . . . synthetic Query Workload . . . Comparative Cost of Online Indexing as Support Changes. . . .2 3. . Point Queries. . . . . . Costs in Data Object Accesses for Ideal. . . . . . . . . . . . . . random Dataset. . . . . . .6 87 3. . . . . . Dynamic Index Analysis Framework . . . . . . . . . . . . .11 Comparative Cost of Online Indexing as Multi-Dimensional Histogram Size Changes. . . . . . . . . . . . . . . . Baseline Comparative Cost of Adaptive versus Static Indexing. . . . . .8 Comparative Cost of Online Indexing as Support Changes. . stock Dataset. . 2M entries versus . . synthetic Query Workload . . . . . . . . stock Dataset. . . random Dataset. . . . . AutoAdmin. . . . . ANDing 6 bitmaps. . . . . . . Naive. . 4. . . . . . . . . . 3. . . . . random Dataset. 3. . . and the Proposed Index Selection Technique using Relaxed Constraints . . . . . . synthetic Query Workload . . . .10 Comparative Cost of Online Indexing as Major Change Threshold Changes. . hr Query Workload . . . .7 89 3. . . . . . . . Baseline Comparative Cost of Adaptive versus Static Indexing. . . . . . . . . . . . hr Query Workload . random Dataset.1 3. . . . . . . . . . . . . . . . .2 4. . . . . . . . . . . . . . . . .4 . . . . . . . . . . . . . . . . . . . . . .3 Index Selection Flowchart . . 103 Query Execution Times vs Total Bitmap Length. . . . . . Sequential Scan. stock Dataset. .4 85 3. . . 105 Query Execution Times vs Number of Attributes queried. . . . 102 Query Execution Example over WAH compressed bit vectors. . . .9 92 3. . hr Query Workload . . . . . . . . . . . . 61 73 81 3. . . . . . clinical Query Workload . .1 4. . . . . . . . . . . . . . . . . . . random Dataset. . . .

.7 4. . . . . . .6 4. . . . . . . . . 125 Number of bitmaps involved in answering each query type. . .4. 128 4. . . . . 129 4. . .10 Query Execution Times as Query Ranges Vary. . . . .5 Simple multiple attribute bitmap example showing the combination two single attribute bitmaps. . . . . . . . . . . . .11 Single and Multi-Attribute Bitmap Index Size Comparison . . .9 4. . . . . 127 Query Execution Times. . . . Landsat Data . 130 xiii . . . . . . . . . . 126 Total size of the bitmaps involved in answering each query type. . . . . Traditional versus Multi-Attribute Query Processing . . . .8 4. . . Point Queries. . . . . . . . . . . . . 108 Query Execution Time for each query type. . . . .

he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search.CHAPTER 1 Introduction If Edison had a needle to ﬁnd in a haystack. .Nikola Tesla. In the case of our proverbial haystack. knowing that a little theory and calculation would have saved him ninety per cent of his labor. we would be much better served by limiting our search to a small region within the haystack. I was a sorry witness of such doings. the query response time during data exploration tasks using a database system is aﬀected both by the physical organization 1 . 1. it is necessary to ﬁnd information that is in some respect interesting within the data. Two primary challenges of this task are specifying the meaning of interesting within the context of an application and searching the data set to ﬁnd the interesting data eﬃciently. we can think of each piece of straw as a data object and needles as a data objects that are interesting.1 Overall Technical Motivation Analogous to our physical haystack example.. 1931 In order to make better use of the massive quantities of data being generated by and for modern applications.. While we could guarantee that we ﬁnd the most interesting needle by examining each and every piece of straw.

As such. it is highly desirable to process generic queries at run-time in a way that is eﬀective for each particular query. 2) data organization such that the access structures available during query resolution are appropriate for the overall query patterns and data distribution and allow for signiﬁcant data space pruning during query resolution. similarity-based analyses and complex mathematical and statistical operators are used to query such data [41]. queries can vary widely on a per query basis. and data constraints 2 .2 Novel Database Management Applications Optimization Queries. This thesis targets query performance enhancement through two sources 1) runtime query processing such that the search paths and intermediate operations during query resolution are appropriate for a given generic query. 1. Besides traditional queries. and business data usually requires iterative analyses that involve sequences of queries with varying parameters. Therefore.of the data and by the way in which the data is accessed. motivated and described in the next section. Additionally. scientiﬁc. Query performance enhancement is explored using three introduced applications. function parametric values. query performance can be enhanced by traversing the data more eﬀectively during query resolution or by organizing access to the data so that it is more suitable for the query. Depending on the application. Exploration of image. the object attributes queried. Physically reorganizing data or data access structures is time consuming and therefore should be performed for a set of queries rather than by trying to optimize an organization for a single query. Many of these data exploration queries can be cast as optimization queries where a linear or non-linear function over object attributes is minimized or maximized (our needle).

The model covers those queries where some function is being minimized or maximized over object attributes. 3 . With respect to the physical organization of the data. (ii) take advantage of query constraints to prune search space. access structures or indexes that cover query attributes can be used to prune much of the data space and greatly reduce the required number of page accesses to answer the query. due to the oft-cited curse of dimensionality. For access structures where the underlying partitions are convex. Because scientiﬁc data exploration and business decision support systems rely heavily on function optimization tasks. the query processing framework is used to access the most promising partitions and data objects and is proven to be I/O-optimal. Such a framework should meet the following requirements: (i) have the expressive power to support solution of each query type. However. indexes built over many dimensions actually perform worse than sequential scan. This problem can be addressed by using sets of lower dimensional indexes that eﬀectively answer a large portion of the incoming queries. High-Dimensional Index Selection.can vary on a per-query basis. This approach can work well if the characteristics of future queries are well known. A general optimization model and query processing framework are proposed for optimization queries. The query processing framework can be used in those cases where the function being optimized and the set of constraints is convex. possibly within a set of constraints. and because such tasks encompass such disparate query types with varying parameters. and (iii) maintain eﬃcient performance when the scoring criteria or optimization function changes. an important and signiﬁcant subset of data exploration tasks. these application domains can beneﬁt from a general model-based optimization framework. but in general this is not the case.

and an index set built based on historical queries may be completely ineﬀective for new queries. A control feedback mechanism is incorporated so that the performance of newly arriving queries can be monitored. and data statistics. This owes to the fact that they are a column-based storage structure and only the data needed to address a query is ever retrieved from disk and that their base operations are performed using hardware supported fast bitwise operations. and is eﬀective in answering queries. Static bitmap indexes have traditionally built over each attribute individually. and if the current index set is no longer eﬀective for the current query patterns. Bitmap indexes have been shown to be eﬀective as an data management system for many domains such a data warehousing and scientiﬁc applications. Historical query patterns are data mined to ﬁnd those attribute sets that are frequently queried together. Multi-Attribute Bitmap Indexes. In the case of physical organization. the index set can be changed. The query model for selection queries is simple and involves ORing together bitcolumns that match query 4 . A set of indexes is obtained that in eﬀect reduces the dimensionality of the query search space. query performance is aﬀected by creating a number of low dimension indexes using the historical and incoming query workloads. This ﬂexible framework that can be tuned to maximize indexing accuracy or indexing speed which allows it to be used for the initial static index recommendation and also online dynamic index recommendation. Potential indexes are evaluated to derive an initial index set recommendation that is eﬀective in reducing query cost for a large portion of queries such that the set of indexes is within some indexing constraint. The nature of the database management system used and the way in which data is physically accessed also aﬀects query execution times.

Methods to determine attribute combination candidates are provided. 5 . the bitmap index structure is limited to those domains that it is naturally suited for without change. A query processing model where multi-attribute bitmaps are available for query resolution is presented. For many queries. As well query resolution does not take advantage of potential multi-attribute pruning and possible reductions in the number of binary operations.criteria within an attribute and ANDing inter-attribute intermediate results. Multi-attribute bitmap indexes are introduced to extend the functionality and applicability of traditional bitmap indexes. As a result. query execution times are improved by reducing size of bitmaps used in query execution and by reducing the number of binary operations required to resolve the query without excessive overhead for queries that do not beneﬁt from multi-attribute pruning.

. While the answer to a given query can be found by computing the function for all data objects and ﬁnding those objects that yield the most extreme values. minimizing the I/O required to answer a query is a primary strategy in query optimization. the one that heralds new discoveries..CHAPTER 2 Modeling and Processing Optimization Queries The most exciting phrase to hear in science. this approach may not be feasible for large data sets. Scientiﬁc discovery and business intelligence activities are frequently exploring for data objects that are in some way special or unexpected. while ensuring that. Given the time expense of I/O. Such data exploration can be performed by iteratively ﬁnding those data objects that minimize or maximize some function over the data attributes. The optimization query framework presented in this chapter provides a method to generically handle an important broad class of queries. given an access structure.Isaac Asimov One challenge associated with data management is determining a cost-eﬀective way to process a given query.’ . 6 . is not ‘Eureka!’ (I found it!) but ‘That’s funny . the query is I/O-optimally processed over that structure.

1 Introduction Motivation and Goal Optimization queries form an important part of realworld database system utilization. Such a framework should meet the following requirements: (i) have the expressive power to support existing query types and the power to formulate new query types. There has been a rich set of literature on processing speciﬁc types of queries. These queries can all be considered to be a type of optimization query with diﬀerent objective functions. (ii) take advantage of query constraints to proactively prune search space. Besides traditional queries. the object attributes queried. Many of these data exploration queries can be cast as optimization queries where a linear or non-linear function over object attributes is minimized or maximized. and data constraints can vary on a per-query basis. function parametric values. and because such tasks encompass such disparate query types with varying parameters. However. similar-ity-based analyses and complex mathematical and statistical operators are used to query such data [41].2. such as the query processing algorithms for Nearest Neighbors (NN) [48] and its variants [37]. scientiﬁc. Because scientiﬁc data exploration and business decision support systems rely heavily on function optimization tasks. 7 . and (iii) maintain eﬃcient performance when the scoring criteria or optimization function changes. these application domains can beneﬁt from a general model-based optimization framework. top-k queries [21. Additionally. especially for information retrieval and data analysis. and business data usually requires iterative analyses that involve sequences of queries with varying parameters. they either only deal with speciﬁc function forms or require strict properties for the query functions. Exploration of image. etc. 51].

3. We propose a generic optimization model to deﬁne a wide variety of query types that includes simple aggregates. Since the queries we are considering do not have a speciﬁc objective function but only have a “model” for the objective (and constraints) with parameters that vary for diﬀerent users/queries/iterations. In conjunction with a technique to model optimization query types. and complex user-deﬁned functional analysis. application of specialized techniques requires that the user know of. • We deﬁne query processing algorithms for convex model-based optimization queries utilizing data/space partitioning index structures without changing the 8 . and apply the appropriate tools for the speciﬁc query type. a uniﬁed query technique for a general optimization model has not been addressed in the literature. Furthermore.Despite an extensive list of techniques designed speciﬁcally for diﬀerent types of optimization queries. Contributions The primary contributions of this work are listed as follows. • We propose a general framework to execute convex model-based optimization queries (using linear and non-linear functions) and for which additional constraints over the attributes can be speciﬁed without compromising query accuracy or performance. possess. By applying this single optimization query model with I/O-optimal query processing. Example optimization query applications are provided in Section 2. we also propose a query processing technique that optimally accesses data and space partitioning structures to answer the query.3. we refer to them as model-based optimization queries. range and similarity queries. we can provide the user with a ﬂexible and powerful method to deﬁne and execute arbitrary optimization queries eﬃciently.

• We introduce a generic model and query processing framework that optimally addresses existing optimization query types studied in the research literature as well as new types of queries with important applications. It is also not applicable to constrained linear queries because convex hulls need to be recomputed for each query with a diﬀerent set of constraints. Query 9 . 2. The access structure is a sorted list of records for an arbitrary linear function. The solution followed in the Onion technique is based on constructing convex hulls over the data set. • We prove the I/O optimality for these types of queries for data and space partitioning techniques when the objective function is convex and the feasible region deﬁned by the constraints is convex (these queries are deﬁned as CP queries in this paper). It is infeasible for high dimensional data because the computation of convex hulls is exponential with respect to the number of dimensions and is extremely slow. The computation of convex hulls over the whole data set is expensive and in the case of high-dimensional data.index structure properties and the ability to eﬃciently address the query types for which the structures were designed. The Prefer technique [32] is used for unconstrained and linear top-k query types. An example linear model query selects the best school by ranking the schools according to some linear function of the attributes of each school. One is the “Onion technique” [14]. which deals with an unconstrained and linear query model.2 Related Work There are few published techniques that cover some instances of model-based optimization queries. infeasible in design time.

3. an . However. our technique organizes data retrieval to answer more general queries. A). . and requires no additional storage space.1 Query Model Overview Background Deﬁnitions Let Re(a1 . A). . A) be certain functions over attributes in A. θ2 .processing is performed for some new linear preference function by scanning the sorted list until a record is reached such that no record below it could match the new query. 1 ≤ i ≤ u. Boolean + Ranking [60] oﬀers a technique to query a database to optimize boolean constrained ranking functions. addresses only single dimension access structures (i. In contrast with the other techniques listed. Let gi (α. . . θm ) is a vector of parameters for function F . Denote by A a subset of attributes of Re. . a2 . . it is limited to boolean rather than general convex constraints. can not optimize on multiple dimensions simultaneously). a2 . but does not provide a guarantee on minimum I/O. our proposed solution is proven to be I/O-optimal for both hierarchical and space partitioned access structures with convex partition boundaries. assume all attributes have the same domain. . . and therefore largely uses heuristics to determine a cost eﬀective search strategy. and α and β are vector parameters for the constraint functions. covers a broader spectrum of optimization queries. 2. While the Onion and Prefer techniques organize data to eﬃciently answer a speciﬁc type of query (LP). Without loss of generality. hj (β.e. . This avoids the costly build time associated with the Onion technique. . where u and v are positive integers and θ = (θ1 . 1 ≤ j ≤ v and F (θ. and still does not incorporate constraints. .3 2. 10 . . an ) be a relation with attributes a1 .

11 . A) ≤ 0. Without loss of generality. 1 ≤ j ≤ v (if not empty). (ii) a set of constraints (possibly empty) speciﬁed by inequality constraints gi (α. The constraints limit the feasible region to the line passing through x = 5. NN queries over an attribute set A can be considered as model-based optimization queries with F (θ. a model-based optimization query Q is given by (i) an objective function F (θ. The answer of a model-based optimization query is a set of tuples with maximum cardinality k satisfying all the constraints such that there exists no other tuple with smaller (if minimization) or larger (if maximization) objective function value F satisfying all constraints as well. Similarly.2). a2 . an ). 1 ≤ i ≤ u (if not empty) and equality constraints hj (β.. . For instance.g. The objective function ﬁnds the nearest neighbor to point a (1. This deﬁnition is very general. Euclidean) and the optimization objective is minimization. with optimization objective o (min or max). where 3 ≤ y ≤ 6.Deﬁnition 1 (Model-Based Optimization Query) Given a relation Re(a1 .1 shows a sample objective function with both inequality and equality constraints. and answer set size integer k. we will consider minimization as the objective optimization throughout the rest of this chapter. . Figure 2. . weighted NN queries and linear/non-linear optimization queries can all be considered as speciﬁc modelbased optimization queries. top-k queries. A) = 0. The points b and d represent the best and worst point within the constraints for the objective function respectively. and almost any type of query can be considered as a special case of model-based optimization query. constraint parameters α and β. . Point c could be the actual point in the database that meets the constraints and minimizes the objective function. and (iii) user/query adjustable objective parameters θ. A) as the distance function (e. A) .

(ii) and (iii) form a well known type of problem in Operations Research literature Convex Optimization problems. The conditions (i). The only assumptions are that the queries can be formulated into convex functions and the feasible regions deﬁned by the constraints of the queries are convex (e. o(min). users can ask any form of queries with any coeﬃcients as long as these assumptions hold.g. A) : (x − 1)2 + (y − 2)2 . A) : y − 6 ≤ 0. polyhedron regions). A function is convex if it is second diﬀerentiable and 12 . A). A) is convex. A) : x − 5 = 0. A) : −y + 3 ≤ 0.. Notice that the deﬁnition of a CP query does not have any assumptions on the speciﬁc form of the objective function.1: Sample Optimization Objective Function and Constraints F (θ. 1 ≤ i ≤ u (if not empty) are convex. h1 (β. and (iii) hj (β. 1 ≤ j ≤ v (if not empty) are linear or aﬃne. k=1 Deﬁnition 2 (Convex Optimization (CP) Query) A model-based optimization query Q is a Convex Optimization (CP) query if (i) F (θ. g1 (α. (ii) gi (α. A). g2 (α.Figure 2. Therefore.

top-k queries. 2. a2 . . linear programming problems (LP) and convex quadratic programming problems (QP) [27]. wn > 0 are the weights and can be diﬀerent for each query. . .The objective function of a Linear Optimization query is L(a1 . . gi and hj . .its Hessian matrix is positive deﬁnite. a0 . The formulation of some common query types as CP optimization queries are listed and discussed below. Traditional Nearest Neighbor queries are the subset of weighted nearest neighbor queries where the weights are all equal to 1. Linear Optimization Queries . if one travels in a straight line from inside a convex region to outside the region. linear optimization queries. More intuitively. . . . + wm (an − a0 )2 . . Therefore. an ) such that an objective function is minimized.3. . . including NN queries. . an ) = w1 (a1 − a0 )2 + w2 (a2 − a0 )2 + . . 1 2 n where w1 . .The Weighted Nearest Neighbor query asking NN for a point (a0 .2 Common Query Types Cast into Optimization Query Model Although appearing to be restricted in functions F . . a0 ) can be considered as an optimization n 1 2 query asking for a point (a1 . w2 . Euclidean Weighted Nearest Neighbor Queries . . . . it is not possible to re-enter the convex region. . For Euclidean WNN the objective function is WNN(a1 . a2 . + cn an 13 . . the set of CP problems is a superset of all least square problems. . and angular similarity queries. . an ) = c1 a1 + c2 a2 + . they cover a wide variety of linear and non-linear queries. . . a2 .

i = 1. where C is any constant. an ) s. . an ) = C.A range query asks for all data (a1 .t. Therefore. . . Points that are not within the constraints are pruned by those constraints.where (c1 . The common score functions are sum and average. but could be any arbitrary function over the attributes. Constructing an objective function as f (a1 . all form the solution set of the range query. . and constraints gi = li − ai <= 0. . .The objective functions of top-k queries are score (or ranking) functions over the set of attributes. n Any points that are within the constraints will get the objective value C. c2 . Clearly. 2. cn ) ∈ Rn are coeﬃcients and can be diﬀerent for diﬀerent queries. . and the constraints form a polyhedron. . . sum and average are special cases of linear optimization queries. a2 . where [li . gn+i = ai − ui <= 0. but also ‘irregular’ range queries such as l ≤ a1 + a2 ≤ u and l ≤ 3a1 − 2a2 ≤ u 14 . ui ]n is the range for dimension i. . Range Queries . Since linear functions are convex and polyhedrons are convex. the top-k query is a CP query. Those points that remain all have an objective value of C and because they all have the same maximum(minimum) objective value. . li ≤ ai ≤ ui . · · · . a2 . range queries can be formed as special cases of CP queries with constant objective functions applying the range limits as constraints. linear optimization queries are also special cases of CP queries with a parametric function form. for some (or all) dimensions i. Top-k Queries . . If the ranking function in question is convex. Also note that our model can not only deal with the ‘traditional’ hyper-rectangular range queries as described above.

In such a case. Some hard exclusion criteria is known and some attributes are more important than others with respect to estimating participation probability and suitability. kNN with Adaptive Scoring . we will change the nearest neighbor scoring weights dependent on the current population set. and solved using our framework.3 Optimization Query Applications Any problem that can be rewritten as a convex optimization problem can be cast to our model. In a case where we want half of the participants to be male and 15 .2.3.In some situations we may have an objective function that changes based on the characteristics of a population set we are trying to ﬁll. We provide a short list of potential optimization query applications that could be relevant during scientiﬁc data exploration or business decision support where the problem being optimized is continuously evolving. Weighted Constrained Nearest Neighbors . As an example. This type of query would be applicable in situations where attribute distances vary in importance and some constraints to the result set are known. Weighted Constrained Nearest Neighbors (WCNN) ﬁnd the closest weighted distance object that exists within some constrained space.Weighted nearest neighbor queries give the ability to assign diﬀerent weights for diﬀerent attribute distances for nearest neighbor queries. A potential application is clinical trial patient matching where an administrator is looking to ﬁnd the best candidate patients to ﬁll a clinical trial. this generic framework can be used to solve the problem by eﬃciently accessing available access structures. In cases where the optimization problem or objective function parameters are not known in advance. again consider the patient clinical trial matching application where we want to maintain a certain statistical distribution over the trial participants.

the data attributes may be subject to lower and upper bounds. functional attributes. For example. Queries over Changing Attributes . constraints. 2. All of the discussed applications can be stated as an objective function with a objective of maximization or minimization. Weights. and optimization functions themselves can all change on a per-query basis. We can solve a continuous CP problem over a convex partition of space by optimizing the function of interest within the 16 . This demonstrates the power and ﬂexibility that the user has to deﬁne data exploration queries and the examples represent a small subset of the query types that are possible. The solution can optionally be subject to some convex criteria over the data attributes. The goal in CP is optimize (minimize or maximize) some convex function over data attributes. In the examples listed above.4 Approach Overview We propose the use of Convex Optimization (CP) in order to traverse access structures to ﬁnd optimal data objects. we can adjust weights of the objective optimization function to increase the likelihood that future trial candidates will match the currently underrepresented gender. each query or each member of a set of queries can be rewritten as an optimization query in our model. a series of optimization queries may search a stock database for maximum stock gains over diﬀerent time intervals.The attributes involved in optimization queries can vary based on the iteration of the query.3. A database system that can eﬀectively handle the potential variations in optimization queries will beneﬁt data exploration tasks. The attributes involved in each query will be diﬀerent. Additionally.half female.

the partition is found to be infeasible for the problem. Most access structures are built over convex partitions.constraints of that space. Particularly common partitions are Minimum Bounding Rectangles (MBRs). Given the function. and a set of partitions. 17 . partition constraints. Rectangular partitions can be represented by lower and upper bounds over data attributes and can be directly addressed by the CP problem bounds. Given a function. The solution of these CP problems allows us to prune partitions and search paths that can not contain an optimal answer during query processing. If no solution exists (problem constraints and partition constraints do not intersect). The partition and its children can be eliminated from further consideration. model constraints. The access structure partitions to search and evaluate during query processing are determined by solving CP problems that incorporate the objective function. problem constraints. If the problem under analysis has constraints in addition to the constraints imposed by the partition boundaries. the partition the yields the best CP result with respect to the function under analysis is the one that contains some space that optimizes the problem under analysis better than the other partitions. Other convex partitions can be addressed with data attribute constraints (such as x + y < 5). we can use CP to ﬁnd the optimal answer for the function within the intersection of the problem constraints and partition constraints. they can also be added to the CP problem as constraints. With our technique we endeavor to minimize the I/O operations and access structure search computations required to perform arbitrary model-based optimization queries. These partition functional values can be used to order partitions according to how promising they are with respect to optimizing the function within problem constraints. and problem constraints. and partition constraints.

which can be applied to existing space or data partitioning-based indices without losing the capabilities of those indexing techniques to answer other types of queries. We focus on query processing techniques by presenting a technique for general CP objective functions. The basic idea of our technique is to divide the query processing into two phases: First.e. we found very little time diﬀerence in the functions we examined. CP problems can be so eﬃciently solved that Boyd and Vandenberghe stated “if you formulate a practical problem as a convex optimization problem. (i. Details for our technique for diﬀerent index types are provided in the following section.” ([10] page 8). 18 . search through the index by pruning the impossible partitions and access the most promising pages. Current implementations of CP algorithms can solve problems with hundreds of variables and constraints in very short times on a personal computer [38]. Second.The CP literature discusses the eﬃciency of CP problem solution. solve the CP problem without considering the database. We are in eﬀect replacing the bound computations from existing access structure distance optimization algorithms with a CP problem that covers the objective function and constraints. then you have solved the original problem. Although such computations can be more complex. It can be solved in polynomial time in the number of dimensions in the objective functions and the number of constraints ([43]). In [42]. it is stated that a CP problem can be very eﬃciently solved with polynomial-time algorithms. The readers can refer to [43] for a detailed review on interior-point polynomial algorithms for CP problems. in continuous space). This “relaxed” optimization problem can be solved eﬃciently in the main memory using many possible algorithms in the convex optimization literature.

there exists a lower bound for each partition in the database that can be used to prune the data space before retrieving pages from disk. By solving CP subproblems using index partition boundaries as additional constraints. partition constraints.2. Figure 2. and objective function. A lightweight query compiler validates the user input.2 is a functional block diagram of the optimization query processing system that shows interactions between components. The bounds can be deﬁned for any type of convex partition (such as minimum bounding regions) used for indexing the data. we ﬁnd the best possible objective function value achievable by a point in the partition. The query processor invokes a CP solver to ﬁnd bounds for partitions that could potentially contain answers to the query. converts the function and constraints into appropriate format for the query processor. The generic query processing framework for a CP query is described below. we can ensure that we access pages in the order of how promising they are. Thus. The user gets back those data object(s) in the database that answer the query within the constraints. This framework can be applied to indexing structures in which the partitions on which they are built are convex because the partition boundaries themselves are used to form the new CP problems. Query processing proceeds ﬁrst by the user providing an objective function and constraints to the system. and executes the appropriate query processing algorithm using the speciﬁed access structure. A popular taxonomy of access structures divides the structures into the categories of hierarchical and non-hierarchical structures.4 Query Processing Framework By solving CP problems associated with the problem constraints. Hierarchical structures are usually 19 .

20 .2. “worst” or upper bounds) in place of minimums. We terminate when we come to a partition that could not contain an optimal answer to the query.4.Figure 2. The overall idea is to traverse the access structure and solve the problem in sequential order of how good partitions and search paths could be. Solutions to maximization problems can be performed by using maximums (i.1 and non-hierarchical structures in 2.2: Optimization Query System Functional Block Diagram partitioned based on data objects while non-hierarchical structures are usually based on partitioning space.4.e. The implementations described address a minimization objective function and prune based on comparison to minimum values. Therefore we provide algorithms for hierarchical structures in Subsection 2. These two families have important diﬀerences that aﬀect how queries are processed over them.

BSP-tree. If the user desires only the single best result.. i.1 Query Processing over Hierarchical Access Structures Algorithm 1 shows the query processing over hierarchical access structures. P-tree. if all objects that meet the range constraints are desired.4. 21 .5 we show that both of these algorithms are in fact optimal. A partial list of structures covered by this algorithm includes the R-tree. Only the data points within the constrained region will be returned. Now.e. Oct-tree. For range queries. k = 1. B-tree. X-tree. SS-tree. This algorithm is applicable to any hierarchical access structure where the partitions at each tree level are convex. LSD-tree. k-d-B-tree. R+-tree. Buddy-tree. Quad-tree. R*-tree. be an arbitrary value in order to return the k best data objects. GBD-tree. In Section 2. 9]. Extended k-DTree. SKD-tree. Therefore. and SR-tree [25.The number of results that are returned by the query processing is dependent on an input k. the number of data points. It can however. we describe the implementations of this framework based on hierarchical and non-hierarchical indexing techniques. B+-tree. and only the points that are within the lowest-level partitions that intersect the constrained region will be examined during query processing. 2. the value of k should be set to n. the query processing will be at least as I/O-eﬃcient as standard range query processing over a given access structure. no unnecessary MBRs (or partitions) given an access structure are accessed during the query processing.

rather than distance. If a point is not within the original problem constraints. our algorithm also prunes out any partitions and search paths that do not fall within problem constraints. The algorithm takes as input the convex optimization function of interest OF . If this partition is a leaf partition. In line 5.The algorithm is analogous to existing optimal NN processing [31]. we initialize the structures that we maintain throughout a query. then it will not have a feasible solution and will be discarded from further consideration. If the partition under analysis is not a leaf. we remove the ﬁrst promising partition. and a desired result set size k. In lines 1-3. we solve the CP problem corresponding to the objective function. Points with objective function values lower than the objective function value of the k th best point so far will cause the best point and best point objective function value vectors to be updated accordingly. We keep a vector of the k best points found so far as well as a vector of the objective function values for these points. original problem constraints. Additionally. original problem constraints. We keep a worklist of promising partitions along with the minimum objective value for the partition. We initialize the worklist to contain the tree root partition. the set of convex constraints C. we ﬁnd its child partitions. and partition constraints (line 11). For each of these child partitions. and the constraints imposed by the partition boundaries. except that partitions are traversed in order of promise with respect to an arbitrary convex function. then we access the points in the partition and compute their objective function values. This yields 22 . We solve a new CP problem for each partition that we traverse using the objective function. We then process entries from the worklist until the worklist is empty.

the ith best point found so far F (b[i]) . If the minimum objective function value is less than the k th best objective function value (i.F (b[k]) ← +∞ L ← (root) while L = ∅ remove ﬁrst partition P from L if P is a leaf access and compute functions for points in P within constraints C update vectors b and F if better points discovered else for each child Pchild of P minObjV al = CPSolve(OF ..Pchild .CPQueryHierarchical (Objective Function OF .C) if minObjV al < F (b[k]) insert Pchild into L according to minObjV al end for end if prune partitions from L with minObjV al > F (b[k]) end while return vector b Algorithm 1: I/O Optimal Query Processing for Hierarchical Index Structures. it is possible for a point within the intersection of the problem constraints and partition constraints to 23 ... minObjVal) pairs 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: b[1].k) notes: b[i] .b[k] ←null.e.. Constraints C.the objective function value of the ith best point L . the best possible objective function value achievable within the intersection of the original problem constraints and partition constraints.worklist of (partition. F (b[1]).

Point B. If there is no feasible solution for the partition within problem constraints. We initialize our worklist with the root and start by processing it.a. We do not process A further. then the partition is inserted into the worklist according to its minimum objective value. Query Processing Example As an example of query processing. B. and B.2.2. We process B.3. constrained 1-NN query). consider Figure 2.2.3 with an optimization objective of minimizing the distance to the query point within the constrained region (i. After a partition is processed.2.2 and since it is a leaf.a is the closer of the two.2. B.1. Since point e is closest.1. Since B is not a leaf.beat the k th best point). we may have updated our k th best objective function value. and f respectively. Since it is not a leaf. we process it next. and B. we prune it from the 24 . B. and the best point objective function is updated.3.e. we examine its child partitions. We solve for the minimum objective function value for A and do not obtain a feasible solution because A and the constrained region do not intersect. the partition will be dropped from consideration. We prune any partitions in the worklist that can not beat this value. We insert partition B into our worklist and since it is the only partition in the worklist. This point is designated as the best point thus far. we know we can do no worse than the distance from the query point to B.3 can not do better than this. After processing B. point c.2. we compute objective function values for the points in B. We ﬁnd the minimum objective function for partition B and get the distance from the query point to the point where the constrained region and the partition B intersect. Since partition B. We ﬁnd minimum objective function values for each and obtain the distances from the query point to point d. we insert partitions into the worklist in the order B. e. we examine its child partitions A and B.

a. We update the best point to be B. We examine only points in partitions that could contain points as good as the best solution.1.worklist.1.3: Example of Hierarchical Query Processing 25 .1. Point B. Since the worklist is now empty.a. Point B.1. we have completed the query and return the best point.a gets a value which is better than the current best point.1. Figure 2. We examine points in B. The optimal point for this optimization query this query is B.b is not a feasible solution (it is not in the constrained region) so it is dropped.

Grid-File. Those partitions that do not intersect the constraints will be bypassed. EXCELL. VA+-File. and result set size k as inputs. we show that the general framework for CP queries can also be implemented on top of non-hierarchical structures. Algorithm 2 can be applied to non-hierarchical access structures where the partitions are convex. 23]. but we use the objective function rather than a distance and also consider original problem constraints to ﬁnd lower bounds. OF . In stage 2. we process the unpruned partitions by accessing objects contained by the ﬁrst partition and computing actual objective values.4. For those that do have a feasible solution. This technique is similar to the technique identiﬁed in [53] for ﬁnding NN for a non-hierarchical structure. VA-File. 9.2 Query Processing over Non-hierarchical Access Structures In this section. 53. and Pyramid Technique [25. and continue accessing points contained by subsequent partitions until a partition’s lower bound is greater than the k th best value.2. By accessing partitions 26 . we place unique unpruned partitions in increasing sorted order based on their lower bound. We maintain a sorted list of the top-k objects we have found so far. We use a two stage approach detailed in Algorithm 2 to determine which data objects optimize the query within the constraints. the lower bound of the CP problem is recorded. In stage 1. We take the objective function. A partial list of structures where the algorithm can be applied is Linear Hashing. we process each populated partition vin the structure by solving the CP problem that corresponds to the original objective function and constraints combined with the additional constraints imposed by the partition boundaries (line 5). At the end of stage 1. We initialize the structures to be used during query processing (lines 1-3). original problem constraints C.

we access only those points in cell representations that could contain points as good as the actual solution.b[k] ←null.in order of how good they could be. minObjV al) into L according to minObjV al Stage 2: 8: for i = 1 to |L| 9: if minObjV al of partition L[i] > F (b[k]) 10: break 11: Access objects that partition L[i] contains 12: for each object o 13: if OF (o) < F (b[k]) 14: insert o into b 15: update F (b) Algorithm 2: I/O Optimal Query Processing over Non-hierarchical Structures. 2: F (b[1]). P . Constraints C.the objective function value of the ith best point L .F (b[k]) ← +∞ 3: L ← ∅ Stage 1: 4: for each populated partition P 5: minObjV al = CPSolve(OF .. CPQueryNonHierarchical (Objective Function OF . according to the CP problem.the ith best point found so far F (b[i]) .worklist of (partition.. 27 . C) 6: if minObjV al < +∞ 7: insert (P .. minObjVal) pairs Initialize: 1: b[1].k) notes: b[i] ..

3 Non-Covered Access Structures A review of access structure surveys yielded few structures that are not covered by the algorithms. space ﬁlling curves which are used for approximate ordering). We denote the objective function contour which goes through the actual optimal data point in the database as the optimal contour. 9].5 I/O Optimality In this section.e. 2. we deﬁne the optimality as follows. 28 . They include structures that contain non-convex regions (BANG-File. Since attribute information is lost. we use the optimal contour. hB-tree. The k th optimal contour would pass through all points in continuous space that yield the same objective function value as the k th best point in the database. These include structures that either do not guarantee spatial locality (i. we will show that the proposed query processing technique achieves optimal I/O for each implementation in Section 2. BV-tree) [25.4.2. they can not be applied to general functions over the original attributes). Deﬁnition 3 (Optimality) An algorithm for optimization queries is optimal iﬀ it retrieves only the pages that intersect the intersection of the constrained region R and the optimal contour. They include structures built over values computed for a speciﬁc function (such as distance in Mtrees. Note that M-trees would still work to answer queries for the functions they are built over and the structures built with non-convex leaves could still be optimally traversed down to the level of non-convexity. Following the traditional deﬁnition in [5]. This contour also goes through all other points in continuous space that yield the same objective function value as the optimal point. but could replace them with the k th optimal contour.4. In the following arguments.

i. It is easy to see that any partition intersecting the intersection of optimal contour and feasible region R will be accessed since they are not pruned in any phase of the algorithm. Figure 2. a is an optimal data point in the data base.4 shows diﬀerent cases of the various position relations between a partition. D: intersects the optimal contour and R. We will use minObjV al(P artition. R) to denote the minimum objective function value that is computed for the partition and constraints R. the proof is given in the following. Lemma 1 The proposed query processing algorithm is I/O optimal for CP queries under convex hierarchical structures.For hierarchical tree based query processing. 29 . We need to prove only these partitions are accessed.e. Proof. C: intersects the optimal contour but not R. A: intersects neither R nor the optimal contour. R and the optimal contour under 2 dimensions.. other partitions will be pruned without accessing the pages. B: intersects R but not the optimal contour. E: intersects the intersection of the optimal contour and R. but not the intersection of optimal contour and R.

4: Cases of MBRs with respect to R and optimal objective function contour A and C will be pruned since they are infeasible regions. Let CP0 be the data partition that contains an optimal data point. . ≥ minObjV al(CPk . CP1 . We shall show that a page in case D is pruned by the algorithm. and CPk be the root partition that contains CP0 . CPk−1 . R) ≥ minObjV al(CP1 . . B is pruned because minObjV al(B.Figure 2. .. we have F (a) ≥ minObjV al(CP0 . when the CP for these partitions are solved. they will be eliminated from consideration since they do not intersect the constrained region and minObjV al(A. CP1 be the partition that contains CP0 . . R) = minObjV al(C. R) > F (a) and E will be accessed earlier than B in the algorithm because minObjV al(E. R) < minObjV al(B. R) ≥ . Because the objective function is convex. R) 30 . . R) = +∞. . R).

R) is larger than constrained function value of any partition containing the optimal data point. R) > F (a) ≥ minObjV al(CP0 .4. Partitions A. whether it contains data or other partitions. B. D. This only can occur after CP0 has been accessed because minObjV al(D.and minObjV al(D. unless that MBR could contain an optimal answer. This contradicts with the assumption that D will be accessed. If CP0 is accessed earlier than D. Similarly. If D will be accessed at some point during the search. however. then D must be the ﬁrst in the MBR list at some time. R) ≥ minObjV al(CP1 . Lemma 2 The proposed query processing algorithm is I/O optimal for CP queries over non-hierarchical convex structures. that is. and so on. the algorithm is also optimal in terms of the access structure traversal. we will never retrieve the objects within an MBR (leaf or non-leaf). 31 . R) During the search process of our algorithm. . R) ≥ . . Figure 2. However. we can prove the I/O optimality of proposed algorithm over non-hierarchical structures. ≥ minObjV al(CPk . and E are the same as deﬁned for Figure 2. including D. The I/O-optimality relates to accessing data from leaf partitions. the algorithm prunes all partitions which have a constrained function value larger than F (a). and CPk−1 by CPk−2 . Proof. C. CPk is replaced by CPk−1 . until CP0 is accessed.5 shows diﬀerent possible cases of the partition position. The proof applies to any partition.

and the best data point found in E will prune partition B. Then an optimal algorithm retrieves only partition E. y). for a query F (x.Figure 2. a.5: Cases of partitions with respect to R and optimal objective function contour Without loss of generality. and partition E contains an optimal data point. we only need to show that partition D will not be retrieved. R)). assume that only these ﬁve partitions contain some data points. Partition B will not be retrieved because partition E will be retrieved before partition B (because minObjV al(E. 32 . R) < minObjV al(B. Thus. Partitions C and A will not be retrieved because they are infeasible and our algorithm will never retrieve infeasible partitions.

i. the virtual solution corresponding to (minObjV al(D. F (a) ≥ minObjV al(D.Because partition D does not intersect the intersection of the optimal contour and R but E does. Suppose partition D is also retrieved later. R). Weighted Constrained kNN. Hence. This contradicts the assumption that partition D does not intersect the intersection of R and the optimal contour.000 33 . Therefore. a 64-dimensional color image histogram dataset of 100. we compare cpu times against a non-general optimization technique. R) > minObjV al(E. This means partition D cannot be pruned by data point a.e. and the data point a is found at that time. R)) is contained by the optimal contour (because of the convexity of function F ). 2. R). We index and perform que-ries over four real data sets.6 Performance Evaluation The goals of the experiments are to show i) that the proposed framework is general and can be used to process a variety of queries that optimize some objective ii) that the generality of the framework does not impose signiﬁcant cpu burden over those queries that already have an optimal solution. One of these is Color Histogram data. and Constrained Weighted Linear Optimization. which also means the virtual solution corresponding to (minObjV al(D. partition E must be retrieved before partition D since the algorithm always retrieves the most promising partition ﬁrst. Therefore. we can ﬁnd at least one virtual point x∗ (without considering the database) in partition D such that x∗ is both in R and the optimal contour. We performed experiments for a variety of optimization query types including kNN. and iii) that non-relevant data subspace can be eﬀectively pruned during query processing. minObjV al(D. Weighted kNN. R)) ∈ D ∩ R∩optimal contour. Where possible..

unweighted nearest neighbor queries. as the weights of a weighted nearest neighbor query become more skewed. our algorithm matches the I/O-optimal access oﬀered by traditional R-tree nearest neighbor branch and bound searching. For unconstrained. 34 . the queries are weighted nearest neighbor queries. Weighted kNN Queries Figure 2. We execute 100 queries for each trial. We vary the value of k between 1 and 100. the optimal contour is a hyper-ellipse.data points. These datasets are widely used for performance evaluation of index structures and similarity search algorithms [39.000 60-dimensional vectors representing texture feature vectors of Landsat images [40]. the optimal contour is a hyper-sphere around the query point with a radius equal to the distance of the nearest point in the data set. We assume the index structure is in memory and we measure the number of additional page accesses required to access points that could be kNN according to the query processing. Intuitively. If instead. The ﬁgure shows that we achieve nearly the same I/O performance for weighted kNN queries using our query processing algorithm over the same index that is appropriate for traditional non-weighted kNN queries. The vectors represent color histograms computed from a commercial CD-ROM. and randomly assign weights between 0 and 1 for the k-WNN queries. 26]. The second is Satellite Image Texture (Landsat) data. which consists of 100.6 shows the performance of our query processing over R*-trees for the ﬁrst 8-dimensions of the histogram data for both kNN and k weighted NN (k-WNN) queries. Because the weights are 1 for the kNN queries. We used a clinical trials patient data set and a stock price data set to perform real world optimization queries. The index is built over the ﬁrst 8 dimensions of the histogram data and the queries are generated to ﬁnd the kNN or k-WNN to a random point in the data space over the same 8 dimensions.

we could generate an index based on attribute values transformed based on the weights and yield an optimal contour that was hyper-spherical. R*-tree. this would require generating an index for every potential set of query weights. 100k pts the hyper-volume enclosed by the optimal contour increases. Random weighted kNN vs kNN. We could then traverse the access structure using traditional nearest neighbor techniques. However. and the opportunity to intersect with access structure partitions increases.6: Accesses.Figure 2. This indicates that these weighted functions do not cause the optimal contour to cross signiﬁcantly more access structure partitions than their non-weighted counterparts. 8-D Histogram. For a weighted nearest neighbor query. Results for the weighted queries track very closely to the non-weighted queries. which is not practical for applications where weights are not typically known prior to the 35 .

7 shows the average cpu times required per query to traverse the access structure using convex problem solutions for the weighted kNN queries compared to the standard nearest neighbor traversal used for unweighted kNN queries. For similar applications where query weights can vary. Random weighted kNN vs kNN. 8-D Histogram. we would be better served to process the query over a single reasonable access structure in an I/O-optimal way than by building speciﬁc access structures for a speciﬁc set of weights.7: CPU Time. R*-tree. Figure 2. Figure 2. and these weights are not known in advance. 100k pts 36 .query.

Part of this diﬀerence is due to the slightly more complicated objective function (weighted versus unweighted Euclidean distance).000 objects. the same characteristics that make high-dimensional R-tree type structures ineﬀective for traditional queries also make them ineﬀective for optimization queries. the probability that a point’s objective function value can prune other partitions decreases.8 shows page access results for kNN and k-WNN queries over 60-dimensional. If the dataspace were more uniformly populated. Figure 2. The experiments assume the VA-ﬁle is in memory and measure the number of objects that need to be accessed to answer the query. For this reason. much of the data in this data set is clustered. while the remaining part is due to incorporating data set constraints (unused in this example) into computing minimum partition function values. 5-bit per attribute. the partitions overlap with each other more. part is due to the additional partitions that the optimal contour crosses. Similar to the R*-tree case. However. As data set dimensionality increases. As partitions overlap more. and some vector representations are heavily 37 . Therefore. we would expect to see object accesses much closer to k. Hierarchical data partitioning access structures are not eﬀective for high-dimensional data sets. VA-ﬁle built over the landsat dataset. the results show that the performance of the algorithm is not signiﬁcantly aﬀected by re-weighting the objective function. we perform similar experiments for high-dimensional data using VA-ﬁles.The ﬁgure shows that the generic CP-driven solution takes slightly longer to perform than a specialized solution for nearest neighbor queries alone. Sequential scan would result in examining 100.

For example. a computation needs to occur for every approximated object. it can not overcome the inherent weaknesses of that structure. 60-D Sat.Figure 2. 100k pts populated. A consequence of VA-File access structures is that. at high dimensionality. It should be noted that while the framework results in optimal access of a given structure. Times 38 . If a best answer comes from one of these vector representations. hierarchical data structures are ineﬀective. without some other data reorganization. VA-File. and the framework can not overcome this. As distance computations need to be performed for each vector in traditional VA-ﬁle nearest neighbor algorithms.8: Accesses. we need to look at each object assigned the same vector representation. following the VA-ﬁle processing model so too must a CP problem be solved for each vector for general optimization queries. Random weighted kNN vs kNN.

The graph shows interesting results. we access fewer partitions when this particular problem is less constrained. Weights are uniformly randomly generated and are between the values of -10 and 10. corner [0. and the maximum constrained value for the negatively weighted attributes will yield the optimal result within the constrained region (e. We vary k and we vary the constrained area region ﬁxed at the origin as a 8-D hypercube with the indicated value. It takes longer to perform each query but the ratio of the general optimization query times to the traditional kNN CPU times is similar to the R*-tree case. the corner corresponding to the minimum constraint value for the positively weighted attributes.e.+]).g.0] would be an optimal minimization answer for a constrained unit cube and a linear function with a weight vector [+. When weights can be negative. However. Figure 2. The number of page accesses is not directly related to the constraint size.for this experiment reﬂect this fact. leading to denser packing of access structure partitions near the origin. Constrained Weighted Linear Optimization Queries We also explore Constrained Weighted-Linear Optimization queries. For this data set. Because of the sparser density of partitions around the point that optimizes the query.-. The rate of increase in page accesses to accommodate diﬀerent values of k does vary with the constraint size. the radius of the k th -optimal contour increases more in these less dense regions than 39 . We generated 100 queries for each data point in the form of a summation of weighted attributes over the 8 dimensions.006 (i. about 1. very little computational burden is added to allow query generality). many points are clustered around the origin.9 shows the number of page accesses required to answer randomly weighted LP minimization queries for the histogram data set using the 8-D R*-Tree access structure.1.

R*-Tree. This means that there are less than k objects in our constrained region. leading to a faster increase in the number of page accesses required when k increases. and it. the sequential scan of this data set is 834 pages. 100k pts more clustered regions. For our query processing.Figure 2. we only need to examine those points that are in partitions 40 . For comparative purposes.9: Page Accesses. We should also note what happens when there are less than k optimal answers in the data set. as well as the other competing techniques are not appropriate for convex constrained areas. Note that the Onion technique is not feasible for this dataset dimensionality. 8-D Color Histogram. Random Weighted Constrained k-LP Queries.

It corresponds to a vector access ratio of 0. Unlike nearest neighbor and linear optimization type queries.49x2 − 0. randomly weighted queries.51x5 .09x1 − 0.0007.28x3 − 0. A sample function. We also wanted to explore that constrained optimization problems for real applications. Functions contain both quadratic and linear terms and were generated over 5 dimensions in the form 5 i=1 (random() ∗ x2 − random() ∗ xi ). is to minimize 0. we achieved an average of 11.5 vector accesses to ﬁnd the k = 10 results.49x2 +0. the continuous optimal solution is not intuitively obvious by examining the function. This means that on average the number of vectors within the intersection of the constrained space and 10th -optimal contour is 11. Using the VA-ﬁle structure with 5 bits assigned per attribute and 100 randomly selected. 4 5 41 .91x4 − 0. we generated a set of 100 random functions. We performed Weighted Constrained kNN experiments to emulate a patient matching data exploration task described in Section 2. Arbitrary Functions over Diﬀerent Access Structures In order to show that the framework can be used to ﬁnd optimal data points for arbitrary convex functions. where coeﬃcients are rounded to two decimal places. A technique that did not evaluate constraints prior to or during query processing will need to examine every point. To test this. we performed a number of queries over a set of 17000 real patient records.3.76x2 − 0. We established constraints on age and gender and queried to ﬁnd the 10 weighted nearest neighbors in the data set according to 7 measured blood analytes.that intersect the constrained region. i where xi is the data attribute value for the ith dimension.3.55x2 + 0. Queries with Real Constraints The previous example showed results with synthetically generated constraints.4x2 + 1 2 3 0.91x2 +0.5.

10 shows results over the 100 functions.0025 for all the access structures. the average ratio of the data points accessed is below 0. The R*-tree looks to avoid some of the partition overlaps produced by the R-tree insertion heuristics and it shows better performance. The VAM-split R*-tree bulk loads the data into the 42 . the R*-tree. Even the worst performing structure over the worst case function still prunes over 99% of the search space. and these optimal answers were scattered throughout the data space. We explored 2 non-hierarchical structures. we modiﬁed code to include an optimization function that uses CP to traverse the index structure using the algorithm appropriate for the structure. Each structure performed well for these optimization queries. It shows the minimum. and the VA-File 32000. Figure 2. maximum. For each of these index types. We added a new set of 50000 uniform random data points within the 5-dimension hypercube and built each of the index structures over this data set. the grid ﬁle and the VAﬁle. the grid ﬁle 16000. Results between hierarchical access structures are as expected. These include 3 hierarchical structures. including the R-tree. For each structure we try to keep the number partitions relatively close to each other. For the hierarchical structures these are leaf partitions and for the non-hierarchical structures they are grid cells. We ran the set of random functions over each structure and measured the number of objects in the partitions that we retrieve. and average object access ratio to guarantee discovery of the optimal answer over the set of functions. The number of partitions for the hierarchical structures is between 12000 and 13000. Each index yielded the same correct optimal data point for each function.We explored the results of our technique over 5 index structures. and the VAM-split bulk loaded R*-tree.

5-D Uniform Random. Optimizing Random Function. Index Type. Incorporating Problem Constraints during Query Processing The proposed framework processes queries in conjunction with any set of convex problem constraints. this yields better performance than the dynamically constructed R*-tree. For a constrained NN type query we could process the query according to traditional NN techniques and when we ﬁnd a result.10: Accesses vs. and therefore achieves better access ratios. check it against the constraints until we ﬁnd a point that meets the 43 . We compare the proposed framework to alternative methods for handling constraints in optimization query processing. 50k Points index and can avoid more partition overlaps than an index built using dynamic insertion.Figure 2. As expected. The VA-File has greater resolution than the grid-ﬁle in this particular test.

Figure 2. R*-Tree. and we can terminate the search.11 shows the number of pages accessed using our technique compared to these two alternatives as the constrained area varies. and then perform distance computations on the result set and select the best one.constraints. the more likely a discovered NN will fall within the constraints. As the problem becomes less constrained. we could perform a query on the constraints themselves. NN Query.11: Pages Accesses vs. Figure 2. The Selectivity line is a 44 . Constraints. 100k Points In the ﬁgure. Constrained kNN queries were performed over the 8-D R*-tree built for the Color Histogram data set. and then check if the result meets the constraints. the R*-tree line shows the number of page accesses required when we use traditional NN techniques. 8-D Histogram. Alternatively.

The CP line shows the results for our framework. 2. query constraints are convex. Our search drills down to the leaf partition that either contains the optimal answer within the problem constraints. or yields the best potential answer within constraints. We provide speciﬁc implementations of the processing framework for hierarchical data partitioning and non-hierarchical space partitioning access structures. We experimentally show that the framework can be used to answer a number of diﬀerent types of optimization queries including nearest neighbor queries.lower bound on the number of pages that would be read to obtain the points within the constraints. we prune search paths that can not lead to feasible solutions.7 Conclusions In this chapter we present a general framework to model optimization queries. as the problem becomes more constrained. We present a query processing framework to answer optimization queries in which the optimization function is convex. We demonstrate better results with respect to page accesses except in cases where the constraints are very small (and we can prune much of the space by performing a query over the constraints ﬁrst) or when the NN problem is not constrained (where our framework reduces to the traditional unconstrained NN problem). Clearly. linear function optimization. When there are problem constraints. which processes original problem constraints as part of the search process and does not further process any partitions that do not intersect constraints. and non-linear function optimization queries. 45 . and the underlying access structure is made up of convex partitions. the fewer pages will need to be read to ﬁnd potential answers. We prove the I/O optimality of these implementations.

the performance will not be as good. This means that given an access structure we will only read points from partitions that have minimum objective function values lower than the k th actual best answer. Because the framework is built to best utilize the access structure. and partitions and search paths that do not intersect with problem constraints are pruned during the access structure traversal. If the optimal contour and problem constraints can be isolated to a few leaf partitions. it captures the beneﬁts as well as the ﬂaws of the underlying structure. The proposed framework oﬀers I/O-optimal access of whatever access structure is used for the query. Furthermore. it demonstrates how well the particular access structure isolates the optimal contour within the problem constraints to access structure partitions.The ability to handle general optimization queries within a single framework comes at the price of increasing the computational complexity when traversing the access structure. Since constraints are handled during the processing of partitions in our framework. This results in a signiﬁcant reduction in required accesses compared to alternative methods in the cases where the inclusion of constraints makes these alternatives less eﬀective. the access structure will yield nice results. if the optimal contour crosses many partitions. As such. we only access points in partitions that could potentially have better answers than the actual optimal database answers. However. Essentially. We show a slight increase in CPU times in order to perform convex programming based traversal of access structures in comparison with equivalent nonweighted and unconstrained versions of the same problem. we do not need to examine points that do not intersect with the constraints. the framework can be used to measure page access performance associated with using diﬀerent indexes and index types to 46 .

answer certain classes of optimization queries. The system does not need to compute a queried function for all data objects in order to ﬁnd the optimal answers. The technique provides optimization of arbitrary convex functions. weights. Database researchers and administrators can use this technique as a benchmarking tool to evaluate the performance of a wide range of index structures available in the literature. one does not need to know the objective function. This makes the framework appropriate for applications and domains where a number of diﬀerent functions are being optimized or when optimization is being performed over diﬀerent constrained regions and the exact query parameters are not known in advance. To use this framework. in order to determine which structures can most eﬀectively answer the optimization query type. or constraints in advance. 47 . and does not incur a signiﬁcant penalty in order to provide this generality.

.CHAPTER 3 Online Index Recommendations for High-Dimensional Databases using Query Workloads When solving problems. rendering statically determined access structures ineﬀective. access structures that are appropriate for the query must be available to the query processor at query run-time. In order to eﬀectively prune search space during query resolution. 3. Appropriate access structures should match well with query selection criteria and allow for a signiﬁcant reduction in the data search space. such as business data warehouses and scientiﬁc data repositories.1 Introduction An increasing number of database applications. a method for monitoring performance and recommending access structure changes is provided. This chapter describes a framework to determine a meaningful set of appropriate access structures for a set of queries considering query cost savings estimated from both query attributes and data characteristics. Since query patterns can change over time. dig at the roots instead of just hacking at the leaves. As the number of dimensions/attributes and overall size of data sets increase. it becomes essential to eﬃciently retrieve speciﬁc queried data from the database in order to eﬀectively utilize 48 . deal with high dimensional data sets. D’Angelo.Anthony J.

Another possibility would be to build an index over each single dimension. To answer a query. Range queries are consistently executed over ﬁve of the attributes. the answer set contains about 10 results). Multi-dimensional indexing. The query selectivity over each attribute is 0. However.1. However.000 data objects with several hundred attributes. We could build a multi-dimensional index over the data set so that we can directly answer any query by only using the index. each of these potential solutions has inherent problems. Indexing support is needed to eﬀectively prune out signiﬁcant portions of the data set that are not relevant for the queries. so the overall query selectivity is 1/105 (i.000. The eﬀectiveness of this approach is limited to the amount of search space that can be pruned by a single dimension (in the example the search space would only be pruned to 100. for high-dimensional data sets. performance of multi-dimensional index structures is subject to Bellman’s curse of dimensionality [4] and degrades rapidly as the number of dimensions increases. As the dimensionality of the index increases.000 objects). For the given example. data is placed in a partition that contains the data point and could overlap with other partitons. consider a uniformly distributed data set of 1. such an index would perform much worse than sequential scan. the overlaps between partitions increase and 49 . all potentially matching search paths must be explored. To illustrate these problems.e. and RDBMS index selection tools all could be applied to the problem. dimensionality reduction. An ideal solution would allow us to read from disk only those pages that contain matching answers to the query.the database. For data-partitioning indexes such as the R-tree family of indexes.

where partitions do not overlap and data points are associated with cells which contain them (e. For spacepartitioning structures.at high enough dimensions the entire data space needs to be explored. A 100 dimensional index with only a single split per attribute results in 2100 cells. they also select attributes based on data statistics to support classiﬁcation accuracy rather than focusing on query performance and workload in a database domain. As well. Commercial RDBMS’s have developed index recommendation systems to identify indexes that will work well for a given workload. 50 . However. Therefore. and transform the query in the same way that the data was transformed. traditional feature selection techniques are based on selecting attributes that yield the best classiﬁcation capabilities. They also introduce a signiﬁcant overhead in the processing of queries.g. the selected features may oﬀer little or no data pruning capability given query attributes. Another possible solution is to apply feature selection to keep the most important attributes of the data according to some criteria and index the reduced dimensionality space. These tools are optimized for the domains for which these systems are primarily employed and the indexes that the systems provide. the problem is the exponential explosion of the number of cells. The index structure can become intractably large just to identify the cells. grid ﬁles). They are targeted towards lower dimension transactional databases and do not produce results optimized for single high-dimensional tables. the dimensionality reduction approaches are mostly based on data statistics. However. Another possible solution would be to use some dimensionality reduction technique. index the reduced dimension data space. This phenomenon is thoroughly described in [54]. and perform poorly especially when the data is not highly correlated.

the search criterion is expected to consist of 10 to 30 parameters. However. only a small subset of the overall data dimensions are popular for a majority of queries and recurring patterns of dimensions queried occur. rather than maintaining data energy as is the case for traditional approaches. Their long term eﬀectiveness is dependent on the stability of the query workload. Researchers have proposed workload based index recommendation techniques. However. and results in multiple and potentially overlapping sets of dimensions. Each such collision generates on the order of 1-10MBs of raw data.Our approach is based on the observation that in many high dimensional database applications. rather than a single set. Large Hadron Collider (LHC) experiments are expected to generate data with up to 500 attributes at the rate of 20 to 40 per second [45]. forcing their collision. Query pattern evolution over time presents another challenging problem. The queries are predominantly range queries and involve mostly around 5 dimensions out of a total of 200. This approach is also analogous to dimensionality reduction or feature selection with the novelty that the reduction is speciﬁcally designed for reducing query response times. For example. Our reduction considers both data and access patterns. We address the high dimensional database indexing problem by selecting a set of lower dimensional indexes based on joint consideration of query patterns and data statistics. Another example is High Energy Physics (HEP) experiments [49] where sub-atomic particles are accelerated to nearly the speed of light. query access patterns may change over time becoming completely dissimilar from the 51 . The new set of low dimensional indexes is designed to address a large portion of expected queries and allow eﬀective pruning of the data space to answer those queries. which corresponds to 300TBs of data per year consisting of 100-500 million objects.

for each query. diﬀerent database uses at diﬀerent times of the month or day). or the construction of an entirely new index set is beneﬁcial. For this reason. To make this approach practical in the presence of query pattern change. a change in the focus of user knowledge discovery (e. When the current query patterns are substantially different from the query patterns used to recommend the database indexes. or simply random variation of query attributes. Pattern change could be the result of periodic time variation (e. Because of the need to proactively monitor query patterns and query performance quickly. We generate this abstract representation of the query workload by mining patterns in the workload. the index set should evolve with the query patterns. the system performance will degrade drastically since incoming queries do not beneﬁt from the existing indexes. current events cause an increase in queries for certain search attributes).g. To estimate the query cost. a change in the popularity of a search attribute (e. The query workload representation consists of a set of attribute sets that occur frequently over the entire query set that have non-empty intersections with the attributes of the query.g. the index selection technique we have developed uses an abstract representation of the query workload and the data set that can be adjusted to yield faster analysis. the data set is represented by a multi-dimensional histogram where each unique value represents an approximation of data and contains a count of the number of 52 .patterns on which the index set were originally determined. a researcher discovery spawns new query patterns). the replacement of an existing index. we introduce a dynamic mechanism to detect when the access patterns have changed enough that either the introduction of a new index. There are many common reasons why query patterns change.g.

Introduction of a ﬂexible index selection technique designed for high-dimensional data sets that uses an abstract representation of the data set and query workload. The resolution of the abstract representation can be tuned to achieve either a high ratio of index-covered queries for static index selection or fast index selection to facilitate online index selection. 2. The contributions of this work can be summarized as follows: 1. Analysis speed and granularity is aﬀected by tuning the resolution of the abstract representations. This process is iterated until some indexing constraint is met or no further improvement is achieved by adding additional indexes. The number of potential indexes considered is aﬀected by adjusting data mining support level. As new queries arrive. we monitor the ratio of potential performance to actual performance of the system in terms of cost and based on the parameters set for the control feedback loops. the estimated cost of using that index for the query is computed. Introduction of a technique using control feedback to monitor when online query access patterns change and to recommend index set changes for highdimensional data sets. we make major or minor changes to the recommended index set. we propose a control feedback system with two loops. a ﬁne grain control loop and a coarse control loop.records that match that approximation. Initial index selection occurs by traversing the query workload representation and determining which frequently occurring attribute set results in the greatest beneﬁt over the entire query set. 53 . For each possible index for each query. The size of the multi-dimensional histogram aﬀects the accuracy of the cost estimates associated with using an index for a query. In order to facilitate online index selection.

We conclude in Section 3. It takes into consideration both data and query characteristics and can be applied to perform real-time index recommendations for evolving query patterns. Experimental analysis showing the eﬀects of varying abstract representation parameters on static and online index selection performance and showing the eﬀects of varying control feedback parameters on change detection response. Section 3.2. Our work diﬀers from the related index selection work in that we provide a index selection framework that can be tuned for speed or accuracy.2 Related Work High-dimensional indexing.3 explains our proposed index selection and control feedback framework. Section 3.5. and the GC-tree [28]. 3. Section 3. 3. Presentation of a novel data quantization technique optimized for query workloads. each of these methods provides less than ideal solutions for the problem of fast high-dimensional data access.2 presents the related work in this area. 4. Our technique is optimized to take advantage of multi-dimensional pruning oﬀered by multi-dimensional index structures.3. and DBMS index selection tools are possible alternatives for addressing the problem of answering queries over a subspace of high-dimensional data sets. The rest of the chapter is organized as follows. feature selection.1 High-Dimensional Indexing A number of techniques have been introduced to address the high-dimensional indexing problem.4 presents the empirical analysis. such as the X-tree [6]. As described below. While these index 54 .

2. almost every commercial Relational Database Management System (RDBMS) provides the users with an index recommendation tool based on a query workload and using the query optimizer to obtain cost estimates. The query workload should be a good representative of the types of queries an application supports. Then a candidate index selection step evaluates the queries in a given query workload and eliminates from consideration those candidate 55 . 17. they still suﬀer performance degradation at higher index dimensionality. A query workload is a set of SQL data manipulation statements. 24. selected features may have no overlap with queried attributes. an iterative process is utilized to ﬁnd an optimal conﬁguration. These techniques are also focused on maximizing data energy or classiﬁcation accuracy rather than query response.3 Index Selection The index selection problem has been identiﬁed as a variation of the Knapsack Problem and several papers proposed designs for index recommendations [34. 3. 16. 3. Currently. single dimension candidate indexes are chosen.structures have been shown to increase the range of eﬀective dimensionality.2 Feature Selection Feature selection techniques [7. Microsoft SQL Server’s AutoAdmin tool [15. 36. 1] selects a set of indexes for use with a speciﬁc data set given a query workload. These earlier designs could not take advantage of modern database systems’ query optimizer. 3. As a result. 55. First. In the AutoAdmin algorithm. 13] based on optimization rules. 29] are a subset of dimensionality reduction targeted at ﬁnding a set of untransformed attributes that best represent the overall data set.2.

These index structures can not achieve the same level of result pruning that can be oﬀered by an index technique that indexes multiple dimensions simultaneously (such as R-tree or grid ﬁle). The process is iterated for increasingly wider multi-column indexes until a maximum index width threshold is reached or an iteration yields no improvement in performance over the last iteration. single and multi-level B+-trees are evaluated. In the case of SQL Server. The cost function computed by the query optimizer is used to calculate the beneﬁt of using the index. Costs are estimated using the query optimizer which is limited to considering those physical designs oﬀered by the DBMS. For DB2. 56 .indexes which would provide no useful beneﬁt. Remaining candidate indexes are evaluated in terms of estimated performance improvement and index cost. MAESTRO (METU Automated indEx Selection Tool)[19] was developed on top of Oracle’s DBMS to assist the database administrator in designing a complete set of primary and secondary indexes by considering the index maintenance costs based on the valid SQL statements and their usage statistics automatically derived using SQL Trace Facility during a regular database session. IBM has developed the DB2Adviser [52] which recommends indexes with a method similar to AutoAdmin with the diﬀerence that only one call to the query optimizer is needed since the enumeration algorithm is inside the optimizer itself. As a result the indexes suggested by the tool often do not capture the query performance that could be achieved for multidimensional queries. The SQL statements are classiﬁed by their execution plan and their weights are accumulated.

and the query set Q is a instance DS is a set of n tuples t = {(a1 . do not reﬂect the pruning that could be achieved by indexing multiple dimensions together. In [35] a cost model is used to identify beneﬁcial indexes and decide when to create or drop an index at runtime.1 Approach Problem Statement In this section we deﬁne the problem of index selection for a multi-dimensional space using a query workload. Microsoft Research has proposed a physical design alerter [11] to identify when a modiﬁcation to the physical design could result in improved performance. the }. Q).3.. and Q (the query set) is a set of subsets of DS. 57 .4 Automatic Index Selection The idea of having a database that can tune itself by automatically creating new indexes as the queries arrive has been proposed [35. DS ⊆ D is a ﬁnite subset (the data set).. where d is the dimensionality of the dataset. where D is the domain. DS. ad )|ai ∈ set of range queries. [18] proposes an agent-based database architecture to deal with automatic index creation. In our case. the domain is d . . we deﬁne a workload as follows: Deﬁnition 4 A workload W is a tuple W = (D. More formally.These commercial index selection tools are coupled to physical design options provided by their respective query optimizers and therefore.. 3.3 3. 3.2. a2 . 18]. A query workload W consists of a set of queries that select objects within a speciﬁed subspace in the data domain.

an optional analysis time constraint ta . an optional indexing constraint C. attribute order has no impact on the amount of pruning possible. the time to retrieve the data pages from disk. when no index is present. reduces to scanning all the points in the dataset and testing whether the query conditions are met.e. i. we are aiming to recommend the set of indexes that yields the lowest estimated cost for every query in a workload for any query that can beneﬁt from an index.Finding the answers to a query. either as the minimum number of indexes or a time constraint. In the context of this problem an index is considered to be the set of attributes that can be used to prune subspace simultaneously with respect to each attribute. Our problem can be deﬁned as ﬁnding a set of indexes I. The overall goal of this work is to develop a ﬂexible index selection framework that can be tuned to achieve eﬀective static and online index selection for highdimensionsal data under diﬀerent analysis constraints. For static index selection. Therefore. a query workload W . that provides the best estimated cost over W . when no constraints are speciﬁed. from which the queries 58 . The assumption is that the time spend doing I/O dominates the time required to perform the simple bound comparisons. given a multi-dimensional dataset DS. The index can identify a smaller set potential matching objects and only those data pages containing these objects need to be retrieved from disk. In the case that an index is present. In the case when a constraint is speciﬁed. we want to recommend a set of indexes within the constraint. the cost of answering the query can be lower. In this scenario we can deﬁne the cost of answering the query as the time it takes to scan the dataset. The degree to which an index prunes the potential answer set for a query determines its eﬀectiveness for the query.

we want an online index selection system that diﬀerentiates between low-cost index set changes and higher cost index set changes and also can make decisions about index set changes based on diﬀerent cost-beneﬁt thresholds. Typically.can beneﬁt the most. we are striving to develop a system that can recommend an evolving set of indexes for incoming queries over time. a cost model is embedded into the query optimizer to decide on the query plan. The cost associated with an index is calculated based on the number of estimated matches derived from the histogram and the dimensionality of the index. whether the query should be answered using a sequential scan or using an existing index. While maintaining the original query information for later use to determine estimated query cost. The histogram captures data correlations between only those attributes that could be represented in a selected index. Instead of using the query optimizer to estimate query cost. When there is a time-constraint.3. we need to automatically adjust the analysis parameters to increase the speed of analysis. such that the beneﬁt of index set changes outweighs the cost of making those changes. we conservatively estimate the number of matches associated with using a given index by using a multi-dimensional histogram abstract representation of the dataset. 3. we apply one abstraction to the query workload to convert each 59 . Increasing the size of the multi-dimensional histogram enhances the accuracy of the estimate at the cost of abstract representation size. we need to be able to estimate the cost of executing the queries.2 Approach Overview In order to measure the beneﬁt of using a potential index over a set of queries. with and without the index. Therefore. For online index selection.

By varying the support. The Database Administrator (DBA) has to decide when to run this wizard and over which workload. online index selection. 60 . Lowering the conﬁdence threshold improves analysis time by eliminating some lower dimensional indexes from consideration but can result in recommending indexes that cover a strict superset of the queried attributes. index selection considering a broader analysis space or frequent. The assumption is that the workload is going to remain static over time and in case it does change. All of the commercial index wizards work in design time. we aﬀect the speed of index selection and the ratio of queries that are covered by potential indexes. The ﬂexibility aﬀorded by the abstract representation we use allows it to be used for infrequent. the DBA would collect the new workload and run the wizard again. In the following two sections we present our proposed solution for index selection which is used for static index selection and as a building block for the online index selection. We further prune the analysis space using association rule mining by eliminating those subsets above a certain conﬁdence threshold. We perform frequent itemset mining over this abstraction and only consider those sets of attributes that meet a certain support to be potential indexes. Our technique diﬀers from existing tools in the method we use to determine the potential set of indexes to evaluate and in the quantization-based technique we use to estimate query costs.query into the set of attributes referenced in the query.

and a Multi-dimensional Histogram H. • Potential Index Set P 61 . The description of the outputs and how they are generated is given below.3.Figure 3. and the index selection loop.3 Proposed Solution for Index Selection The goal of the index selection is to minimize the cost of the queries in the workload given certain constraints. the query cost computation.1: Index Selection Flowchart 3. Figure 3. Given a query workload. Initialize Abstract Representations The initialization step uses a query workload and the dataset to produce a set of Potential Indexes P . We identify three major components in the index selection framework: the initialization of the abstract representations. conﬁdence. a dataset.1 provides a list of the notations used in the descriptions. In the following subsections we describe these components and the dataﬂow between them.1 shows a ﬂow diagram of the index selection framework. our framework produces a set of suggested indexes as an output. a Query Set Q. and several analysis parameters. Table 3. according to the support. the indexing constraints. and histogram size speciﬁed by the user.

used as a workload-optimized abstract representation of the data set S Suggested Indexes. and attributes 1 and 2 are queried together in greater than 10 percent of the queries. a representation of the query workload. the set of attribute sets under consideration as a suggested indexes Q Query Set. then a representation of the set 62 . It consists of the attribute sets in P that intersect with the query attributes. For instance. This set is computed using traditional data mining techniques. support of a set of attributes A is deﬁned as: n SA = i=1 1 if A ⊆ Qi 0 otherwise n where Qi is the set of attributes in the ith query and n is the number of queries. Formally.Symbol P Description Potential Set of Indexes. for each query. the set of attribute sets currently selected as recommended indexes i attribute set currently under analysis support the minimum ratio of occurrence of an attribute in a query workload to be included in P conf idence the maximum ratio of occurrence of an attribute subset to the occurrence of a set before the subset is pruned from P histogram size the number of bits used to represent a quantized data object Table 3. and estimated query costs H Multi-Dimensional Histogram.1: Index Selection Notation List The potential index set P is a collection of attribute sets that could be beneﬁcial as an index for the queries in the input query workload. if the input support is 10%. P consists of the sets of attributes that occur together in a query at a ratio greater than the input support. query ranges. Considering the attributes involved in each query from the input query workload to be a single transaction.

is deﬁned as: 63 . Formally. all subsets of attribute sets meeting the support will also be included as a potential index (in the example above both the sets {1} and {2} will be included). where A and B are disjoint. Note that because a subset of an attribute set that meets the support requirement will also necessarily meet the support. conﬁdence of an association rule {set of attributes A}→{set of attributes B }. Note that our particular system is built independently from a query optimizer. While data mining the frequent attribute sets in the query workload in determining P . the number of potential indexes increases. but the sets of attributes appearing in the predicates from a query optimizer log could just as easily be substituted for the query workload in this step.of attributes {1. an index built over the subset will likely not provide much beneﬁt over the query workload if an index is built over the attributes in the set. As the input support is decreased. Conﬁdence is the ratio of a set’s occurrence to the occurrence of a subset. the input conﬁdence is used to prune analysis space. If a set occurs nearly as often as one of its subsets. The conﬁdence of an association rule is deﬁned as the ratio that the antecedent (Left Hand Side of the rule) and consequent (Right Hand Side of the rule) appear together in a query given that the antecedent appears in the query.2} will be included as a potential index. we also maintain the association rules for disjoint subsets and compute the conﬁdence of these association rules. In order to enhance analysis speed with limited eﬀect on accuracy. Such an index will only be more eﬀective in pruning data space for those queries that involve only the subset’s attributes.

another possible solution in the case where a large number of attributes are frequent together would be to partition a large closed frequent itemset into disjoint subsets for further examination. then we will prune the attribute set {1} from P . then the conﬁdence {2}→{1} = 0. a more eﬃcient algorithm such as the FP-Tree [30] could be applied if the attribute sets associated with queries are too large for the apriori technique to be eﬃcient. We can also set the conﬁdence level based on attribute set cardinality. if every time attribute 1 appears.n CA→B = i=1 n 1 if (A ∪ B) ⊆ Qi 0 otherwise 1 if A ⊆ Qi 0 otherwise i=1 where Qi is the set of attributes in the ith query and n is the number of queries. If we have set the conﬁdence input to 0. • Query Set Q 64 . The conﬁdence can contain more than one value depending on set cardinality.6. we want to be more conservative with respect to pruning attribute subsets. While it is desirable to avoid examining a high-dimensional index set as a potential index.0. but we will keep attribute set {2}. Techniques such as CLOSET [44] could be used to arrive at the initial closed frequent itemsets. In our example. While the apriori algorithm was appropriate for the relatively low attribute query sets in our domain.5. If attribute 2 appears without attribute 1 as many times as it appears with attribute 1. Since the cost of including extra attributes that are not useful for pruning increases with increased indexed dimensionality. attribute 2 also appears then the conﬁdence of {1}→{2} = 1.

we only aggregate the count of rows with each unique bucket representation because we are just interested in estimating 65 . • Multi-Dimensional Histogram H An abstract representation of the data set is created in order to estimate the query cost associated with using each query’s possible indexes to answer that query. Data for an attribute that has been assigned b bits is divided into 2b buckets. In order to handle data sets with uneven data distribution. each query has an identiﬁed set of possible indexes for that query. This representation is in the form of a multi-dimensional histogram H. These are the indexes in the potential index set P that share at least one common attribute with the query. The input histogram size dictates the number of bits used to represent each unique bucket in the histogram. A single bucket represents a unique bit representation across all the attributes represented in the histogram. The histogram is built by converting each record in the data set to its representation in bucket numbers. The number of bits that each of the represented attributes gets is proportional to the log of that attribute’s support. we deﬁne the ranges of each bucket so that each bucket contains roughly the same number of points. There is no reason to sacriﬁce data representation resolution for attributes that will not be evaluated. As we process data rows. then it can not be part of an attribute set appearing in P . If a single attribute does not meet the support. These bits are designated to represent only the single attributes that met the input support in the input query workload. At the end of this step.The query set Q is the abstract representation of the query workload is initialized by associating the potential indexes that could be beneﬁcial for each query with that query. This gives more resolution to those attributes that occur more frequently in the query workload.

0 1 01. for attributes 2 and 3 values from 1-5 quantize to 0. A higher accuracy in representation is achieved by using more bits to quantize the attributes that are more frequently queried. 3 and 4 quantize to 01.0.2 shows a simple multi-dimensional histogram example.0.query cost.1 1 01.0 1 11. For illustration.1.0 2 00. and 2 bits to quantize attribute 1. values 1 and 2 quantize to 00. Note that the multi-dimensional histogram is based on a scalar quantizer designed on data and access patterns. Note that we do not maintain any entries in the histogram for bit representations that have no occurrences. This histogram covers 3 attributes and uses 1 bit to quantize attributes 2 and 3. The .1. and values from 6-10 quantize to 1.’s in the ’value’ column denote attribute boundaries (i.0 2 11. A1 2 4 1 6 3 2 5 8 3 9 Sample Dataset A2 A3 Encoding 5 5 0000 8 3 0110 4 3 0000 7 1 1010 2 2 0100 2 6 0001 6 5 1010 1 4 1100 8 7 0111 3 8 1101 Histogram Value Count 00. Table 3. For attribute 1. 5-7 quantize to 10.e.2: Histogram Example 66 . assuming it is queried more frequently than the other attributes.0.0. as opposed to just data in the traditional case.1 1 10.0. In this example.1.0 1 01. and 8 and 9 quantize to 11.1 1 Table 3. So we can not have more histogram entries than records and will not suﬀer from exponentially increasing the number of potential multi-dimensional histogram buckets for high-dimensional histograms. attribute 1 has 2 bits assigned to it).

our abstract query set representation has estimated costs for each index that could improve the query cost. At the end of this step.Query Cost Calculation Once generated. we also keep a current cost ﬁeld. For example. one could apply a cost function that amortized the cost of loading an index over a certain number of queries or use a function tailored to the type of index that is used. we aggregate the number of matches we ﬁnd in the multi-dimensional histogram looking only at the attributes in the query that also occur in the index (bits associated with other attributes are considered to be don’t cares in the query matching logic). Many cost functions have been proposed over the years. we also initialize an empty set of suggested indexes S. which is the index type used for this work. To estimate the query cost.F BF = d 1 Cef f +1 67 . we then apply a cost function based on the number of matches we obtain using the index and the dimensionality of the index. the expected number of data page accesses is estimated in [22] by: d Ann. For an R-Tree. For each query in the query set representation. the abstract representations of the query set Q and the multidimensional histogram H are used to estimate the cost of answering each query using all possible indexes for the query. The cost function can be varied to accurately reﬂect a cost model for the database system. For a given query-index pair.mm. which we initialize to the cost of performing the query using sequential scan. • Cost Function A cost function is used to estimate the cost associated with using a certain index for a query. At this point.

Each of the cost estimate formulas require a range radius.em. and Cef f is the capacity of a data page. These cost estimates also assume that data distribution is independent between attributes. d is the dataset dimensionality. A more recently proposed cost model is given in [8] where the expected number of pages accesses is determined as: d Ar. and that the data is uniformly distributed throughout the data space. However. we apply a cost estimate that is based on the actual matches that occur over the multi-dimension histogram over the attributes that form a potential index. In order to overcome these limitations. they have certain characteristics that make them less than ideal for the given situation. Using actual matches 68 . While these published cost estimates can be eﬀective to estimate the number of page accesses associated with using a multi-dimensional index structure under certain conditions. The cost model for R-trees we use in this work is given by (d(d/2) ∗ m) where d is the dimensionality of the index and m is the number of matches returned for query matching attributes in the multi-dimensional histogram.where d is the dimensionality of the dataset and Cef f is the number of data objects per disk page. Therefore the formulas break down when assessing the cost of a query that is an exact match query in one or more of the query dimensions. N is the number of data objects.ui (r) = 2r · d N 1 +1− Cef f Cef f where r is the radius of the range query. this formula assumes the number of points N approaches to inﬁnity and does not consider the eﬀects of high dimensionality or correlations.

It also ties the cost estimate to the actual data characteristics (i. A penalty is imposed on a potential index by the dimensionality term. The cost estimate provided is conservative in that it will provide a result that is at least as great as the actual number of matches in the database. It could model query cost with greater accuracy. we use a greedy algorithm that takes into account the indexes already selected to iteratively select indexes that would be 69 . and we assumed that we would retrieve data from disk for all query matches. Index Selection Loop After initializing the index selection data structures and updating estimated query costs for each potentially useful index for a query. incorporates both data correlation between attributes and data distribution.eliminates the need for a range radius. while the published models will produce results that are dependent only on the range radius for a given index structure). for example by crediting complete attribute coverage for coverage queries.e. Given equal ability to prune the space. The cost function could be more complicated in order to more accurately model query costs. the multi-dimensional subspace pruning that can be achieved using diﬀerent index possibilities is taken into account. and additional cost of traversing the higher dimension structure. such as B+-trees. We used this particular cost model because the index type was appropriate for our data and query sets. There is additional cost associated with higher dimensionality indexes due to the greater number of overlaps of the hyperspaces within the index structure. It could also reﬂect the appropriate index structures used in the database system. By evaluating the number of matches over the set of attributes that match the query. a lower dimensional index will translate into a lower cost.

At the end of a loop iteration. The input indexing constraints provides one of the loop stop criteria. The overall speed of this algorithm is coupled with the number of potential indexes analyzed. or total number of dimensions indexed. a check is made to determine if the index selection loop should continue. then the loop exits. If the index does not provide any positive beneﬁt for the query. If no potential index yields further improvement. so the analysis time can be reduced by increasing the support or decreasing the conﬁdence.appropriate for the given query workload and data set. For each index in the potential index set P . the current query cost is replaced by the improved cost. After each i is selected. The potential index i that yields the highest improvement over the query set Q is considered to be the best index. total index size. we traverse the queries in query set Q that could be improved by that index and accumulate the improvement associated with using that index for that query. The indexing constraint could be any constraint such as the number of indexes. This includes actions such as eliminating potential indexes that do not provide better cost estimates than the current cost for any query and pruning from consideration those queries whose best index is already a member of the set of suggested indexes. or the indexing constraints have been met. For the queries that beneﬁt from i. no improvement is accumulated. when possible. Index i is removed from potential index set P and is added to suggested index set S. The set of suggested indexes S contains the results of the index selection algorithm. The improvement for a given query-index pair is the diﬀerence between the cost for using the index and the query’s current cost. we prune the complexity of the abstract representations in order to make the analysis more eﬃcient. 70 .

we use control feedback to monitor the performance of the current set of indexes for incoming queries and determine when adjustments should be made to the index set.Diﬀerent strategies can be used in selecting a best index. the output of a system is monitored and based on some function involving the input and output. By monitoring the query workload and detecting when there is a change on the query pattern that generated the existing set of indexes. this may result in recommending a lower-dimensioned index and later in the algorithm a higher-dimensioned index that always performs better. 3. 2. If the indexing constraint is based on total index size. we are able to maintain good performance as query patterns evolve. then beneﬁt per index size unit may be a more appropriate measure. The recommendation set can be pruned in order to avoid recommending an index that is non-useful in the context of the complete solution.3. The strategy provided assumes a indexing constraint based on the number of indexes and therefore uses the total beneﬁt derived from the index as the measure of index ’goodness’. the input to the system is readjusted through a control feedback loop. However. In our approach. Our system input is a set of indexes and a set of incoming queries rather than a simple input. Our situation is analogous but more complex than the typical electrical circuit control feedback system in several ways: 1. In a typical control feedback system. such as an electrical signal. The system output must be some parameter that we can measure and use to make decisions about changing the input. Query performance is the obvious 71 .4 Proposed Solution for Online Index Selection The online index selection is motivated by the fact that query patterns can change over time.

then the output overshoots the target and oscillates around the desired output before reaching it. because lower query performance could be related to other aspects rather than the index set. we may have a set of attributes that appears in queries frequently enough that our system indicates that it is beneﬁcial to create an index over those attributes. then it fails to cause the output to change based on input changes in a timely manner. Control feedback systems can fail to be eﬀective with respect to response time. However. Both situations are undesirable and should be designed out of the system. One is for ﬁne grain control and is used to recommend minor. 72 .parameter to monitor.2 represents our implementation of dynamic index selection. Figure 3. The control system can be too slow to respond to changes. If the system is too slow. The other loop is for coarse control and is used to avoid very poor system performance by recommending major index set changes. We implement two control feedback loops. Our system input is a set of indexes and a set of incoming queries. We do not have a predictable function to relate system input and output because of the non-determinism associated with new incoming queries. For example. or it can respond too quickly. inexpensive changes to the index set. Our system simulates and estimates costs for the execution of incoming queries. but there is no guarantee that those attributes will ever be queried again. 3. System output is the ratio of potential system performance to actual system performance in terms of database page accesses to answer the most recent queries. Each control feedback loop has decision logic associated with it. our decision making control function must necessarily be more complex than a basic control system. If it responds too quickly.

In this case. For clarity.Figure 3. System The system simulates query execution over a number of incoming queries. we determine which of the current indexes in I most eﬃciently answers 73 . The abstract representation of the last w queries stored as W . This representation is similar to the one kept for query set Q in the static index selection. where w is an adjustable window size parameter.2: Dynamic Index Analysis Framework System Input The system input is made up of new incoming queries and the current set of indexes I. when a new query q arrives. a notation list for the online index selection is included as Table 3. which is initialized to be the suggested indexes S from the output of the initial index selection algorithm. W is used to estimate performance of a hypothetical set of indexes Inew against the current index set I.3.

Consider the possible new indexes Pnew to be the set of attribute sets that currently meet the input support and conﬁdence over the last w queries. This information is used in the control feedback loop decision logic. The query performance using I is the summation of the costs of queries using the best index from I for the given query. The system also keeps track of the current potential indexes P . we compare the query performance using the current set of indexes I to the performance using a hypothetical set of indexes Inew . the set of attribute sets in consideration to be indexes before q arrives New Potential Indexes. and the current multi-dimensional histogram H. System Output In order to monitor the performance of the system. The hypothetical cost is calculated diﬀerently based on the comparison of P and Pnew .Symbol I Inew w W q P Pnew iq Description Current set of attribute sets used as indexes Hypothetical set of attribute sets used as indexes window size abstract represention of the last w queries the current query under analysis Current Potential Indexes.3: Notation List for Online Index Selection this query and replace the oldest query in W with the abstract representation of q. We also incrementally compute the attribute sets that meet the input support and conﬁdence over the last w queries. the set of attribute sets in consideration to be indexes after q arrives the attribute set estimated to be the best index for query q Table 3. and the identiﬁed best index iq from P or Pnew for the new incoming query: 74 .

minor changes to index set. The hypothetical cost is the cost over the last w queries using Inew . P = Pnew and i is not in I. to the actual performance is used in the control loop decision logic. The ratio of the hypothetical cost. In this case we bypass the control loops since we could do no better for the system by changing possible indexes. P = Pnew and i is in I. which indicates potential performance. We compute the hypothetical cost of these queries to be the real number of matches from the database.1. In this case we bypass the control loops since we could do no better for the system by changing possible indexes. 4. and appropriate changes are made to update the system data structures. Increasing the input minor change threshold causes the frequency of minor changes to also increase. 75 . We recompute a new set of suggested indexes Inew over the last w queries. Hypothetical cost for other queries is the same as the real cost. This loop is entered in case 2 as described above when the ratio of hypothetical performance to actual performance is below some input minor change threshold. Fine Grain Control Loop The ﬁne grain control loop is used to recommend low cost. P = Pnew and i is in I. Then the indexes are changed to Inew . 3. We traverse the last w queries and determine those queries that could beneﬁt from using a new index from Pnew . 2. P = Pnew and i is not in I.

The system performance and rate of index change can be monitored and used in order to tune the control feedback system itself.5 System Enhancements In the following subsections.Coarse Control Loop The coarse control loop is used to recommend more costly. where the system continously adds and drops indexes. and a new set of suggested indexes Inew is generated. Increasing the input major change threshold increases the frequency of major changes. a system that is slow to respond results in recommending useful indexes long after they ﬁrst could have a positive eﬀect. abstract representations are recomputed. we present two system enhancements that provide further robustness and scalability to our framework. When this condition is 76 . Then the static index selection is performed over the last w queries. Appropriate changes are made to update the system data structures to the new situation.3. This loop is entered in case 4 as described above when the ratio of hypothetical performance to actual performance is below some input major change threshold. this indicates a non-responsive system. If the actual number of query results is much lower than the estimated cost using the recommended index set over a window of queries. Self Stabilization Control feedback systems can be either too slow or too responsive in reacting to input changes. It could also fail to recommend potentially useful indexes if thresholds are set so that the system is insensitive to change. 3. In our application. but changes with greater impact on future performance to the index set. A system that is too responsive can result in system instability.

However. At the other extreme would be to perform the index recommendation at each site. index-based pruning based on attributes and data at multiple locations would not be possible. However. A single set of indexes would be recommended for the entire system. 77 . This approach would allow for the creation of indexes that maximize pruning over the global set of dimensions. If the frequency of index change is too high with little or no improvement in query performance. an oversensitive system or unstable query pattern is indicated. Partitioning the Algorithm The system can also be applied to environments where the database is partitioned horizontally or vertically at diﬀerent locations. and we can reperform to analysis applying a reduced support. This increases the probability that Pnew will be diﬀerent from P . Recommended indexes would be determined based on global trends. it would not optimize query performance at the site level. This gives us a more complete P with respect to answering a greater portion of queries. This would maintain the abstract representation and collect global query patterns over the entire system. A solution at one extreme is to maintain a centralized index recommendation system. This allows tight control of the indexes to reﬂect the data at each site. Indexes would be recommended based on the query traﬃc to the data at that site. a diﬀerent abstract representation would be built based on the data at each speciﬁc site given requests to the data at that site. In this case.detected. the system can be made more responsive by cutting the window size. We can reduce the sensitivity by increasing the window size or increasing the support level during recomputation of new recommended indexes.

A global set of indexes can be centrally recommended based on a global abstract representation of the data set with low support thresholds (the centralized location will store a larger set of globally good indexes). eliminating some traﬃc.000 records consisting of 100 dimensions of uniformly distributed integers between 0 and 999.4.a set of 6500 records consisting of 360 dimensions of daily stock market prices. The variation in data sets is intended to show the applicability of our algorithm to a wide range of data sets and to measure the eﬀect that data correlation has on results. The query patterns emanating from each site can be mined in order to ﬁnd which of those global indexes would be appropriate to maintain at the local site. • stocks .4 3. 3. The data is not correlated. This data is extremely correlated.a set of 100. 78 .1 Empirical Analysis Experimental Setup Data Sets Several data sets were used during the performance of experiments.This approach also requires traﬃc to each site that contains potentially matching data. Data sets used include: • random . A hybrid approach can provide the beneﬁts of index recommendation based on global data and query patterns while optimizing the index set at diﬀerent requesting locations.

while others are not at all correlated. the conf idence parameter for the experiments is 1. Analysis Parameters The eﬀect of varying several analysis input parameters including support. and online indexing control feedback decision thresholds was analyzed. So. Unless otherwise stated each query in experiments is a point query with respect to 79 . query workload ﬁles were generated by merging synthetic query histories and query histories from real-world applications with different data sets. if a historical query involved the 3rd and 5th attribute. Unless otherwise speciﬁed. a SELECT type query is generated from the data set where the 3rd attribute is equal to the value of the 3rd attribute of the nth record and the 5th attribute is equal to the value of 5th attribute of the nth record. and has a variable query selectivity. and the nth record was randomly selected from the data set.a set of 33619 records of major league pitching statistics from between the years of 1900 and 2004 consisting of 29 dimensions of data. Histories and data were merged by taking a random record from a data set and the numerical identiﬁer of the attributes involved in the synthetic or historical query in order to generate a point query. Query Workloads It is desirable to explore the general behavior of database interactions without inadvertently capturing the coupling associated with using a speciﬁc query history on the same database. Some dimensions are correlated with each other.0.• mlb . Therefore. multidimensional histogram size. This gives a query workload that reﬂects the attribute correlations within queries.

12. {8.2.4} together.3. And 256 bits are used for the multi-dimension histogram.3 shows how accurately the proposed static index selection technique can perform with relaxed analysis constraints. 20% {5.13.16.659 queries executed from a clinical application. There are no constraints placed on the number of selected indexes. synthetic .7}. some initial portion of the queries are used for some experiments.6. Due to the size of this query set. hr .01. 3. 20% {15. Attributes that are not covered by the query can be any value and still match the query. the estimated cost of using recommended indexes. a low support value. For this scenario. Over the last 300 queries. The query distribution ﬁle has 64 distinct attributes. and the true number of answers for 80 . 20%.2 Experimental Results Index Selection with Relaxed Constraints Figure 3. and the remaining 40% are between 1 to 5 attributes that could be any attribute. is used.860 queries executed from a human resources application.14}.35.the attributes covered by the query.3 shows the cost of performing a sequential scan over 100 queries using the indicated data sets.500 randomly generated queries. 0. The query distribution ﬁle has 54 attributes. 3. The distribution of the queries over the ﬁrst 200 queries is 20% involve attributes {1. and the remaining queries involve between 1 and 5 attributes that could be any attribute.19}. the distribution shifts to 20% covering attributes {11. clinical . Figure 3. 2. 20% {18.4.17}. These query histories form the basis for generating the query workloads used in our experiments: 1.9}.

Table 3. regardless of any reasonable factor used. Note that the cost for sequential scan is discounted to account for the fact that sequential access is signiﬁcantly cheaper than random access. the graph will show the cost of using our indexes much closer to the ideal number of page accesses than the cost of sequential scan. A factor of 10 is applied as the ratio of random access cost to sequential access cost. Sequential Scan. where n is the number of indexes suggested by the proposed algorithm.3: Costs in Data Object Accesses for Ideal. AutoAdmin [15]. AutoAdmin. The naive algorithm uses indexes for the ﬁrst n most frequent itemsets in the query patterns.Figure 3. costs for indexes proposed by AutoAdmin and a naive algorithm are also provided. While the actual ratio of a given environment is variable.4 compares the proposed index selection algorithm with relaxed constraints against SQL Server’s index selection tool. and the Proposed Index Selection Technique using Relaxed Constraints the set of queries. 81 . For comparative purposes. Naive. using the data sets and query workloads indicated.

both algorithms suggest indexes which will improve all of the queries. These are all the queries that have selectivities low enough that an index can be beneﬁcial.4: Comparison of proposed index selection algorithm with AutoAdmin in terms of analysis time and % queries improved For the stock dataset using both the clinical and hr workloads. our ﬁrst index recommendation is a 2-dimension index built over attributes 11 and 44. the amount of the query improvement will be very similar using either recommended index set. but ﬁnds indexes that improve 87 % of the queries.Data Set/ Workload stock/ clinical stock/ hr mlb/ clinical Analysis Tool Time(s) AutoAdmin 450 Proposed 110 AutoAdmin 338 Proposed 160 AutoAdmin 15 Proposed 522 % Queries Improved 100 100 100 100 0 87 Number of Indexes 23 18 20 16 0 16 Table 3. Since the selectivity of these queries is low (the queries return a low number of matches using any index that contains a queried attribute). Our index selection takes longer in this instance. The proposed algorithm executes in less time and generates indexes which are more representative of the query patterns in the workload and allow for greater subspace pruning. However. SQL Server’s index recommendations are all single-dimension indexes for all attributes that appear in the workload. SQL Server quickly recommended no indexes. For the mlb dataset. the most common query in the clinical workload is to query over both attributes 11 and 44 together. It should be noted that the indexes selected by AutoAdmin are appropriate based on the cost models for the underlying index structure used in 82 . For example.

we maintain fewer potential indexes in P and need to analyze fewer attribute sets per index. Results show the total number of query-index pairs analyzed over the query set in the index selection loop and the total estimated improvement in query performance in terms of data objects accessed over the query set as the index selection parameters vary. the strategy yields nearly identical estimated cost although only 34-44% of the query/index pairs need to be evaluated. a single dimensional index is not able to prune the subspace enough to make the application of that index worthwhile compared to sequential scan. Eﬀect of Support and Conﬁdence Table 3. The results are shown for the stock data set over the clinical query workload. the single dimension indexes that are recommended are the least costly possible indexes to use for the query set. an initial set of indexes is recommended based on the ﬁrst 100 queries given the stated analysis parameters. Using a conﬁdence level of 0% is equivalent to using the maximal itemsets of attributes that meet support criteria as recommended indexes. As conf idence decreases.5 presents results on analysis complexity and expected query improvement as support and conf idence are varied. For this example. For this particular experiment. In each of the experiments that show the performance of the adaptive system over a sequence of queries. Baseline Online Indexing Results A number of experiments were performed to demonstrate the eﬀectiveness and characteristics of the adaptive system. Since the underlying index type for SQL Server is B+-trees.SQL Server. which do not index multiple dimensions concurrently. This decreases analysis time but shows very little eﬀect on overall performance. 83 .

These index set changes are considered to take place before the next query. The estimated cost of the last 100 queries given the indexes available at the time of the query are accumulated and presented. a major change threshold of 0. A number of data set and query history combinations are examined. changes are made to index set.4 through 3. The performance gets much worse and does not get better for the static system. a window size of 100. this workload drastically changes. For the online system.5: Comparison of Analysis Complexity and Query Performance as support and conf idence vary.9 and a minor change threshold of 0. 128 bits for each multi-dimension histogram entry. This demonstrates the evolving suitability of the current index set to incoming queries. clinical workload New queries are evaluated.6 show a baseline comparison between using the query pattern change detection and modifying the indexes and making no change to the initial index set. At query 200. The baseline parameters used are 5% support. and depending on the parameters of the adaptive system. stock dataset. and this is evident in the query cost for the static system. Figures 3. the performance degrades a little 84 .Support Conﬁdence 2 4 6 8 10 Query/Index 100 50 229 229 198 198 185 185 178 178 170 170 Pairs 0 90 85 73 66 58 Total 100 57710 55018 46924 42533 37527 Improvement 50 0 57710 57603 55018 55001 46924 46907 42533 42516 37527 37510 Table 3. Figure 3. an indexing constraint of 10 indexes.4 shows the comparison for the random data set using the synthetic query workload.95.

random Dataset. 85 . synthetic Query Workload when the query patterns are changing and then improve again once the indexes have changed to match the new query patterns. Figure 3.4: Baseline Comparative Cost of Adaptive versus Static Indexing. The synthetic query workload was generated speciﬁcally to show the eﬀect of a changing query pattern. Performance for the static system degrades substantially when the patterns change until the point that they change back. This real query workload also changes substantially before reverting back to query patterns similar to those on which the static index selection was performed.Figure 3. The adaptive system is better able to absorb the change.5 shows a comparison of performance for the random data set using the clinical query workload.

Figure 3.5: Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, clinical Query Workload

86

Figure 3.6: Baseline Comparative Cost of Adaptive versus Static Indexing, stock Dataset, hr Query Workload

Figure 3.6 shows the comparison for the stock data set using the ﬁrst 2000 queries of the hr query workload. The adaptive system shows consistently lower cost than the static system and in places signiﬁcantly lower cost for this real query workload. The eﬀect of range queries was also explored. The random data set was examined using the synthetic workload using ranges instead of points. As the ranges for a given attribute increased from a point value to 30% selectivity, the cost saved for top 3 proposed indexes (which were the index sets [1,2,3,4], [5,6,7], and [8,9] for all ranges), the overall cost saved decreased. This is due to the increase in the size of the answer 87

set. When the range for each attribute reached 30%, no index was recommended because the result set became large enough that sequential scan was a more eﬀective solution. Online Index Selection versus Parametric Changes Changing online index selection parameters changes the adaptive index selection in the following ways: Support - decreasing the support level has the eﬀect of increasing the potential number of times the index set will be recalculated. It also makes the recalculation of the recommended indexes themselves more costly and less appropriate for an online setting. Figure 3.7 shows the eﬀect of changing support on the online indexing results for the random data set and synthetic query workload, while Figure 3.8 shows the eﬀect on the stock data set using the ﬁrst 600 queries of hr query workload. These graphs were generated using the baseline parameters and varying support levels. As expected, lower support levels translate to lower initial costs because more of the initial query set are covered by some beneﬁcial index. For the synthetic query workload, when the patterns changed, but do not change again (e.g. queries 100-200 in Figure 3.7), lower support levels translated to better performance. However, for the hr query workload, the 10% support threshold yields better performance than the 6% and 8% thresholds for some query windows. The frequent itemsets change more frequently for lower support levels, and these changes dictate when decisions occur. These runs made decisions at points that turned out to be poor decisions, such as eliminating an index that ended up being valuable. Little improvement in cost performance is achieved between 4% support and 2% support. Over-sensitive control feedback can

88

random Dataset. Online Indexing Control Feedback Decision Thresholds .7: Comparative Cost of Online Indexing as Support Changes.Figure 3.9 shows the eﬀect of varying the major change threshold in the coarse control loop for the random 89 . independent of the extra overhead that oversensitive control causes.increasing the thresholds decreases the response time of aﬀecting change when query pattern change does occur. In a real system. synthetic Query Workload degrade actual performance. Figure 3. It also has the eﬀect of increasing the number of times the costly reanalysis occurs. one would need to balance the cost of continued poor performance against the cost of making an index set change.

stock Dataset.Figure 3.8: Comparative Cost of Online Indexing as Support Changes. hr Query Workload 90 .

This indicates that this value should be carefully tuned based on the expected frequency of query pattern changes. Multi − dimensional Histogram Size . Experiments demonstrated that if the representation quality of the histogram was suﬃcient. a value of 0 translates to never making a change.data set and synthetic query workload.11 shows an example of the eﬀect of varying the multi-dimensional histogram size.increasing the number of bits used to represent a bucket in the multi-dimensional histogram improves analysis accuracy at the cost of histogram generation time and space. One reason for this is that artiﬁcially higher costs are calculated because of the lower histogram resolution. As well. very little beneﬁt was achieved through greater resolution. Figure 3. 91 . The graphs show no real beneﬁt once the major change threshold increases (and therefore frequency of major changes) beyond 0. Figure 3. These graphs were generated using the baseline parameters (except that the random graph uses a support of 10%).9 are best in terms of response time when a change occurs.6. Figure 3.9 shows that a value of 1 or 0. The major change threshold is varied between 0 and 1. Here. and a value of 1 means making a change whenever improvement is possible.3 shows the best performance once the query patterns have stabilized. some beneﬁcial index changes are not recommended because of these overly conservative cost estimates.10 shows the eﬀect of changing the major change threshold for the stock data set and hr query workload. and only varying the major change threshold in the coarse control loop. Using a 32 bit size for each unique histogram bucket yields higher costs than by using greater histogram resolution. but a value of 0.

Figure 3.9: Comparative Cost of Online Indexing as Major Change Threshold Changes. random Dataset. synthetic Query Workload 92 .

Figure 3. stock Dataset. hr Query Workload 93 .10: Comparative Cost of Online Indexing as Major Change Threshold Changes.

Figure 3. random Dataset.11: Comparative Cost of Online Indexing as Multi-Dimensional Histogram Size Changes. synthetic Query Workload 94 .

5 Conclusions A ﬂexible technique for index selection is introduced that can be tuned to achieve diﬀerent levels of constraints and analysis complexity. The technique uses a generated multi-dimension histogram to estimate cost. However. The foundation provided here will be used to explore this tradeoﬀ and to develop an improved utility for real-world applications. which may not be able to take advantage of knowledge about correlations between attributes. the system can be tuned to favor analysis time or pattern change recognition. 95 . The proposed technique aﬀords the opportunity to adjust indexes to new query patterns. Indexes are recommended in order to take advantage of multi-dimensional subspace pruning when it is beneﬁcial to do so. and as a result is not coupled to the idiosyncrasies of a query optimizer. more complex analysis can lead to more accurate index selection over stable query patterns. A more constrained. A low constraint.3. A limitation of the proposed approach is that if index set changes are not responsive enough to query pattern changes then the control feedback may not aﬀect positive system changes. this can be addressed by adjusting control sensitivity or by changing control sensitivity over time as more knowledge is gathered about the query patterns. less complex analysis is more appropiate to adapt index selection to account for evolving query patterns. These experiments have shown great opportunity for improved performance using adaptive indexing over real query patterns. By changing the threshold parameters in the control feedback loop. A control feedback technique is introduced for measuring performance and indicating when the database system could beneﬁt from an index change.

the frequency of index set change has been aﬀected through the use of the online indexing control feedback thresholds. It is not feasible to perform real-time analysis of incoming queries and generate new indexes when the patterns change. The kinds of low cost changes that this threshold would trigger are not the kinds of changes that make enough impact to be that much better than the existing index set. and one potential index is now more valuable than a suggested index.From initial experimental results. This would change if the indexing constraints were very low. This could mean moving an index from local storage to main memory. the ﬁne grain control threshold was not triggered unless we contrived a situation such that it would be triggered. The proposed online change detection system utilized two control feedback loops in order to diﬀerentiate between inexpensive and more time consuming system changes. or from remote storage to local storage depending 96 . moved to active status. it seems that the best application for this approach is to apply the more time consuming no-constraint analysis in order to determine an initial index set and then apply a lightweight and low control sensitivity analysis for the online query pattern change detection in order to avoid or make the user aware of situations where the index set is not at all eﬀective for the new incoming queries. In practice. Alternatively. Potential indexes could be generated prior to receiving new queries. the frequency of performance monitoring could be adjusted to achieve similar results and to appropriately tune the sensitivity of the system. Index creation is quite time-consuming. and when indicated by online analysis. In this thesis. This monitoring frequency could be in terms of either a number of incoming queries or elapsed time.

97 . Additionally. the analysis could prompt a server to create a potential index as the analysis becomes aware that such an index is useful.on the size of the index. it could be called on by the local machine. and once it is created.

4. Oracle [2].Albert Einstein (attributed).1 Introduction Bitmap indexes are widely used in data warehouses and scientiﬁc applications to eﬃciently query large-scale data repositories. The multi-attribute access structure may not be eﬀective for general queries as a multi-attribute index can only prune results for those attributes that match selection criteria of the query. e. They are successfully implemented in commercial Database Management Systems. Informix [33]. However.CHAPTER 4 Multi-Attribute Bitmap Indexes The signiﬁcant problems we have cannot be solved at the same level of thinking with which we created them. It describes the changes required when modifying a traditionally single attribute access structure to handle multiple attributes in terms of enhancing the query processing model and adding cost-based index selection. IBM [46]. This chapter presents multi-attribute bitmap indexes.g. 98 .. . Query processing becomes more complex as the number of potential ways in which a query can be executed increases. this added pruning power comes with a cost. An access structure built over multiple attributes aﬀords the opportunity to eliminate more of a potential result set within a single traversal of the structure than a single attribute structure.

A range query over a value-encoded bitmap can be answered by ORing together the bit vectors that correspond to the range for each attribute. or range of values for a given attribute. category. In their simplest and most popular form. General compression techniques are not desirable as they require bitmaps to be decompressed before executing the query. The CPU time required to answer queries is related the total length of the compressed bitmaps used in the query computation. 99 . and then ANDing together these intermediate results between attributes. and have found many other applications such as visualization of scientiﬁc data [56]. on the contrary. allow the query execution over the compressed bitmaps. Run length encoders. equality encoded bitmaps also known as projection indexes. The two most popular are Byte-Aligned Bitmap Code (BBC) [2] and Word Aligned Hybrid (WAH) [57]. In general. Queries are then executed by performing the appropriate bit operations over the bit vectors needed to answer the query. One disadvantage of bitmap indexes is the amount of space they require. the mapping can be used to encode a speciﬁc value. Bitmap indexes can provide very eﬃcient performance for point and range queries thanks to fast bit-wise operations supported by hardware. Compression is used to reduce the size of the overall bitmap index. The resultant bit vector has set bits for the tuples that fall within the range queried. Each attribute is encoded independently.Sybase [20]. are constructed by generating a set of bit vectors where a ’1’ in the ith position of a bit vector indicates that the ith tuple maps to the attributevalue associated with that particular bit vector.

the bitvectors built over two or more attributes are sparser. However. the AND operation is used to combine the bitmaps. such as B+-Trees or projection indexes. if a 2-dimensional index such as an R-tree or Grid File exists over the two queried attributes. which allows for greater compression. 100 . a multi-dimensional bitmap over a combination of attributes and values identiﬁes those records that match the query combination and avoids a further bit operation in order to answer the query criteria. are available over the two attributes queried. the query would ﬁnd candidates for each attribute and then combine the candidate sets to obtain the ﬁnal answer. This results in a smaller result set if the two dimensions indexed are not highly correlated. just as traditional multi-attribute indexes can be more eﬀective in pruning data search space and reducing query time. Similarly.These traditional bitmap indexes are analogous to single dimension indexes in traditional indexing domains. Also a set intersection operation is not required. The resulting smaller bitmaps makes the processing of the bitmap faster which translates into faster query execution time. As an example. First. consider a 2-dimensional query over a large dataset. This cost savings comes from two sources. then the data space can be pruned simultaneously over both dimensions. the number of bitwise operations is potentially reduced as more than one attribute are evaluated simultaneously. the set intersection operation is used to combine the sets. In the case of B+-Trees. In order to avoid sequential scan over the data we build a index access structure that can direct the search by pruning the space. Second. multiattribute bitmap indexes could be applied to more eﬃciently answer queries. However. In the case of projection indexes. If single dimensional indexes.

4. Bitmap tables need to be compressed to be eﬀective on large databases. determine the bitmap query execution steps that results in the best query performance with minimal overhead. Although the techniques incorporate heuristics in order to control analysis search space and query execution overhead. The two most popular compression techniques for bitmaps are the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) code 101 . experiments show signiﬁcant opportunity for query execution performance improvement. The goals of this research are to: • Evaluate of query performance of multi-attribute bitmap enhanced data management system compared to the current bitmap index query resolution model. • Support ad hoc queries where information about expected queries is not available and support the experimental analysis provided herein. We propose an expected cost-improvement based technique to evaluate potential attribute set combinations. The proposed approaches to attribute combination and query processing are based on measuring the expected query cost savings of options.2 Background Bitmap Compression. • Given a selection query over a set of attributes and a set of single and multiple attribute bitmaps available for those attributes.In this thesis. the use of Multi-Attribute Bitmaps is proposed to improve the performance of queries in comparison with traditional single attribute indexes.

The fourth group is less than 31 bits and thus is stored as a literal. Unlike traditional run length encoding.78*0. The most signiﬁcant bit indicates the type of word we are dealing with. and each ﬁll word represents a multiple of 31 bits.20*0. WAH imposes the word-alignment requirement on the ﬁlls.20*0. and the third line shows the hexadecimal representation of the groups.1 shows how the bitmap is divided into 31-bit groups. Figure 4. and the remaining (w − 2) bits store the ﬁll length. these schemes mix run length encoding and direct storage. WAH is simpler because it only has two types of words: literal words and ﬁll words. Let w denote the number of bits in a word. the lower (w − 1) bits of a literal word contain the bit values from the bitmap. we assume 32 bit words. [59]. then the second most signiﬁcant bit is the ﬁll bit.4*1. Under this assumption.4*1.6*0 400003C0 400003C0 62*0 00000000 00000000 80000002 10*0. followed by the 0 ﬁll 102 . In this example. If the word is a ﬁll.1: A WAH bit vector.128 bits 31-bit groups groups in hex WAH (hex) 1. The second line in Figure 4. they are represented as literal words (a 0 followed by the actual bit representation of the group).1 shows a WAH bit vector representing 128 bits. The second group contains a multiple of 31 0’s and therefore is represented as a ﬁll word (a 1. BBC stores the compressed data in Bytes while WAH stores it in Words. Since the ﬁrst and third groups do not contain greater than a multiple of 31 of a single bit value. each literal word stores 31 bits from the bitmap. This requirement is key to ensure that logical operations only access words. The last line shows the values of WAH words.21*1 001FFFFF 001FFFFF 9*1 000001FF 000001FF Figure 4.30*1 1.

Relationship Between Bitmap Index Size and Bitmap Query Execution Time. Using WAH compression.2: Query Execution Example over WAH compressed bit vectors. Therefore. In this paper we use WAH to compress the bitmaps due to the impact on query execution times. not shown. value. The ﬁll word from B1 is a zero-ﬁll word compressing 2 words.2. and it is determined that they are both literal words. First. Bit operations over the compressed WAH bitmap ﬁle have been shown to be faster than BBC (2-20 times) [57] while BBC gives better compression ratio. Logical operations are performed over the compressed bitmaps resulting in another compressed bitmap. two literal words and one ﬁll word. we have two ﬁll words. is needed to stores the number of useful bits in the active word. The ﬁll word 80000002 indicates a 0-ﬁll of two-word long (containing 62 consecutive zero bits). As an example. queries are executed by performing logical operations over the compressed bitmaps which result in another compressed bitmap.B1 B2 B1 AND B2 400003C0 00000100 00000100 80000002 C0000003 80000002 001FFFFF 00001F08 001FFFFF 000001FF 00000108 Figure 4. The fourth word is the active word. the ﬁrst word of each bitvector is decoded. and the ﬁll word from B2 is a one-ﬁll 103 . Another word with the value nine. The ﬁrst three words are regular words. When the next word is read from both bitmaps. consider the two WAH compressed bitvectors in Figure 4. followed by the number of ﬁlls in the last 30 bits). the logical AND operation is performed over the two words and the resultant word is stored as the ﬁrst word in the result bitmap. it stores the last few bits that could not be stored in a regular word.

word compressing 3 words. In this case we operate over 2 words simultaneously by ANDing the ﬁlls and appending a ﬁll word into the result bitmap that compresses two words (the minimum between the word counts). Since we still have one word left from B2, we only read a word from B1 this time, and AND it with the all 1 word from B2. Finally, the last two words are read from the two bitmaps and are ANDed together as they are literal words. A full description of the query execution process using WAH can be found in [59]. Query execution time is proportional to the size of the bitmap involved in answering the query [56]. To illustrate the eﬀect of bit density and consequently bitmap size on query times over compressed bitmaps we perform the following experiment. We generated 6 uniformly random bitmaps representing 2 million entries for a number of diﬀerent bit densities and ANDed the bitmaps together. Figure 4.3 shows average query time against the total bitmap length of the bitmaps used as operands in the series of operations. The expected size of a random bitmap with N bits compressed using WAH is given by the formula [59]: mR (d) ≈ N 1 − (1 − dB )2w−2 − d2w−2 w−1

where w is the word size, and d is the bit density. The lower the bit density, the more quickly and more eﬀectively intermediate results can be compressed as the number of AND operations increases. Figure 3 shows a clear relationship between query times and bitmap length. The diﬀerent slopes of the curve are due to changes in frequency of the type of bit operations performed over the word size chunks of the bitmap operands. At low bit densities, there is a greater portion of ﬁll word to ﬁll word operations. As the bit density increases, more ﬁll word to literal word operations occur. At high bit densities, literal word to literal 104

Figure 4.3: Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps, 2M entries

word operations become more frequent. As these inter-word operations have diﬀerent costs, the overall query time to bitmap length relationship is not linear. However it is clear that query execution time is related to overall bitmap length. Relationship Between Number of Bitmap Operations and Bitmap Query Execution Time. Figure 4.4 shows AND query times over WAH-compressed bitmaps as the number of bitmaps ANDed varies. Again the bitmaps are randomly generated with the reported bit densities for 2 million entries and the number of bitmaps involved in a AND query is varied. There is clear relationship between query times and the number of bitmap operations.

105

Figure 4.4: Query Execution Times vs Number of Attributes queried, Point Queries, 2M entries versus

Since query performance is related to the total bitmap length and number of bitmap operations, a logical question is ”can we modify the representation of the data in such a way that we can decrease these factors?” One could combine attributes and generate bitmaps over multiple attributes in order to accomplish this. This is because a combination over multiple attributes has a lower bit density and is potentially more compressible, and already has an operation that may be necessary for query execution built into its construction. However, several research questions arise in this scenario such as which attributes should be combined, how can the additional indexes be utilized, and what is the additional burden added by the richer index sets.

106

consider a bitmap index built over two attributes. potentially decreasing the size of a bitcolumn. Such a representation can have the eﬀect of reducing the size of the bitmaps that are used to answer a query and can also reduce the number of operations. Using WAH compression. Additionally. bitcolumns become sparser. As an illustrative example of enhancing compression from combining attributes. One could argue that multi-attribute bitmaps are reducing the dimensionality of the dataset but increasing the cardinality of the attributes involved in the new dataset. very 107 . and the opportunity for compressing that column increases. the advances in compression techniques and query processing over compressed bitmaps have made bitmaps an eﬀective solution for attributes with cardinalities on the order of several hundreds [58]. Although bitmap indexes have historically been discouraged for high-cardinality attributes. In the table. Figure 4. into one 2-attribute bitmap with cardinality 4.b2 means that the value of attribute 1 maps to b1 and the value for attribute 2 maps to b2.5 shows a simple example combining two single attributes with cardinality 2.Technical Motivation for Multi-Attribute Bitmaps. where the data is uniformly randomly distributed independently for each attribute for 5000 tuples. One could think of multi-attribute bitmaps as the AND bitmap of two or more bitmaps from diﬀerent attributes. we would expect 1 set bit for a value). if a bit operation is already incorporated in the multi-attribute bitmap construction. the combined column labeled b1. there are very few instances where there are long enough runs of 0’s to achieve any compression (since out of 10 random tuples. In such a test which matched this scenario. each with cardinality 10. then it does not need to be evaluated as part of the query execution. As attributes are combined.

As an alternative.ATT2=3). The time required is dependent on the total length of the bitmaps involved. The operations using the traditional bitmap processing to answer the query is ((ATT1=1) OR (ATT1=2)) AND (ATT2=3). We can expect much greater compression for this situation.b2 0 0 0 0 1 0 Figure 4. we can reduce 108 . we can generate a bitmap index for the combination of the two attributes.ATT2=3) OR (ATT1=2.5: Simple multiple attribute bitmap example showing the combination two single attribute bitmaps. for a total of 528 bytes. In this scenario. and attribute 2 must be equal to 3.b1 0 0 1 0 0 0 Combined b1. for a total of 2592 bytes.b2 b2. little compression was achieved and bitmaps representing each of the 20 attributevalue pairs were between 640 and 648 bytes long. the generated size of the 100 2-D bitmaps was between 220 and 400 bytes. For the multi-attribute case.b1 1 0 0 1 0 0 1 0 0 0 0 1 b2.Tuple t1 t2 t3 t4 t5 t6 Attribute 1 b1 b2 1 0 0 1 1 0 1 0 0 1 0 1 Attribute 2 b1 b2 0 1 1 0 1 0 0 1 0 1 1 0 b1. the query can be answered by (ATT1=1. Each of these bitmaps as well as the resultant intermediate bitmap from the OR operation is 648 bytes. In the given scenario. Now consider a range query such that attribute 1 must be between values 1 and 2. Now the cardinality of the combined attributes is 100. These bitmaps are 268 and 260 bytes respectively. and the expected number of set bits in a run of 100 tuples for a single value is 1.

where lower resolution bitmaps were used in addition to the single attribute bitmaps or high resolution bitmaps. Multi-attribute bitmaps would only be used in the queries that result in an improved query execution time. point queries would always beneﬁt from the multi-attribute bitmaps as long as all the combined attributes are queried.both the size of the bitmaps involved in the query execution and also the number of bitmaps involved. 109 . The primary overall tasks can be summarized as multi-attribute bitmap index selection and query execution. This can be performed with varying degrees of knowledge about a potential query set. then the single attribute bitmaps would perform better than the multi-attribute bitmaps. 4. Once candidate attribute combinations are determined. Supplementing single attribute bitmaps with additional bitmaps that improve query execution time has also been done in [50]. In general. One could think of this approach as storing the OR of several bitmaps from the same attribute together.3 Proposed Approach In this section we present our proposed solution for the problems presented in the previous section. the combined bitmaps are generated. However. This example demonstrates the opportunity to improve query execution by representing multiple attribute values from separate attributes in a bitmap. we determine which attributes are appropriate or the most appropriate to combine together. In index selection. if only some of the attributes were queried or a large range from one or more attributes were queried. This is the reason why we propose to supplement the single attribute bitmaps with the multi-attribute bitmaps and not to replace the single attribute bitmaps.

3.In query execution. that combination of attributes is a candidate for multi-attribute indexing. 4. The goal behind bitmap index selection is to ﬁnd sets of attributes that. but can eﬀectively prune the data space when combined together. If the form has categorical entries for color and model. Whenever multiple attributes are frequently queried together over a single bitmap value. we use 110 . the candidate pool of bitmaps are explored to determine a plan for query execution. Since query execution time is dependent both on the total size of the operands involved in a bitmap computation and also the number of bitmap operations. An eﬀective index selection technique for bitmap indexes should be cognizant of the relationships between attributes. can yield a large beneﬁt in terms of query execution time. The analogy within the traditional multi-attribute index selection domain would be ﬁnding attributes that are not individually eﬀective in pruning data space. the selectivity estimated for a given attribute combination should accurately reﬂect the true data characteristics. As an example. we provide a cost improvement based candidate evaluation system.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries In many cases the attributes that should be combined together as multi-attribute bitmaps are obvious given the context of the data and queries. In order to support situations where no contextual information is available about queries and in order to allow for comparative performance evaluation. if combined. then a multi-attribute bitmap built over these two attributes can directly ﬁlter those data objects that meet both search criteria without further processing. As well. consider a database system behind a car search web form.

is a list of Attribute Sets and corresponding goodness measures selected .attributes 18: include as in selected Algorithm 3: Selecting Attribute Combination Candidates.card*aj . cardLimit) notes: attSets . j = (maxAtt(asi )+1) to |A| 7: if (asi .these measures to determine the potential beneﬁt for a given attribute combination. 0] |ai ∈ A} 2: currentDim = 1 3: while (currentDim < dLimit AND newSetsAdded) 4: unset newSetsAdded 5: for each asi ∈ attSets. dLimit. Determining Index Selection Candidates SelectCombinationCandidates (Ordered set of attributes A. Algorithm 3 shows how candidate multi-attribute indexes are evaluated and selected in terms of potential beneﬁt.are a set of recommended attribute sets to build indexes over 1: attSets = {[{ai }. |asi | = currentDim 6: for each attribute aj . set newSetsAdded 11: currentDim + + 12: sort attSets by goodness 13: ca = ∅ 14: while ca = A 15: attribute set as = next set from attSets 16: if as ∩ ca = ∅ 17: ca = ca ∪ as.t.card¡cardLimit) 8: asj = [asi ∪ {aj }. s. aj )] 9: if goodness(asj ) > goodness(asi ) 10: add asj to attSets. detGoodness(asi . 111 .

the sets in attSets are sorted by their goodness measures (line 12). These include an array. The algorithm evaluates increasingly larger attribute sets (loop starting at line 3). A. The card value associated with an attribute set is the number of value combinations possible for the attribute set. and reﬂects the number of bitmaps that would have to be generated to index based on the set. Then sets of attributes that are disjoint from the covered attributes. each potentially beneﬁcial size 2 set is evaluated with relevant size 1 attributes (note that by evaluating only those singleton sets of attributes numbered larger than the largest in the current set. Once the loop terminates. Each unique combination is evaluated (lines 5-6). then the combination set is added to the list of attribute sets for further evaluation during the next loop iteration. If the number of bitmaps generated by this combination is excessive (line 7). The term newSetsAdded in line 3 is a boolean condition indicating if any attribute set was added during the last loop iteration. In the next iteration of the outer loop. Initially all size 1 attribute sets are combined with each other size 1 attribute set such that each unique size 2 attribute set is evaluated. of the attributes under analysis. The maximum cardinality allowed for a given prospective combination can be controlled by the cardLimit parameter. If the goodness of the combination exceeds the goodness of its ancestor. A goodness measure is evaluated for the proposed combination (line 8). 112 . The algorithm takes several parameters as input. The parameter dLimit provides a maximum attribute set combination size. we can evaluate all unique set instances).The algorithm is somewhat analogous to the Apriori method for frequent itemset mining in that a given candidate set is only evaluated if it’s subset is considered beneﬁcial. the set will not be evaluated.

bcj ) savings = bck . The variables. 113 .length 7: if savings > 0 8: goodness+ = savings ∗ cb. are greedily selected from the sorted sets and added to the recommended bitmap indexes.creationLength − cb. The probSR parameter is the probability that the query range for an attribute will be small enough such that a multi-attribute index over the query criteria will still be beneﬁcial. The probQT parameter represents the probability that an attribute is queried. a 2-attribute index is not as useful for a single attribute query.length + bcj . probQT and probSR describe global query characteristics and are used to estimate diminishing beneﬁts as index dimensionality increases. 1: 2: 3: 4: 5: 6: The goodness of a particular attribute set is computed by Algorithm 4.ca. attribute aj ) set probQT and probSR goodness=0 for each bitcolumn bck in asi for each bitcolumn bcj in aj bitcolumn cb = AND(bck . As with traditional multi-attribute index. Estimating Performance Gain for a Potential Index detGoodness (attribute set asi .matches 9: goodness∗ = (probQT ∗ probSR)|asi |+1 10: return goodness Algorithm 4: Determine goodness of including an attribute in a multi-attribute bitmap.length+ bck .

minus the compressed length of the combined bitmap. The index selection solution space is controlled by limiting parameters and through 114 . Our approach takes into account the characteristics of the data by weighting and measuring the eﬀectiveness of speciﬁc inter-attribute combinations. For size 1 attribute sets. Both of these factors can provide a performance gain depending on the query criteria. This strategy assumes that the query space is similar in distribution to the data space. Each combination is weighted by its frequency of occurrence. The savings of a particular combination is measured as the sum of the compressed lengths of the bitmap operands and the compressed length of the bitmaps needed to create the ﬁrst operand.Each unique value combination of the entire attribute set (combinations generated in lines 4-5) is evaluated in terms of bitmap length savings over a corresponding single attribute bitmap solution. the bitcolumn creation length is 0. The total savings reﬂects the diﬀerence in total bitmap operand length to answer the combination under analysis. This technique ﬁnds attributes for which their combinations will lead to a reduction in the number of operations as well as a reduction in the constituent bitmap lengths. This is because this option will not be selected during the query execution for this combination and will not incur an additional penalty. for a multiple attribute bitmap this value is the length of all the constituent bitmaps used to create it. If the savings for a particular combination is negative. However. The ﬁnal goodness measure for an attribute set is also modiﬁed based on the probability that beneﬁt will actually be realized for the query (line 9). then its negative savings is not accumulated (line 7). The goodness of an attribute set is measured as the weighted average of the bitlength savings of each value combination (line 8).

greedy selection of potential solutions. we need a mechanism to decide when single attribute and/or multi-attribute bitmaps would be used.3. the potential execution plans must be evaluated and the option that is deemed better should be performed. Once multi-attribute bitmaps are available in the system. Later experiments show that the reduction in the number of total operations has a more substantial eﬀect on query execution time than reduction of bitmap length. If the query is equally likely to occur in the data space. In order to take advantage of the most useful representation.2 Query Execution with Multi-Attribute Bitmaps One issue associated with using multiple representations for data indexes. Once bitmap combinations are recommended. To be eﬀective. the complexity added by the additional representations can not be more costly than the beneﬁts attained by incorporating them. Although the indexes recommended are not guaranteed to be optimal. (i.e. We take a number of measures in order to limit the 115 . is that the query execution becomes more complex. if index space is critical. each combined bitmap can be generated by assigning an appropriate identiﬁer to the bitmap and ANDing together the applicable single attribute bitmaps. then the weighting in line 8 of the determine Goodness algorithm (Algorithm 4) can be removed. we can order the pairs in line 12 of the Select Combination Candidates algorithm (Algorithm 3) by goodness over combined attribute size. seemingly less than ideal attribute combinations can still provide performance gains). The behavior of the algorithm can be aﬀected with slight alteration. they can be determined eﬃciently. 4. For example. we need to decide on a query execution plan for each query. In other words.

2. we compare the set representation of the attributes in the query to the sets in the index identiﬁcation structure.overhead associated with selecting the best query execution option. Those sets that have non-empty intersections with the query set are candidates for use in query execution. Bitmap Naming We need to be able to easily translate a query to the set of bitmaps that match the criteria of a query. and the values are given in the same order as the attributes. Global Index Identiﬁcation In order to identify the potential index search space. we will include a set identiﬁed as [1] in this structure. The key is made up of the attributes covered by the bitmap and the values for those attributes. We keep track of those attribute sets that have bitmap representations (either single attribute or multi-attribute). The attributes are always stored in numerical order so that there is only one possible mapping for various set orders. We use a unique mapping for bitmap identiﬁcation. If we have available a multi-attribute bitmap built over attributes 1. If we have available a single attribute bitmap for the ﬁrst attribute.3] in the structure. In order to do this we devise a one-to-one mapping that directly translates query criteria to bitmap coverage.10]5. and 3 then we include the set [1. Finding relevant bitmaps is a matter of translating the query into its key to access the bitmaps.2. we maintain a data structure of those attributes and combined attributes for which we have bitmap indexes constructed. Upon receiving a query. 116 . For example. a multiattribute bitmap over attributes 9 and 10 where the value for attribute 9 is 5 and the value for attribute 10 is 3 would get ”[9. We explore potential query execution options using inexpensive operations.3” as its key.

then the single attribute bitmaps should be used. we compute the total length of the bitmaps required to answer the query. However. if the total length of matching bitmaps for the multi-attribute bitmap over attributes [1. if the multi-attribute bitmaps have greater total length. Those indexes which cover an attribute involved in the query are candidates to be used as part of the solution for the query. which we can compute using the recursive function. For the recursive case. the unique key is built up and passed to the recursive call. So rather than incorporating logic that guarantees an optimal minimal total bitmap length. We order the 117 . For example. then the multi-attribute bitmap should be used to answer the query. we approximate this behavior using heuristics. We just need to sum the lengths of bitmaps that match the query criteria. Once candidate indexes are determined. The logic applied to determine the query execution plan needs to be lightweight so as to not add too much overhead to handle the increased complexity of bitmap representation.Query Execution During query execution we need a lightweight method to check which potential query execution will provide the fastest answer to the query. A recursive function is used to compute the total length of all matching bitmaps in order to handle arbitrary sized attribute sets.2] is less than the total length for both attributes [1] and [2]. and the callee length results are tabulated. The complete unique hash key is generated in the base case of the recursive method and the length of the matching bitmaps is obtained and returned to the calling caller. The ﬁrst step is to compare the set of attributes in the query to the sets of attributes for which we have constructed bitmap indexes.

and and over value 3 for attribute 2 where we have bitmaps available for the attribute sets [1].potential attribute sets in descending order based on the amount of length saved by using the index. The algorithm assumes that the bitmap indexes available are identiﬁed and that we can access any bitcolumn through a unique hashkey. [2]. In the ﬁrst step. we will have a total 118 . Those that have negative values. As a baseline. and will not be processed before their constituent single attribute indexes. all potential single attribute indexes get a value of 0 for the amount of length saved. the potential indexes are prioritized based on how promising they are with respect to saving total bitmap length during query execution. The second step performs the bit operations required to answer the query. Those multi-attribute indexes that have positive beneﬁt compared to single attribute indexes will have positive value. and will be processed before any single attribute index. would actually slow down the query. At the end of this section.1 shows the query matching keys for each of the potential indexes. For example. and [1. Table 4. In lines 1-5. Multi-attribute indexes get a value which is the diﬀerence between the length of the single attribute indexes that would be required to answer the query and the length of the multi-attribute index to answer the query. Algorithm 5 provides the algorithm for executing selection queries when the entire bitmap index has both single attribute and multi-attribute bitmap indexes available to process the query. It involves two major steps.2] combined. consider a query over values 1 and 2 for attribute 1. Then we process the query by greedily selecting non-covered attribute sets from this order until the query is answered. we compute the total size of the bitmaps that match the query considering those attributes that intersect between the query and the index attribute set currently under analysis.

subresult) 24: return queryResult Algorithm 5: Query Execution for Selection Queries when Combined Attribute Bitmaps are Available.numAtts == 1 8: as. with length and saved ﬁelds and query matching keys list BI . BitColumn Hashtable BI.keys[0]) 21: for i = 2 to |as.attributes 17: as = next attribute set from AI 18: if as. query q) notes: AI .attributes 20: subresult = BI.get(as.a bitcolumn of all 1’s // Prioritize Query Plan 1: for each attribute set as in AI 2: if as.keys| 22: subresult = OR(subresult.attributes ∩ ca = ∅ 19: ca = ca ∪ as.keys[i])) 23: queryResult = AND(queryResult.saved = 0 9: else 10: for each size 1 attribute subset a in as 11: as.BI.saved -= as.SelectionQueryExecution (Available Indexes AI.contains identiﬁcation of sets of indexed attributes.length 13: sort AI by saved ﬁeld // Execute Query 14: attribute set ca = null 15: queryResult = B1 16: while ca = q.length 12: as.provides access to bitcolumn using unique key B1 . 119 .saved += a.get(as.attributes ∩ q.attributes = ∅ 3: generate query matching keys for as 4: attach query matching keys to as 5: sum query matching bitmap lengths 6: for each attribute set as in AI 7: if as.

2]1.[1]2 [2]3 [1. Length of the resultant bitmap is then derived using eﬀective bit density. Those multi-attribute indexes that reﬂect a cost savings will get a positive value 120 . 2]2. Note that the computation of bitmap length in line 5 is not actually a simple sum. This does not adversely aﬀect query operation since the multi-attribute option would not be selected anyway. All single attribute indexes are assigned a saved value of 0. we sum up the query matching bitmap length associated with each single attribute subset of the multi-attribute attribute set. [1. The diﬀerence between this length and the total length of the query matching bitmaps for the multi-attribute set is the amount of bitmap length saved by using the multi-attribute set. If the bit density is within the range of about 0. For a multiple attribute index. Available Index [1] [2] [1. In lines 6-12.9. Since the matching bitmaps will not overlap. In these cases.query matching bitmap length for each available index. then its value can not be determined. we use this initial information in order to compute the total bitmap length saved by using a multi-attribute bitmap in relation to using the relevant single attribute alternatives.3 Table 4.3 .5. 2] Matching Keys [1]1.1 to 0. we conservatively estimate its value as 0. these bit densities are summed to arrive at an eﬀective bit density of the resultant OR bitmap.1: Sample Query Matching Keys for Query Attributes 1 and 2. The bit density of each matching bitmap is estimated based on its length.

is suitable when the update rate is not very high and we need to maintain the index up-to-date in an online manner. the query is complete and we return the result.3 Index Updates Bitmap indexes are typically used in write-once read-mostly databases such as data warehouses. In line 13. 121 . we will process a subquery using the index.3. If not. When we have covered all query attributes. 4. We get the next best index.18). Bitmap updates are expensive as all columns in the bitmap index need to be accessed. where ci is the number of bitmap values (or cardinality) i=1 for attribute Ai . In [12]. we initialize a set that indicates the attributes covered by the query so far and initialize the query result to all 1’s. and check if it’s attributes have already been covered (line 17.for saved. while those that are more costly will get a negative value. In lines 14 and 15. Then we start processing the query in order of how promising the prospective indexes were an continue until all query attributes are covered (line 15). In other words. the number of bitmaps that need to be updated when a new tuple is appended in a table with A attributes is A as opposed to ΣA ci . we sort the potential indexes in terms of this saved value. We AND together this result to build the subqueries into the overall query result. We update the covered attribute set and OR together all bitcolumns that match the queries relevant to the query and the index. Appends of new data are often done in batches and the whole index is rebuilt from scratch once the update is completed. This approach. which clearly outperforms the alternative. we propose the use of a padword to avoid accessing all bitmaps to insert a new tuple but only to update the bitmaps corresponding to the inserted values.

as only need to be accessed.and 11 follow Zipf distribution with s parameter set to 1.1 Experiments Experimental Setup Experiments were executing using a 1.4. 7-9 have cardinality 10.4 4. 1M records. The bitmap index selection.7. They are characterized as follows: • synthetic .Updating multi-attribute bitmaps is even more eﬃcient than updating traditional bitmaps. attributes 2. If multi-attributes bitmaps are also padded we only need to access the bitmaps corresponding to the combined values (attributes) that are being inserted. and 12 follow Zipf distribution with s parameter set to 2.6. attributes 1-3 have cardinality 2.10 follow uniform random distribution.87GHz HP Pavilion PC with 2GB RAM and Windows Vista operating system. A 2 bitmaps 4. Attributes 1.5.4.12 attributes. consider the previous table with A attributes where multi-attribute bitmaps are built over pair of attributes. • uniform . each with cardinality 10. 4-6 have cardinality 5. It has 12 attributes. Data Several data sets are used during experiments. For example. and 10-12 have cardinality 25. the cost of updating the multi-attribute bitmaps is half of the cost of updating the traditional bitmaps.8. 1M records. 122 . attributes 3. bitmap creation and query execution algorithms were implemented in Java.9.uniformly distributed random data set.

It has 12 attributes. This data set is made up of more than 2 million records. • Landsat . against our proposed approach in which both single and multi-attribute bitmaps are available. These points remain in the data set. where only single attribute bitmaps are available.a bin representation of High-Energy Physics data. We used the ﬁrst 1 million instances and 44 attributes. Department of Commerce) Census Bureau website. There are a total of 122 bitcolumns for this data set. This data set is not uniformly distributed. It contains over 275000 records. Queries Query sets were generated by selecting random points from the data and constructing point queries from those points. The cardinality per dimensional varies between 2 and 12. The discretized dataset comes from the UCI Machine Learning Repository. For those cases where range queries are used. The data set can be obtained from [47] • HEP .S. each cardinality 10. We present comparisons in terms of query execution time. The raw dataset is from the (U. so that it is guaranteed that there will be at least one match for each query. number of bitmaps involved in answering a 123 .a bin representation of satellite imagery data. Metrics Our goal is to compare performance of a system reﬂecting the current state of the art. The cardinality of these attributes varies from 2 to 16.a discretized version of the USCensus1990raw data set.• census . the range is expanded around a selected point so that the range is guaranteed to contain at least one match. The data set is 60 attributes.

we instead get 2 attribute recommendations. while it was 17. [2. After executing index selection with parameters probQT and probSR set to 0. and [3. 124 . no degradation at higher number of attributes).5. Attributes 10. The average query execution of 100 point queries using only single attribute bitmaps was 41. This solution makes sense in the context of the performance metric of weighted beneﬁt. We would expect more expected beneﬁt for combinations for which the number of value combinations approaches the cardinality limit.2 Experimental Results Index Selection We executed the index selection algorithm over the synthetic data set.e. because a sparser bitmap can be more eﬀectively compressed. We set the cardLimit parameter to 100 and set the query probability parameters to 1 (i. and the values are independent. because it’s value combination cardinality is 100. Query execution assumes the bitmap indexes are in main memory.query.6. We would also expect more expected beneﬁt from combining attributes that are independent. The time for using recommended index set includes an average overhead of 0.25ms to explore the richer set of options and select the best query execution plan.47ms.39ms using the recommended indexes. 11. and 12 do not combine with any others in this scenario because there are no free attributes that would be within cardinality constraints.4. The other combinations have greater inter-attribute dependency because they follow Zipf distributions.7]. Therefore set [1.4. The attribute sets that were selected to combine together were [1. Further performance enhancement would be possible during query execution if the required bitmaps need to be read from disk. The proposed set of indexes is [7.7] is selected ﬁrst. 4. and total bitmap size accessed to answer a query.8].9].5.4.8].

8 show the query execution time. Figures ?? and 4. Q1. Query Execution Controlled Query Types. is a point query on both attributes.6]. means that the query for that attribute is a point query. A range of 1. and total size of the bitmaps involved in answering the query. and Q3 selects one value from the ﬁrst attribute and ﬁve values from the second attribute. [4. respectively.12].2: Query Deﬁnitions for 2 Attributes with cardinality 10 and uniform random distribution.10]. We ran 6 query types over bitmaps built over 2 attributes with uniform random distribution and cardinality 10. [2.9]. number of bitmaps involved in answering each query type.2 and varied in the length of the range.Range Length Query A1 A2 Q1 1 1 Q2 1 2 Q3 1 5 Q4 2 5 Q5 2 5 Q6 5 5 Table 4. [5. [1. Each of these has a value combination cardinality greater than or equal to 50 and selection order is dependent on cardinality and then inter-attribute independence.11]. In these ﬁgures. Note that in this case. The ‘Only 1-Dim’ label refers to the case where only Single Attribute Bitmaps are available. The queries are detailed in Table 4. 5 is the worst case scenario. the ‘Only M-Dim’ label refers to the 125 . Any range bigger than ﬁve could take the complement of the non-queried range to answer the query. and [3. for example.

case where the queries are forced to use Multi-Attribute Bitmaps, and the ‘Proposed’ label refers to the case where the query execution plan is selected considering both indexes available. For queries Q1-Q4, the multi-attribute bitmaps are always selected, and they are signiﬁcantly faster than the single-attribute options. For queries Q5-Q6, if single and multi-attribute bitmaps are available, single attribute is always selected. However, the best is always slightly faster than the proposed technique due to the overhead of evaluating and selecting the best option. For these queries, the overhead was on the order of 0.2ms. The proposed technique usually selects the option that involves the shortest original bitmaps required to answer the query. The exception here is Q5. The single dimension index is used because it involves fewer bitmap operations and involves less total bitmap operand length over the set of operations that need to take place.

Figure 4.6: Query Execution Time for each query type.

126

Figure 4.7: Number of bitmaps involved in answering each query type.

Query Execution for Point Queries. Figure 4.9 demonstrates the potential query speedup associated with our proposed query processing technique when multiattribute bitmaps are available to answer queries. The single-D bar shows the average query execution time required to answer 100 point queries using traditional bitmap index query execution, where points are randomly selected from the data set (there is at least one query match). For the multi-attribute test, indexes were selected and generated with a maximum set size of 2. So a suite of bitcolumns representing 2attribute combinations is available for query execution as well the single attribute bitmaps. Each single attribute is represented in exactly one attribute pair. The results show signiﬁcant speedup in query execution for point queries. Point queries over the uniform, landsat, and hep datasets experience speedups of factors of 5.01, 3.23, and 3.17 respectively.

127

Figure 4.8: Total size of the bitmaps involved in answering each query type.

Similar results were achieved for the census data set. Indexes were selected without a dimensionality limit and with a cardinality limit of 100. The query execution time using only single attribute bitmaps was 83.01 ms while the query execution time was only 24.92 ms when both, single attribute and multi-attribute bitmaps were available. Query Execution Over Range Queries. Figure 4.10 shows the impact of using the single-attribute bitmaps instead of the proposed model as the queries trend from being pure point queries to gradually more range-like queries. Average query execution times over 100 queries are shown. The queries are constructed by making the matching range of an attribute either a single value, or half of the attribute cardinality. Whether an attribute is a point or a range is randomly determined. The x-axis shows the probability that a given pair of attributes in the query represents a range (e.g .1 means that there is 90% probability that 2 query attributes are both over single values). The graph shows signiﬁcant speedup for point queries. Although the 128

9: Query Execution Times. The lower portion of the bar shows the total size of the bitmaps for all of the single attribute indexes. Point Queries. the multi-attribute alternative maintains a performance advantage until nearly all the query attributes are over ranges.Figure 4. Traditional versus MultiAttribute Query Processing speedup ratio decreases as the queries become more range-like.11 shows bitmap index sizes for each data set. The entire bar represents the space required for both the single and 2-attribute bitmaps. Even though the multi-attribute 129 . The upper portion of the bar shows the size of the suite of 2-attribute bitmaps (each attribute represented exactly once in the suite). Index Size Figure 4. This is because there are still some cases where using the multiattribute bitmap is useful.

while its 2-attribute counterpart is represented by 3000 bitmaps (30 attribute pairs.10: Query Execution Times as Query Ranges Vary.Figure 4. Landsat Data bitmaps represent many more columns than the single attribute bitmaps(e. the single dimension landsat are covered by 600 bitcolumns. the compression associated with the combined bitmaps means that the space overhead for the multi-attribute representations is not as severe as may be expected.g. 130 . 100 bitmaps per pair)).

Figure 4.11: Single and Multi-Attribute Bitmap Index Size Comparison Bitmap Operations versus Bitmap Lengths Since query execution enhancement is derived from two sources. Table 4. Randomly selected attributes simply combines random pairs of attributes. reduction in bitmap operations and reduction in bitmap length. we tested query execution after performing index selection using diﬀerent strategies. For each listed strategy. a suite of size 2 attribute sets is determined diﬀerently.3 shows the query speedup for point queries using diﬀerent strategies to determine the attributes that should be combined together. The other 2 strategies use Algorithm 3 to generate goodness factors for each combination based on bitmap length saved and weighted by result density. it is valuable to know the relative impacts of each one. The speedup is measured over 100 random point queries taken from HEP data. To this end. The goodness using best greedily selects non-intersecting 131 .

we propose the use of multi-attribute bitmap indexes to speed up the query execution time of point and range queries over traditional bitmap indexes. Using Diﬀerent Attribute Combination Strategies pairs starting from the top of the list. HEP Data. Selecting Best 3.79 tributes 3 Algorithm 1 Selecting Worst 2.3: Query Speedup. It can also be tuned to avoid undesirable conditions such as an excessive number of bitmaps or excessive bitmap dimensionality. Utilizing our index selection 132 . while the other technique greedily selects non-overlapping pairs from the bottom. 4.5 Conclusion In this paper. We introduce a query execution model when both traditional and multi-attribute bitmaps are available to answer a query. The technique is driven by a performance metric that is tied to actual query execution time.67 Beneﬁcial Combinations Table 4.15 Combinations 2 Randomly Selected At2. We introduce an index selection technique in order to determine which attributes are appropriate to combine as multi-attribute bitmaps.Attribute Combination Strategy Query Speedup 1 Algorithm 1. The query execution takes advantage of the multi-attribute bitmaps when it is beneﬁcial to do so while introducing very little overhead for cases when such bitmaps are not useful. The results show that signiﬁcant speedup can be achieved simply by combining reasonable attributes together. It takes into account data distribution and attribute correlations. and that query execution can be further enhanced by careful selection of the indexes.

and query execution techniques. we were able to consistently achieve up to 5 times query execution speedup for point queries while introducing less than 1% overhead for those queries where multi-attribute indexing was not useful. 133 .

In order to maximize query performance. Since workloads can shift over time. the query processing follows a access structure traversal path that matches the function being optimized while respecting additional problem constraints. and the distribution and characteristics of the data itself. the possible query execution plans evaluated by a data management system are limited by available indexes. It considers both the query patterns and the data distribution to determine the beneﬁt of diﬀerent alternatives. the query processing and access structures should be tailored to match the characteristics of the application. 134 . The optimization query framework presented herein describes an I/O-optimal query processing technique to process any query within the broad class of queries that can be stated as a convex optimization query. While resolving queries. The online high-dimensional index recommendation system provides a cost-based technique to identify access structures that will be beneﬁcial for query processing over a query workload. a method to determine when the current index set is no longer eﬀective is also provided. query patterns over time. For such applications.CHAPTER 5 Conclusions Data retrieval for a given application is characterized by the query types.

By ensuring that the run-time traversal and processing of available access structures matches the characteristic of a given query and that the access structures available match the overall query workload and data patterns. Both the run-time query processing and design-time access methods are targeted as potential sources of query time improvement. 135 . The index selection task ﬁnds attribute combinations that lead to the greatest expected query execution improvement. The primary end goal of the research presented in this thesis is to improve query execution time.The multi-attribute bitmap indexes presents both query processing and index selection for the traditionally single attribute bitmap index. Query processing selects a query execution plan to reduce the overall number of operations. improved query execution time is achieved.

Kriegel.BIBLIOGRAPHY [1] S. pages 78–86. 1995. Marathe. 2004. and H. Oracle Corp. A. Sprugnoli. In Data Compression Conference. NY. Langley. 2001. and R. New York. and a M. [5] S. A. Koll´r. 2000. In VLDB. Convex Optimization. Keim. Keim. Nashua. Selection of relevant features and examples in machine learning. Vandenberghe. Adaptive Control Processes: A Guided Tour.-P. Cambridge University Press. AI. pages 28–39. [10] S. Bruno and S.. [2] G. B¨hm. Berchtold. Princeton University Press. [6] S. D. 1990. S. [7] A. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. Database Syst. and D. Narasayya. NH. Byte-aligned bitmap compression. pages 499–510. pages 1110–1121. Blum and P. Database tuning advisor for microsoft sql server 2005. The x-tree: An index structure for highdimensional data. 1996. PODS. In Proceedings of the International Conference on Very Large Data Bases. ACM Trans. 1961. 33(3):322–373. A cost model for query processing in high dimensional data spaces. [4] R. Keim. 1997. [9] C. R. Kriegel. and H. C. Bohm. Chaudhuri. Optimal selection of secondary indexes. P. Pinzani. Syamala. Boyd and L. Barucci. ACM Comput. USA. [11] N. [8] C. 1997. Surv. Chaudhuri. Bohm. 136 . 25(2):129–178. Bellman. Agrawal. R. L. V. Berchtold. A. [3] E. In VLDB. A cost model for nearest o neighbor search in high-dimensional data space. D. To tune or not to tune? a lightweight physical design alerter. Berchtold. IEEE Transactions on Software Engineering. Antoshenkov. 2004.. S. 2006.

2000. Ferhatosmanoglu. pages 367–378. Edelstein. D. NY. Adaptive and automated index selection in rdbms. Castelli. pages 146–155. Bergman. C. Agrawal. Data and Knowledge Engineering. Dogac. [20] H. Faster data warehouses. and A. USA. Narasayya. Autoadmin ’what-if’ index analysis utility. Roussopoulos. Chang. Navathe. An eﬃcient cost-driven index selection tool for microsoft SQL server. and R.[12] G. Faloutsos. and A. In SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD international conference on Management of data. Montevideo. Sellis. An automated index selection tool for oracle7: Maestro 7. Y. and S. R. [22] C. Analysis of object oriented spatial access methods.-L. Capara. [17] S. TUBITAK Software Research and Development Center. H. 1993. Technical Report LBNL/PUB-3161. [14] Y. J. Lotem. ACM SIGMOD. [18] R. Costa and S. 1994. In SSDBM. L. 1997. Vienna. Fagin. C. Austria. Optimal aggregation algorithms for middleware. E. Omiecinski. and T. Ikinci.-S. Smith. In XXVIII Latin-American Conference on Informatics (CLIE). Ferhatosmanoglu. In The VLDB Journal. D. Erisik. Lo. A. and M. New York. Blanken. Chang. Frank.-C. Lifschitz. and D. L. 1995. Vector approximation based indexing for non-uniform high dimensional data sets. 1998. Index self-tuning with agent-based databases. 1992. Li. pages 426–439. Information Week. [16] S. [13] A. Update conscious bitmap indices. Exact and approximate algorithms for the index selection problem in physical database design. IEEE Transactions on Knowledge and Data Engineering. Naor. Journal of Computer and System Sciences. 1987. 2002. [24] M. M. [19] A. [21] R. Maio. [23] H. Uruguay. 2003. T. 137 . Tuncel. Narasayya. A. E. The onion technique: indexing for linear optimization queries. Gibas. M. Chaudhuri and V. E. and N. Abbadi. V. ACM Press. Canahuate. In CIKM ’00. M. Fischetti. Choenni. Chaudhuri and V. 2007. [15] S. In Proceedings ACM SIG-MOD Conference. In International Conference on Extending Database Technologhy (EDBT). December 1995. and H. On the selection of secondary indexes in relational databases.

In Symposium on Large Spatial Databases. [30] J. [31] G. L. An introduction to variable and feature selection. In Proceedings of the Int. Dordrecht. pages 518–529. Muthukrishnan. In SIGMOD. P. Bernstein.[25] V. C. 4(2):235–247. Edinburgh. PhD thesis. On the selection of an optimal set of indexes. Lobo. Inﬂuence sets based on reverse nearest neighbor queries. [27] M. [29] I.-W. Geist. 2001. Boyd. 2000. Ye. Hjaltason and H. E. [36] R. John. IEEE Transactions on MultiMedia. S. Kohavi and G. [37] F. J.com/informix/corpinfo/.zines/whiteidx. UK. Conf. [28] C. The gc-tree: a high-dimensional index structure for similarity search in image databases. 1998. Sootland. In W. Han. pages 1–12. N. [34] M. Guang-Ho Cha. 05 2000. S. Inc. R. March 2000.. Saxton. Gionis. Stanford University. Pei. and I. Elissef. 2005. Ranking in spatial databases. Portugal. editors.informix. Koudas. ACM Press. on Very Large Data Bases. 30(2):170–231. and Y. Disciplined convex programming. Surv. http://www. The Department of Electrical Engineering. Yin. Chen. Mining frequent patterns without candidate generation.htm. June 2002. and V. and Y. Gaede and O. Kluwer (Nonconvex Optimization and its Applications series). 2003. [26] A. In International Database Engineering & Applications Symposium. [38] M. 1997. Korn and S. 2004. Motwani. AI. 2000 ACM SIGMOD Intl. Autonomous query-driven index tuning. pages 83–95. 1995. In SIGMOD Conference. Journal of Machine Learning Research. Ip. 138 . Indyk. and Y. Coimbra. [32] V. Conference on Management of Data. and R. Schallehn. and P. Gunther. [35] S. Raghavan. Naughton. Grant. PREFER: A system for the eﬃcient execution of multi-parametric ranked queries. Multidimensional access methods. J. Robust and Convex Optimization with Applications in Finance. Similarity searching in high dimensions via hashing. Samet. September 1999. 1983. [33] I. Kai-Uwe. Papakonstantinou. In press. ACM Comput. Hristidis. Wrappers for feature subset selection. A. Informix decision support indexing for the enterprise data warehouse. IEEE Transactions on Software Engineering. Guyon and A.

page 70. Ma. SIAM. May 2000. Texture features for browsing and retrieval of image data. 32(3):16. P.edu/databases/census1990/USCensus1990-desc. Schenkel. & Math. S. and R. volume 2.. S.[39] B. Mao. IEEE Transactions on Pattern Analysis and Machine Intelligence.. Nesterov and A.html. 1995.uci. Multidimensional indexing and query coordination for tertiary storage management. Manjunath and W. L. [49] A. Weikum. VLDB. USA. Database Syst. The Oﬃce of Science Data-Management Challenge. Sci. and F. D. [45] S.htm. Inst. Rotem. Interior-Point Polynomial Algorithms in Convex Programming. Nemirovsky. In SIGMOD. DOE Oﬃce of Advanced Scientiﬁc Computing Research. Nemirovsky. 139 . A general approach to polynomial-time algorithms design for convex programming. Bernardo. Hersch. Sim. [41] R. Nordberg. Repository. Top-k query evaluation with probabilistic guarantees. S. 1994. 1999. pages 21–30. USSR. [42] Y. In Indexing Goes a New Direction. 1999. Econ. J. The us census1990 data http://kdd. 2000. Y. 2002. March–May 2004.ece.ics. G. and R. Roussopoulos. Mount. [51] M. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. R. [48] N. 18(8):837–842. Han. [40] B.. P. USSR Acad. Manjunath. Theobald. Vincent. Airphoto dataset. 2004. [43] Y. Technical report. M.. PA 19101. H. on Mass Storage Systems and Technologies. Moscow. Kelly. ACM Trans. Nearest neighbor queries. Nesterov and A. and R. CLOSET: An eﬃcient algorithm for mining frequent closed itemsets.ucsb. In Proceedings of the 19th IEEE Symposium on Mass Storage Systems and Tenth Goddard Conf. M. [50] R. Centr. Ponce. Multi-resolution bitmap indexes for scientiﬁc data. 2007. Ramakrishna. Winslett. and A. Sinha and M. set. Indexing and selection of data items in huge data sets by constructing and accessing tag collections. 1988. [46] M. [44] J. http://vivaldi. August 1996. Shoshani. Pei. Vila. In Proceedings of the 11th International Conference on Scientiﬁc and Statistical Data(SSDBM). Philadelphia. [47] U. L.edu/Manjunath/research.

Blott. K. and S. 1998. and A. Shoshani. Shoshani. Wang. D. 2003. Index selection in relational databases. and A. In Proceedings of SSDBM. Valentin. C. In International Conference on Foundations on Data Organization (FODO). E. G. Otoo. W. In SIGMOD. Shoshani. Compressing bitmap indexes for faster search operations. Zuliani. [58] K.. and A.-J. and A. Lohman.-C. Chang. On the performance of bitmap indices for high cardinality attributes. E. Chang. Optimizing bitmap indexes with eﬃcient compression. Schek. H. Weber. 1985. [60] Z. Blott. S. In Proceedings International Conference Data Engineering. Db2 advisor: An optimizer smart enough to recommend its own indexes. Kyoto. [57] K.[52] G. In SSDBM. Wu. pages 194–205. 2006. Whang. March 2004. [59] K. Shoshani. E. H. 2002. ACM Transactions on Database Systems. [56] K. Wu. Schek. Boolean + ranking: Querying a database by k-constrained optimization. Skelley. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. and S. Using bitmap index for interactive exploration of large datasets. Wu. C.-J. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. M. Lawrence Berkeley National Laboratory. Weber. J. [55] K. In VLDB. 1998. Chen. Paper LBNL54673. Hwang. 2006. In Proceedings of the 24th International Conference on Very Large Databases. J. Zhang. Wu. Koegler. Japan. 2000. Otoo. and Y. [53] R. Otoo. Lang. Zilio. [54] R. M. and A. 140 .

- Read and print without ads
- Download to keep your version
- Edit, email or read offline

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

CANCEL

OK

You've been reading!

NO, THANKS

OK

scribd

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->