
IMPROVING QUERY PERFORMANCE THROUGH

APPLICATION-DRIVEN PROCESSING AND RETRIEVAL
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Michael Albert Gibas, B.S., M.S.
* * * * *
The Ohio State University
2008
Dissertation Committee:
Hakan Ferhatosmanoglu, Adviser
Hui Fang
Atanas Rountev
Approved by
Adviser
Graduate Program in
Computer Science and
Engineering
ABSTRACT
The proliferation of massive data sets across many domains and the need to gain
meaningful insights from these data sets highlight the need for advanced data re-
trieval techniques. Because I/O cost dominates the time required to answer a query,
sequentially scanning the data and evaluating each data object against query crite-
ria is not an effective option for large data sets. An effective solution should require
reading as small a subset of the data as possible and should be able to address general
query types. Access structures built over single attributes also may not be effective
because 1) they may not yield performance that is comparable to that achievable by
an access structure that prunes results over multiple attributes simultaneously and 2)
they may not be appropriate for queries with results dependent on functions involv-
ing multiple attributes. Indexing a large number of dimensions is also not effective,
because either too many subspaces must be explored or the index structure becomes
too sparse at high dimensionalities. The key is to find solutions that allow for much of
the search space to be pruned while avoiding this ‘curse of dimensionality’. This thesis
pursues query performance enhancement using two primary means: 1) processing
the query effectively based on the characteristics of the query itself, and 2) physically
organizing access to data based on query patterns and data characteristics. Query
performance enhancements are described in the context of several novel applications,
including 1) Optimization Queries, which presents an I/O-optimal technique to an-
swer queries when the objective is to maximize or minimize some function over the
data attributes, 2) High-Dimensional Index Selection, which offers a cost-based ap-
proach to recommend a set of low dimensional indexes to effectively address a set
of queries, and 3) Multi-Attribute Bitmap Indexes, which describes extensions to a
traditionally single-attribute query processing and access structure framework that
enables improved query performance.
ACKNOWLEDGMENTS
First of all, I would like to thank my wife, Moni Wood, for her love and support.
She has made many sacrifices and expended much effort to help make this dream a
reality. I would not have been able to complete this dissertation without her.
I also would not have been able to accomplish this without the help of my family.
I thank my mother Marilyn, my sister Susan and my grandmother Margaret for their
support. I thank my late father Murray for providing an example to which I have
always aspired. Moni’s family was also understanding and supportive of the travails
of Ph.D research.
I want to express my appreciation to my adviser, Hakan Ferhatosmanoglu, for
his guidance and encouragement over the years. He has been generous with his
time, effort, and advice during my graduate studies. I am thankful for the technical
discussions we have had and the opportunities he has given me to work on interesting
projects.
Many people in the department have helped me to get to this point. I specifically
want to thank my dissertation committee members, Dr. Atanas Rountev and Dr.
Hui Fang, for their efforts and valuable comments on my research. I wish to thank
Dr. Bruce Weide for his help and support during my studies, Dr. Rountev for my
early exposure to research and paper writing, and Dr. Jeff Daniels for giving me the
opportunity to work on an interdisciplinary geo-physics project.
Many friends and members of the Database Research Group provided valuable
guidance, discussions, and critiques over the past several years. I want to especially
thank Guadalupe Canahuate for her help and friendship during this time. Also thanks
to her family for providing enough Dominican coffee to get through all the paper
deadlines.
My Ph.D. research was supported by NSF grant IIS-0546713.
VITA
May 10, 1969 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Omaha, Nebraska
1993 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Electrical Engineering,
Ohio State University, Columbus Ohio
2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Computer Science and Engineer-
ing,
Ohio State University, Columbus Ohio
1989-1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Electrical Engineering Co-op,
British Petroleum,
Various Locations
1994-2002 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Project Manager,
Acquisition Logistics Engineering,
Worthington, Ohio
2003-present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Assistant,
Office Of Research,
Department of Computer Science and
Engineering,
Department of Geology
Ohio State University,
Columbus, Ohio
PUBLICATIONS
Research Publications
Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Online Index Recom-
mendations for High-Dimensional Databases Using Query Workloads. IEEE Trans-
actions on Knowledge and Data Engineering (TKDE), February 2008 (Vol.
20, no.2), pp. 246-260
Michael Gibas, Hakan Ferhatosmanoglu, High-Dimensional Indexing, Encyclopedia of
Geographical Information Science, (E-GIS)
Michael Gibas, Ning Zheng, and Hakan Ferhatosmanoglu, A General Framework for
Modeling and Processing Optimization Queries, 33rd International Conference on
Very Large Data Bases (VLDB 07), Vienna, Austria, September, 2007, pp. 1069-
1080
Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu, Update Conscious
Bitmap Indexes. International Conference on Scientific and Statistical Database
Management (SSDBM), Banff, Canada, 2007/July
Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu, Indexing Incomplete
Databases, International Conference on Extending Database Technology (EDBT),
Munich, Germany, 2006/March, pp. 884-901
Michael Gibas, Developing Indices for High Dimensional Databases Using Query Ac-
cess Patterns, Masters Thesis, June 2004
Atanas Rountev, Scott Kagan, and Michael Gibas, Static and Dynamic Analysis of
Call Chains in Java, ACM SIGSOFT International Symposium on Software Testing
and Analysis (ISSTA’04), pages 1-11, July 2004
Fatih Altiparmak, Michael Gibas, Hakan Ferhatosmanoglu, Relationship Preserving
Feature Selection for Unlabelled Time-Series, Ohio State Technical Report, OSU-
CISRC-10/07-TR74, 14 pp.
Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Indexes for Databases
with Missing Data, Ohio State Technical Report, OSU-CISRC-4/06-TR45, 15 pp.
Guadalupe Canahuate, Tan Apaydin, Michael Gibas, Hakan Ferhatosmanoglu, Fast
Semantic-based Similarity Searches in Text Repositories, Conference on Using Meta-
data to Manage Unstructured Text, LexisNexis, Dayton, OH, 2005/October
Atanas Rountev, Scott Kagan, and Michael Gibas, Evaluating the Imprecision of
Static Analysis, ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Soft-
ware Tools and Engineering (PASTE’04), pages 14-16, June 2004
FIELDS OF STUDY
Major Field: Computer Science and Engineering
TABLE OF CONTENTS
Page
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapters:
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overall Technical Motivation . . . . . . . . . . . . . . . . . . . . . 1
1.2 Novel Database Management Applications . . . . . . . . . . . . . . 2
2. Modeling and Processing Optimization Queries . . . . . . . . . . . . . . 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Query Model Overview . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Background Definitions . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Common Query Types Cast into Optimization Query Model 13
2.3.3 Optimization Query Applications . . . . . . . . . . . . . . . 15
2.3.4 Approach Overview . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Query Processing Framework . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Query Processing over Hierarchical Access Structures . . . . 21
2.4.2 Query Processing over Non-hierarchical Access Structures . 26
2.4.3 Non-Covered Access Structures . . . . . . . . . . . . . . . . 28
2.5 I/O Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3. Online Index Recommendations for High-Dimensional Databases using
Query Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.1 High-Dimensional Indexing . . . . . . . . . . . . . . . . . . 54
3.2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.3 Index Selection . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.4 Automatic Index Selection . . . . . . . . . . . . . . . . . . . 57
3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.2 Approach Overview . . . . . . . . . . . . . . . . . . . . . . 59
3.3.3 Proposed Solution for Index Selection . . . . . . . . . . . . 61
3.3.4 Proposed Solution for Online Index Selection . . . . . . . . 71
3.3.5 System Enhancements . . . . . . . . . . . . . . . . . . . . . 76
3.4 Empirical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 78
3.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 80
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4. Multi-Attribute Bitmap Indexes . . . . . . . . . . . . . . . . . . . . . . . 98
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries 110
4.3.2 Query Execution with Multi-Attribute Bitmaps . . . . . . . 115
4.3.3 Index Updates . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 121
4.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 123
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
LIST OF FIGURES
Figure Page
2.1 Sample Optimization Objective Function and Constraints F(θ, A) : (x − 1)^2 + (y − 2)^2,
o(min), g_1(α, A) : y − 6 ≤ 0, g_2(α, A) : −y + 3 ≤ 0, h_1(β, A) : x − 5 = 0, k = 1 . . . . . . . 12
2.2 Optimization Query System Functional Block Diagram . . . . . . . . 20
2.3 Example of Hierarchical Query Processing . . . . . . . . . . . . . . . 25
2.4 Cases of MBRs with respect to R and optimal objective function contour 30
2.5 Cases of partitions with respect to R and optimal objective function
contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,
100k pts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,
100k pts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts 38
2.9 Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree,
8-D Color Histogram, 100k pts . . . . . . . . . . . . . . . . . . . . . . 40
2.10 Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform
Random, 50k Points . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.11 Pages Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram,
100k Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Index Selection Flowchart . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Dynamic Index Analysis Framework . . . . . . . . . . . . . . . . . . . 73
3.3 Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin,
Naive, and the Proposed Index Selection Technique using Relaxed Con-
straints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.4 Baseline Comparative Cost of Adaptive versus Static Indexing, random
Dataset, synthetic Query Workload . . . . . . . . . . . . . . . . . . . 85
3.5 Baseline Comparative Cost of Adaptive versus Static Indexing, random
Dataset, clinical Query Workload . . . . . . . . . . . . . . . . . . . . 86
3.6 Baseline Comparative Cost of Adaptive versus Static Indexing, stock
Dataset, hr Query Workload . . . . . . . . . . . . . . . . . . . . . . . 87
3.7 Comparative Cost of Online Indexing as Support Changes, random
Dataset, synthetic Query Workload . . . . . . . . . . . . . . . . . . . 89
3.8 Comparative Cost of Online Indexing as Support Changes, stock Dataset,
hr Query Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.9 Comparative Cost of Online Indexing as Major Change Threshold
Changes, random Dataset, synthetic Query Workload . . . . . . . . . 92
3.10 Comparative Cost of Online Indexing as Major Change Threshold
Changes, stock Dataset, hr Query Workload . . . . . . . . . . . . . . 93
3.11 Comparative Cost of Online Indexing as Multi-Dimensional Histogram
Size Changes, random Dataset, synthetic Query Workload . . . . . . 94
4.1 A WAH bit vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2 Query Execution Example over WAH compressed bit vectors. . . . . 103
4.3 Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps,
2M entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4 Query Execution Times vs Number of Attributes queried, Point Queries,
2M entries versus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.5 Simple multiple attribute bitmap example showing the combination of
two single attribute bitmaps. . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.6 Query Execution Time for each query type. . . . . . . . . . . . . . . 125
4.7 Number of bitmaps involved in answering each query type. . . . . . . 126
4.8 Total size of the bitmaps involved in answering each query type. . . . 127
4.9 Query Execution Times, Point Queries, Traditional versus Multi-Attribute
Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.10 Query Execution Times as Query Ranges Vary, Landsat Data . . . . 129
4.11 Single and Multi-Attribute Bitmap Index Size Comparison . . . . . . 130
CHAPTER 1
Introduction
If Edison had a needle to find in a haystack, he would proceed at once
with the diligence of the bee to examine straw after straw until he found
the object of his search... I was a sorry witness of such doings, knowing
that a little theory and calculation would have saved him ninety per cent
of his labor. - Nikola Tesla, 1931
In order to make better use of the massive quantities of data being generated by
and for modern applications, it is necessary to find information that is in some respect
interesting within the data. Two primary challenges of this task are specifying the
meaning of interesting within the context of an application and searching the data
set to find the interesting data efficiently. In the case of our proverbial haystack, we
can think of each piece of straw as a data object and needles as data objects that
are interesting. While we could guarantee that we find the most interesting needle by
examining each and every piece of straw, we would be much better served by limiting
our search to a small region within the haystack.
1.1 Overall Technical Motivation
Analogous to our physical haystack example, the query response time during data
exploration tasks using a database system is affected both by the physical organization
of the data and by the way in which the data is accessed. As such, query performance
can be enhanced by traversing the data more effectively during query resolution or
by organizing access to the data so that it is more suitable for the query. Depending
on the application, queries can vary widely on a per query basis. Therefore, it is
highly desirable to process generic queries at run-time in a way that is effective for
each particular query. Physically reorganizing data or data access structures is time
consuming and therefore should be performed for a set of queries rather than by
trying to optimize an organization for a single query.
This thesis targets query performance enhancement through two sources: 1) run-time
query processing such that the search paths and intermediate operations during
query resolution are appropriate for a given generic query, and 2) data organization such
that the access structures available during query resolution are appropriate for the
overall query patterns and data distribution and allow for significant data space prun-
ing during query resolution. Query performance enhancement is explored using three
introduced applications, motivated and described in the next section.
1.2 Novel Database Management Applications
Optimization Queries. Exploration of image, scientific, and business data usu-
ally requires iterative analyses that involve sequences of queries with varying parame-
ters. Besides traditional queries, similarity-based analyses and complex mathematical
and statistical operators are used to query such data [41]. Many of these data ex-
ploration queries can be cast as optimization queries where a linear or non-linear
function over object attributes is minimized or maximized (our needle). Addition-
ally, the object attributes queried, function parametric values, and data constraints
can vary on a per-query basis. Because scientific data exploration and business deci-
sion support systems rely heavily on function optimization tasks, and because such
tasks encompass such disparate query types with varying parameters, these applica-
tion domains can benefit from a general model-based optimization framework. Such
a framework should meet the following requirements: (i) have the expressive power
to support solution of each query type, (ii) take advantage of query constraints to
prune search space, and (iii) maintain efficient performance when the scoring criteria
or optimization function changes.
A general optimization model and query processing framework are proposed for
optimization queries, an important and significant subset of data exploration tasks.
The model covers those queries where some function is being minimized or maximized
over object attributes, possibly within a set of constraints. The query processing
framework can be used in those cases where the function being optimized and the
set of constraints is convex. For access structures where the underlying partitions
are convex, the query processing framework is used to access the most promising
partitions and data objects and is proven to be I/O-optimal.
High-Dimensional Index Selection. With respect to the physical organization
of the data, access structures or indexes that cover query attributes can be used
to prune much of the data space and greatly reduce the required number of page
accesses to answer the query. However, due to the oft-cited curse of dimensionality,
indexes built over many dimensions actually perform worse than sequential scan. This
problem can be addressed by using sets of lower dimensional indexes that effectively
answer a large portion of the incoming queries. This approach can work well if the
characteristics of future queries are well known, but in general this is not the case,
and an index set built based on historical queries may be completely ineffective for
new queries.
In the case of physical organization, query performance is affected by creating a
number of low dimension indexes using the historical and incoming query workloads,
and data statistics. Historical query patterns are data mined to find those attribute
sets that are frequently queried together. Potential indexes are evaluated to derive
an initial index set recommendation that is effective in reducing query cost for a large
portion of queries such that the set of indexes is within some indexing constraint.
A set of indexes is obtained that in effect reduces the dimensionality of the query
search space, and is effective in answering queries. A control feedback mechanism
is incorporated so that the performance of newly arriving queries can be monitored,
and if the current index set is no longer effective for the current query patterns, the
index set can be changed. This flexible framework can be tuned to maximize indexing
accuracy or indexing speed, which allows it to be used both for the initial static
index recommendation and for online dynamic index recommendation.
Multi-Attribute Bitmap Indexes. The nature of the database management
system used and the way in which data is physically accessed also affect query execu-
tion times. Bitmap indexes have been shown to be effective as a data management
system for many domains such as data warehousing and scientific applications. This
owes to the fact that they are a column-based storage structure, so only the data
needed to address a query is ever retrieved from disk, and that their base operations
are performed using hardware-supported fast bitwise operations. Static bitmap in-
dexes have traditionally been built over each attribute individually. The query model for
selection queries is simple and involves ORing together bitcolumns that match query
criteria within an attribute and ANDing inter-attribute intermediate results. As a re-
sult, the bitmap index structure is limited to those domains that it is naturally suited
for without change. As well, query resolution does not take advantage of potential
multi-attribute pruning and possible reductions in the number of binary operations.
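As a small illustration of this query model (an added sketch, not code from the dissertation; real systems operate on WAH-compressed bit vectors rather than plain integers), Python integers can stand in for uncompressed bitcolumns: bins within an attribute are ORed, and the per-attribute results are ANDed.

    # Uncompressed bitcolumns over 8 rows (bit i set means row i falls in that bin).
    age_20s  = 0b10010001
    age_30s  = 0b01000110
    state_oh = 0b11000011

    # Query: (age in 20s OR age in 30s) AND (state = OH).
    matches = (age_20s | age_30s) & state_oh
    print(format(matches, '08b'))  # rows whose bits remain set satisfy the query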
Multi-attribute bitmap indexes are introduced to extend the functionality and
applicability of traditional bitmap indexes. For many queries, query execution times
are improved by reducing the size of the bitmaps used in query execution and by reducing the
number of binary operations required to resolve the query without excessive overhead
for queries that do not benefit from multi-attribute pruning. A query processing
model where multi-attribute bitmaps are available for query resolution is presented.
Methods to determine attribute combination candidates are provided.
CHAPTER 2
Modeling and Processing Optimization Queries
The most exciting phrase to hear in science, the one that heralds new
discoveries, is not ‘Eureka!’ (I found it!) but ‘That’s funny ...’ - Isaac
Asimov
One challenge associated with data management is determining a cost-effective
way to process a given query. Given the time expense of I/O, minimizing the I/O
required to answer a query is a primary strategy in query optimization. Scientific
discovery and business intelligence activities are frequently exploring for data objects
that are in some way special or unexpected. Such data exploration can be performed
by iteratively finding those data objects that minimize or maximize some function over
the data attributes. While the answer to a given query can be found by computing
the function for all data objects and finding those objects that yield the most ex-
treme values, this approach may not be feasible for large data sets. The optimization
query framework presented in this chapter provides a method to generically handle
an important broad class of queries, while ensuring that, given an access structure,
the query is I/O-optimally processed over that structure.
2.1 Introduction
Motivation and Goal Optimization queries form an important part of real-
world database system utilization, especially for information retrieval and data anal-
ysis. Exploration of image, scientific, and business data usually requires iterative
analyses that involve sequences of queries with varying parameters. Besides tradi-
tional queries, similarity-based analyses and complex mathematical and statistical
operators are used to query such data [41]. Many of these data exploration queries
can be cast as optimization queries where a linear or non-linear function over object
attributes is minimized or maximized. Additionally, the object attributes queried,
function parametric values, and data constraints can vary on a per-query basis.
There has been a rich set of literature on processing specific types of queries, such
as the query processing algorithms for Nearest Neighbors (NN) [48] and its variants
[37], top-k queries [21, 51], etc. These queries can all be considered to be a type of
optimization query with different objective functions. However, they either only deal
with specific function forms or require strict properties for the query functions.
Because scientific data exploration and business decision support systems rely
heavily on function optimization tasks, and because such tasks encompass such dis-
parate query types with varying parameters, these application domains can benefit
from a general model-based optimization framework. Such a framework should meet
the following requirements: (i) have the expressive power to support existing query
types and the power to formulate new query types, (ii) take advantage of query con-
straints to proactively prune search space, and (iii) maintain efficient performance
when the scoring criteria or optimization function changes.
Despite an extensive list of techniques designed specifically for different types
of optimization queries, a unified query technique for a general optimization model
has not been addressed in the literature. Furthermore, application of specialized
techniques requires that the user know of, possess, and apply the appropriate tools
for the specific query type. We propose a generic optimization model to define a wide
variety of query types that includes simple aggregates, range and similarity queries,
and complex user-defined functional analysis. Since the queries we are considering
do not have a specific objective function but only have a “model” for the objective
(and constraints) with parameters that vary for different users/queries/iterations, we
refer to them as model-based optimization queries. In conjunction with a technique to
model optimization query types, we also propose a query processing technique that
optimally accesses data and space partitioning structures to answer the query. By
applying this single optimization query model with I/O-optimal query processing,
we can provide the user with a flexible and powerful method to define and execute
arbitrary optimization queries efficiently. Example optimization query applications
are provided in Section 2.3.3.
Contributions The primary contributions of this work are listed as follows.
• We propose a general framework to execute convex model-based optimization
queries (using linear and non-linear functions), for which additional con-
straints over the attributes can be specified without compromising query accu-
racy or performance.
• We define query processing algorithms for convex model-based optimization
queries utilizing data/space partitioning index structures without changing the
index structure properties and the ability to efficiently address the query types
for which the structures were designed.
• We prove the I/O optimality for these types of queries for data and space
partitioning techniques when the objective function is convex and the feasible
region defined by the constraints is convex (these queries are defined as CP
queries in this paper).
• We introduce a generic model and query processing framework that optimally
addresses existing optimization query types studied in the research literature as
well as new types of queries with important applications.
2.2 Related Work
There are few published techniques that cover some instances of model-based
optimization queries. One is the “Onion technique” [14], which deals with an un-
constrained and linear query model. An example linear model query selects the best
school by ranking the schools according to some linear function of the attributes of
each school. The solution followed in the Onion technique is based on constructing
convex hulls over the data set. It is infeasible for high dimensional data because the
computation of convex hulls is exponential with respect to the number of dimensions
and is extremely slow. It is also not applicable to constrained linear queries because
convex hulls need to be recomputed for each query with a different set of constraints.
The computation of convex hulls over the whole data set is expensive and in the case
of high-dimensional data, infeasible in design time.
The Prefer technique [32] is used for unconstrained and linear top-k query types.
The access structure is a sorted list of records for an arbitrary linear function. Query
processing is performed for some new linear preference function by scanning the sorted
list until a record is reached such that no record below it could match the new query.
This avoids the costly build time associated with the Onion technique, but does not
provide a guarantee on minimum I/O, and still does not incorporate constraints.
Boolean + Ranking [60] offers a technique to query a database to optimize boolean
constrained ranking functions. However, it is limited to boolean rather than general
convex constraints, addresses only single dimension access structures (i.e. can not
optimize on multiple dimensions simultaneously), and therefore largely uses heuristics
to determine a cost effective search strategy.
While the Onion and Prefer techniques organize data to efficiently answer a specific
type of query (LP), our technique organizes data retrieval to answer more general
queries. In contrast with the other techniques listed, our proposed solution is proven
to be I/O-optimal for both hierarchical and space partitioned access structures with
convex partition boundaries, covers a broader spectrum of optimization queries, and
requires no additional storage space.
2.3 Query Model Overview
2.3.1 Background Definitions
Let Re(a_1, a_2, . . . , a_n) be a relation with attributes a_1, a_2, . . . , a_n. Without loss
of generality, assume all attributes have the same domain. Denote by A a subset
of attributes of Re. Let g_i(α, A), 1 ≤ i ≤ u, h_j(β, A), 1 ≤ j ≤ v, and F(θ, A)
be certain functions over attributes in A, where u and v are positive integers and
θ = (θ_1, θ_2, . . . , θ_m) is a vector of parameters for function F, and α and β are vector
parameters for the constraint functions.
Definition 1 (Model-Based Optimization Query) Given a relation Re(a_1, a_2, . . . , a_n),
a model-based optimization query Q is given by (i) an objective function F(θ, A), with
optimization objective o (min or max), (ii) a set of constraints (possibly empty) spec-
ified by inequality constraints g_i(α, A) ≤ 0, 1 ≤ i ≤ u (if not empty) and equality
constraints h_j(β, A) = 0, 1 ≤ j ≤ v (if not empty), and (iii) user/query adjustable
objective parameters θ, constraint parameters α and β, and answer set size integer k.
Figure 2.1 shows a sample objective function with both inequality and equality
constraints. The objective function finds the nearest neighbor to point a = (1, 2). The
constraints limit the feasible region to the segment of the line x = 5 where 3 ≤ y ≤ 6.
The points b and d represent the best and worst points within the constraints for
the objective function, respectively (in continuous space the best feasible point is (5, 3),
with objective value (5 − 1)^2 + (3 − 2)^2 = 17). Point c could be the actual point in the database
that meets the constraints and minimizes the objective function.
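As a concrete illustration (an added sketch, not part of the original system), the Figure 2.1 query can be handed directly to an off-the-shelf convex solver. The snippet below assumes the third-party cvxpy package and solves the relaxed problem in continuous space; it should report the optimum (5, 3) with objective value 17.

    import cvxpy as cp

    # Candidate point (x, y) in continuous space.
    x = cp.Variable()
    y = cp.Variable()

    # Objective F(theta, A): squared distance to the query point (1, 2), minimized.
    objective = cp.Minimize((x - 1) ** 2 + (y - 2) ** 2)

    # Constraints g_1, g_2 (inequalities) and h_1 (equality) from Figure 2.1.
    constraints = [y - 6 <= 0, -y + 3 <= 0, x - 5 == 0]

    prob = cp.Problem(objective, constraints)
    prob.solve()
    print(prob.status, prob.value, x.value, y.value)  # expect roughly: optimal 17.0 5.0 3.0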
The answer of a model-based optimization query is a set of tuples with maxi-
mum cardinality k satisfying all the constraints such that there exists no other tuple
with smaller (if minimization) or larger (if maximization) objective function value
F satisfying all constraints as well. This definition is very general, and almost any
type of query can be considered as a special case of model-based optimization query.
For instance, NN queries over an attribute set A can be considered as model-based
optimization queries with F(

θ, A) as the distance function (e.g., Euclidean) and the
optimization objective is minimization. Similarly, top-k queries, weighted NN queries
and linear/non-linear optimization queries can all be considered as specific model-
based optimization queries. Without loss of generality, we will consider minimization
as the objective optimization throughout the rest of this chapter.
Figure 2.1: Sample Optimization Objective Function and Constraints F(θ, A) : (x − 1)^2 + (y − 2)^2,
o(min), g_1(α, A) : y − 6 ≤ 0, g_2(α, A) : −y + 3 ≤ 0, h_1(β, A) : x − 5 = 0, k = 1
Definition 2 (Convex Optimization (CP) Query) A model-based optimization
query Q is a Convex Optimization (CP) query if (i) F(θ, A) is convex, (ii) g_i(α, A),
1 ≤ i ≤ u (if not empty) are convex, and (iii) h_j(β, A), 1 ≤ j ≤ v (if not empty) are linear
or affine.
Notice that the definition of a CP query does not have any assumptions on the
specific form of the objective function. The only assumptions are that the queries can
be formulated into convex functions and the feasible regions defined by the constraints
of the queries are convex (e.g., polyhedron regions). Therefore, users can ask any form
of queries with any coefficients as long as these assumptions hold. The conditions (i),
(ii) and (iii) form a well known type of problem in Operations Research literature -
Convex Optimization problems. A twice-differentiable function is convex if its Hessian
matrix is positive semidefinite. More intuitively, if one travels in a straight
line from inside a convex region to outside the region, it is not possible to re-enter
the convex region.
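For reference, the standard formal statement (added here for clarity; it is not in the original text) is: a function f is convex if, for all points x, y in its domain and all λ ∈ [0, 1],

    f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y),

and a region is convex if it contains the entire line segment between any two of its points.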
2.3.2 Common Query Types Cast into Optimization Query
Model
Although appearing to be restricted in functions F, g_i, and h_j, the set of CP prob-
lems is a superset of all least squares problems, linear programming problems (LP),
and convex quadratic programming problems (QP) [27]. Therefore, they cover a wide
variety of linear and non-linear queries, including NN queries, top-k queries, linear
optimization queries, and angular similarity queries. The formulations of some com-
mon query types as CP optimization queries are listed and discussed below.
Euclidean Weighted Nearest Neighbor Queries - The Weighted Nearest Neigh-
bor query asking NN for a point (a_1^0, a_2^0, . . . , a_n^0) can be considered as an optimization
query asking for a point (a_1, a_2, . . . , a_n) such that an objective function is minimized.
For Euclidean WNN the objective function is

WNN(a_1, a_2, . . . , a_n) = √( w_1(a_1 − a_1^0)^2 + w_2(a_2 − a_2^0)^2 + . . . + w_n(a_n − a_n^0)^2 ),

where w_1, w_2, . . . , w_n > 0 are the weights and can be different for each query. Tradi-
tional Nearest Neighbor queries are the subset of weighted nearest neighbor queries
where the weights are all equal to 1.
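As an illustrative sketch (hypothetical helper, not code from the dissertation), such a weighted objective can be built once and re-parameterized per query with new weights w and query point q:

    import numpy as np

    def wnn_objective(w, q):
        """Return F(theta, A) for Euclidean weighted NN, with theta = (w, q)."""
        w = np.asarray(w, dtype=float)
        q = np.asarray(q, dtype=float)

        def F(a):
            a = np.asarray(a, dtype=float)
            return float(np.sqrt(np.sum(w * (a - q) ** 2)))

        return F

    # Weights and the query point can change on a per-query basis.
    F = wnn_objective(w=[2.0, 0.5], q=[1.0, 2.0])
    print(F([4.0, 6.0]))  # weighted distance from candidate point (4, 6) to (1, 2)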
Linear Optimization Queries - The objective function of a Linear Optimization
query is

L(a_1, a_2, . . . , a_n) = c_1 a_1 + c_2 a_2 + . . . + c_n a_n,

where (c_1, c_2, . . . , c_n) ∈ R^n are coefficients and can be different for different queries,
and the constraints form a polyhedron. Since linear functions are convex and poly-
hedrons are convex, linear optimization queries are also special cases of CP queries
with a parametric function form.
Top-k Queries - The objective functions of top-k queries are score (or ranking) func-
tions over the set of attributes. The common score functions are sum and average,
but the score could be any arbitrary function over the attributes. Clearly, sum and average
are special cases of linear optimization queries. If the ranking function in question is
convex, the top-k query is a CP query.
Range Queries - A range query asks for all data (a_1, a_2, . . . , a_n) s.t. l_i ≤ a_i ≤ u_i, for
some (or all) dimensions i, where [l_i, u_i] is the range for dimension i. Construct an
objective function f(a_1, a_2, . . . , a_n) = C, where C is any constant, and constraints

g_i = l_i − a_i ≤ 0, g_(n+i) = a_i − u_i ≤ 0, i = 1, 2, . . . , n.

Any points that are within the constraints will get the objective value C. Points
that are not within the constraints are pruned by those constraints. Those points
that remain all have an objective value of C and, because they all have the same
maximum (minimum) objective value, they form the solution set of the range query.
Therefore, range queries can be formed as special cases of CP queries with constant
objective functions, applying the range limits as constraints. Also note that our model
can deal not only with the ‘traditional’ hyper-rectangular range queries described
above, but also with ‘irregular’ range queries such as l ≤ a_1 + a_2 ≤ u and l ≤ 3a_1 − 2a_2 ≤ u.
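In code, the same construction looks as follows (an added sketch assuming cvxpy; it is not from the original text). The objective is a constant, so all of the work is done by the constraints, including an ‘irregular’ linear range constraint; a solver status of infeasible means the region, or a partition intersected with it, can be pruned outright.

    import cvxpy as cp

    a = cp.Variable(2)  # (a_1, a_2)

    # Constant objective: every feasible point receives the same value C.
    objective = cp.Minimize(cp.Constant(0.0))

    constraints = [
        a[0] >= 1, a[0] <= 4,                  # traditional range on a_1
        a[1] >= 0, a[1] <= 10,                 # traditional range on a_2
        a[0] + a[1] >= 2, a[0] + a[1] <= 8,    # irregular range l <= a_1 + a_2 <= u
    ]

    prob = cp.Problem(objective, constraints)
    prob.solve()
    print(prob.status)  # 'optimal' if the constrained region is non-empty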
2.3.3 Optimization Query Applications
Any problem that can be rewritten as a convex optimization problem can be
cast to our model, and solved using our framework. In cases where the optimiza-
tion problem or objective function parameters are not known in advance, this generic
framework can be used to solve the problem by efficiently accessing available access
structures. We provide a short list of potential optimization query applications that
could be relevant during scientific data exploration or business decision support where
the problem being optimized is continuously evolving.
Weighted Constrained Nearest Neighbors - Weighted nearest neighbor queries
give the ability to assign different weights for different attribute distances for nearest
neighbor queries. Weighted Constrained Nearest Neighbors (WCNN) find the clos-
est weighted distance object that exists within some constrained space. This type
of query would be applicable in situations where attribute distances vary in impor-
tance and some constraints to the result set are known. A potential application is
clinical trial patient matching where an administrator is looking to find the best can-
didate patients to fill a clinical trial. Some hard exclusion criteria are known and some
attributes are more important than others with respect to estimating participation
probability and suitability.
kNN with Adaptive Scoring - In some situations we may have an objective func-
tion that changes based on the characteristics of a population set we are trying to
fill. In such a case, we will change the nearest neighbor scoring weights dependent on
the current population set. As an example, again consider the patient clinical trial
matching application where we want to maintain a certain statistical distribution over
the trial participants. In a case where we want half of the participants to be male and
half female, we can adjust weights of the objective optimization function to increase
the likelihood that future trial candidates will match the currently underrepresented
gender.
Queries over Changing Attributes - The attributes involved in optimization
queries can vary based on the iteration of the query. For example, a series of opti-
mization queries may search a stock database for maximum stock gains over different
time intervals. The attributes involved in each query will be different.
Weights, constraints, functional attributes, and optimization functions themselves
can all change on a per-query basis. A database system that can effectively handle
the potential variations in optimization queries will benefit data exploration tasks.
In the examples listed above, each query or each member of a set of queries can
be rewritten as an optimization query in our model. This demonstrates the power
and flexibility that the user has to define data exploration queries and the examples
represent a small subset of the query types that are possible.
2.3.4 Approach Overview
We propose the use of Convex Optimization (CP) in order to traverse access
structures to find optimal data objects. The goal in CP is to optimize (minimize or
maximize) some convex function over data attributes. The solution can optionally
be subject to some convex criteria over the data attributes. Additionally, the data
attributes may be subject to lower and upper bounds.
All of the discussed applications can be stated as an objective function with an
objective of maximization or minimization. We can solve a continuous CP problem
over a convex partition of space by optimizing the function of interest within the
constraints of that space. Most access structures are built over convex partitions.
Particularly common partitions are Minimum Bounding Rectangles (MBRs). Rect-
angular partitions can be represented by lower and upper bounds over data attributes
and can be directly addressed by the CP problem bounds. Other convex partitions
can be addressed with data attribute constraints (such as x +y < 5). If the problem
under analysis has constraints in addition to the constraints imposed by the partition
boundaries, they can also be added to the CP problem as constraints.
Given the function, partition constraints, and problem constraints, we can use
CP to find the optimal answer for the function within the intersection of the problem
constraints and partition constraints. If no solution exists (problem constraints and
partition constraints do not intersect), the partition is found to be infeasible for the
problem. The partition and its children can be eliminated from further consideration.
Given a function, problem constraints, and a set of partitions, the partition the yields
the best CP result with respect to the function under analysis is the one that contains
some space that optimizes the problem under analysis better than the other partitions.
These partition functional values can be used to order partitions according to how
promising they are with respect to optimizing the function within problem constraints.
With our technique we endeavor to minimize the I/O operations and access struc-
ture search computations required to perform arbitrary model-based optimization
queries. The access structure partitions to search and evaluate during query process-
ing are determined by solving CP problems that incorporate the objective function,
model constraints, and partition constraints. The solution of these CP problems al-
lows us to prune partitions and search paths that can not contain an optimal answer
during query processing.
The CP literature discusses the efficiency of CP problem solution. In [42], it is
stated that a CP problem can be very efficiently solved with polynomial-time algo-
rithms. The readers can refer to [43] for a detailed review on interior-point polynomial
algorithms for CP problems. Current implementations of CP algorithms can solve
problems with hundreds of variables and constraints in very short times on a personal
computer [38]. CP problems can be so efficiently solved that Boyd and Vandenberghe
stated “if you formulate a practical problem as a convex optimization problem, then
you have solved the original problem.” ([10] page 8). We are in effect replacing the
bound computations from existing access structure distance optimization algorithms
with a CP problem that covers the objective function and constraints. Although
such computations can be more complex, we found very little time difference in the
functions we examined.
We focus on query processing techniques by presenting a technique for general CP
objective functions, which can be applied to existing space or data partitioning-based
indices without losing the capabilities of those indexing techniques to answer other
types of queries. The basic idea of our technique is to divide the query processing into
two phases: First, solve the CP problem without considering the database (i.e., in
continuous space). This “relaxed” optimization problem can be solved efficiently in
the main memory using many possible algorithms in the convex optimization litera-
ture. It can be solved in polynomial time in the number of dimensions in the objective
functions and the number of constraints ([43]). Second, search through the index by
pruning the impossible partitions and access the most promising pages. Details for
our technique for different index types are provided in the following section.
2.4 Query Processing Framework
By solving CP problems associated with the problem constraints, partition con-
straints, and objective function, we find the best possible objective function value
achievable by a point in the partition. Thus, there exists a lower bound for each
partition in the database that can be used to prune the data space before retrieving
pages from disk. The bounds can be defined for any type of convex partition (such
as minimum bounding regions) used for indexing the data.
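A minimal sketch of this per-partition bound (hypothetical helper names, assuming the cvxpy package; the dissertation treats the CP solver as a black box) is shown below. It combines the MBR's lower and upper bounds with the query's own convex constraints and returns the partition's best achievable objective value, or +∞ when the partition does not intersect the feasible region.

    import cvxpy as cp
    import numpy as np

    def partition_lower_bound(objective_expr, x, query_constraints, mbr_lo, mbr_hi):
        """Best objective value achievable inside an MBR, or +inf if infeasible.

        objective_expr    -- convex cvxpy expression F(theta, A) over variable x
        query_constraints -- list of convex cvxpy constraints from the query model
        mbr_lo, mbr_hi    -- per-dimension bounds of the (convex) partition
        """
        constraints = list(query_constraints)
        constraints += [x >= np.asarray(mbr_lo), x <= np.asarray(mbr_hi)]
        prob = cp.Problem(cp.Minimize(objective_expr), constraints)
        prob.solve()
        if prob.status not in ("optimal", "optimal_inaccurate"):
            return float("inf")  # partition and feasible region do not intersect
        return prob.value

    # Example: weighted NN objective with an extra query constraint x_1 + x_2 <= 6.
    x = cp.Variable(2)
    F = cp.sum(cp.multiply(np.array([2.0, 0.5]), cp.square(x - np.array([1.0, 2.0]))))
    print(partition_lower_bound(F, x, [x[0] + x[1] <= 6], [3.0, 3.0], [5.0, 5.0]))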
Figure 2.2 is a functional block diagram of the optimization query processing sys-
tem that shows interactions between components. Query processing proceeds first by
the user providing an objective function and constraints to the system. A lightweight
query compiler validates the user input, converts the function and constraints into
appropriate format for the query processor, and executes the appropriate query pro-
cessing algorithm using the specified access structure. The query processor invokes
a CP solver to find bounds for partitions that could potentially contain answers to
the query. By solving CP subproblems using index partition boundaries as additional
constraints, we can ensure that we access pages in the order of how promising they
are. The user gets back those data object(s) in the database that answer the query
within the constraints.
The generic query processing framework for a CP query is described below. This
framework can be applied to indexing structures in which the partitions on which
they are built are convex because the partition boundaries themselves are used to
form the new CP problems.
A popular taxonomy of access structures divides the structures into the categories
of hierarchical and non-hierarchical structures. Hierarchical structures are usually
partitioned based on data objects while non-hierarchical structures are usually based
on partitioning space. These two families have important differences that affect how
queries are processed over them. Therefore we provide algorithms for hierarchical
structures in Subsection 2.4.1 and non-hierarchical structures in 2.4.2.

Figure 2.2: Optimization Query System Functional Block Diagram
The overall idea is to traverse the access structure and solve the problem in se-
quential order of how good partitions and search paths could be. We terminate
when we come to a partition that could not contain an optimal answer to the query.
The implementations described address a minimization objective function and prune
based on comparison to minimum values. Solutions to maximization problems can be
performed by using maximums (i.e. “worst” or upper bounds) in place of minimums.
The number of results that are returned by the query processing is dependent on
an input k. If the user desires only the single best result, k = 1. It can, however, be
an arbitrary value in order to return the k best data objects. For range queries, if all
objects that meet the range constraints are desired, the value of k should be set to n,
the number of data points. Only the data points within the constrained region will be
returned, and only the points that are within the lowest-level partitions that intersect
the constrained region will be examined during query processing. Therefore, the query
processing will be at least as I/O-efficient as standard range query processing over a
given access structure.
Now, we describe the implementations of this framework based on hierarchical
and non-hierarchical indexing techniques. In Section 2.5 we show that both of these
algorithms are in fact optimal, i.e., given an access structure, no unnecessary MBRs
(or partitions) are accessed during query processing.
2.4.1 Query Processing over Hierarchical Access Structures
Algorithm 1 shows the query processing over hierarchical access structures. This
algorithm is applicable to any hierarchical access structure where the partitions at
each tree level are convex. A partial list of structures covered by this algorithm
includes the R-tree, R*-tree, R+-tree, BSP-tree, Quad-tree, Oct-tree, k-d-B-tree,
B-tree, B+-tree, LSD-tree, Buddy-tree, P-tree, SKD-tree, GBD-tree, Extended k-D-
Tree, X-tree, SS-tree, and SR-tree [25, 9].
The algorithm is analogous to existing optimal NN processing [31], except that
partitions are traversed in order of promise with respect to an arbitrary convex func-
tion, rather than distance. Additionally, our algorithm also prunes out any partitions
and search paths that do not fall within problem constraints.
We solve a new CP problem for each partition that we traverse using the objective
function, original problem constraints, and the constraints imposed by the partition
boundaries. The algorithm takes as input the convex optimization function of interest
OF, the set of convex constraints C, and a desired result set size k. In lines 1-3, we
initialize the structures that we maintain throughout a query. We keep a vector of
the k best points found so far as well as a vector of the objective function values for
these points. We keep a worklist of promising partitions along with the minimum
objective value for the partition. We initialize the worklist to contain the tree root
partition.
We then process entries from the worklist until the worklist is empty. In line 5,
we remove the first promising partition. If this partition is a leaf partition, then we
access the points in the partition and compute their objective function values. If a
point is not within the original problem constraints, then it will not have a feasible
solution and will be discarded from further consideration. Points with objective
function values lower than the objective function value of the k-th best point so far
will cause the best point and best point objective function value vectors to be updated
accordingly.
If the partition under analysis is not a leaf, we find its child partitions. For each
of these child partitions, we solve the CP problem corresponding to the objective
function, original problem constraints, and partition constraints (line 11). This yields
the best possible objective function value achievable within the intersection of the
original problem constraints and partition constraints. If the minimum objective
function value is less than the k-th best objective function value (i.e., it is possible for
a point within the intersection of the problem constraints and partition constraints to
beat the k-th best point), then the partition is inserted into the worklist according to
its minimum objective value. If there is no feasible solution for the partition within
problem constraints, the partition will be dropped from consideration.

CPQueryHierarchical (Objective Function OF, Constraints C, k)
notes:
b[i] - the i-th best point found so far
F(b[i]) - the objective function value of the i-th best point
L - worklist of (partition, minObjVal) pairs
1: b[1]...b[k] ← null
2: F(b[1])...F(b[k]) ← +∞
3: L ← (root)
4: while L ≠ ∅
5:   remove first partition P from L
6:   if P is a leaf
7:     access and compute functions for points in P within constraints C
8:     update vectors b and F if better points discovered
9:   else
10:    for each child P_child of P
11:      minObjVal = CPSolve(OF, P_child, C)
12:      if minObjVal < F(b[k])
13:        insert P_child into L according to minObjVal
14:    end for
15:  end if
16:  prune partitions from L with minObjVal > F(b[k])
17: end while
18: return vector b

Algorithm 1: I/O Optimal Query Processing for Hierarchical Index Structures.
After a partition is processed, we may have updated our k-th best objective function
value. We prune any partitions in the worklist that can not beat this value.
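For concreteness, the following self-contained sketch (not taken from the dissertation's implementation) specializes Algorithm 1 to an unconstrained weighted NN query over axis-aligned MBRs. In this special case CPSolve reduces to clamping the query point to the box, so no external solver is needed; Node and the tiny in-memory tree are hypothetical stand-ins for a real hierarchical index.

    import heapq

    class Node:
        def __init__(self, lo, hi, children=None, points=None):
            self.lo, self.hi = lo, hi        # MBR bounds
            self.children = children or []   # child nodes (internal node)
            self.points = points or []       # data points (leaf node)

        def is_leaf(self):
            return not self.children

    def objective(p, q, w):
        # Weighted squared Euclidean distance (convex).
        return sum(wi * (pi - qi) ** 2 for pi, qi, wi in zip(p, q, w))

    def cp_solve_box(node, q, w):
        # Best objective value achievable inside the MBR: clamp q to the box.
        closest = [min(max(qi, lo), hi) for qi, lo, hi in zip(q, node.lo, node.hi)]
        return objective(closest, q, w)

    def cp_query_hierarchical(root, q, w, k=1):
        best = []  # (value, point) pairs, k-th best last
        worklist = [(cp_solve_box(root, q, w), id(root), root)]
        while worklist:
            bound, _, node = heapq.heappop(worklist)
            if len(best) == k and bound > best[-1][0]:
                break  # no remaining partition can beat the k-th best point
            if node.is_leaf():
                for p in node.points:
                    val = objective(p, q, w)
                    if len(best) < k or val < best[-1][0]:
                        best = sorted(best + [(val, p)])[:k]
            else:
                for child in node.children:
                    child_bound = cp_solve_box(child, q, w)
                    if len(best) < k or child_bound < best[-1][0]:
                        heapq.heappush(worklist, (child_bound, id(child), child))
        return [p for _, p in best]

    # Tiny example tree: two leaves under one root.
    leaf_a = Node(lo=(0, 0), hi=(2, 2), points=[(0.5, 0.5), (1.5, 1.8)])
    leaf_b = Node(lo=(4, 4), hi=(6, 6), points=[(5.0, 5.0)])
    root = Node(lo=(0, 0), hi=(6, 6), children=[leaf_a, leaf_b])
    print(cp_query_hierarchical(root, q=(1.0, 2.0), w=(1.0, 1.0), k=1))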
Query Processing Example As an example of query processing, consider Fig-
ure 2.3 with an optimization objective of minimizing the distance to the query point
within the constrained region (i.e. constrained 1-NN query). We initialize our work-
list with the root and start by processing it. Since it is not a leaf, we examine its child
partitions A and B. We solve for the minimum objective function value for A and do
not obtain a feasible solution because A and the constrained region do not intersect.
We do not process A further. We find the minimum objective function for partition B
and get the distance from the query point to the point where the constrained region
and the partition B intersect, point c. We insert partition B into our worklist and
since it is the only partition in the worklist, we process it next. Since B is not a
leaf, we examine its child partitions, B.1, B.2, and B.3. We find minimum objective
function values for each and obtain the distances from the query point to point d, e,
and f respectively. Since point e is closest, we insert partitions into the worklist in
the order B.2, B.1, and B.3.
We process B.2 and since it is a leaf, we compute objective function values for
the points in B.2. Point B.2.a is the closer of the two. This point is designated
as the best point thus far, and the best point objective function is updated. After
processing B.2, we know we can do no worse than the distance from the query point
to B.2.a. Since partition B.3 can not do better than this, we prune it from the
worklist. We examine points in B.1. Point B.1.b is not a feasible solution (it is not
in the constrained region) so it is dropped. Point B.1.a gets a value which is better
than the current best point. We update the best point to be B.1.a. Since the worklist
is now empty, we have completed the query and return the best point.
The optimal point for this optimization query is B.1.a. We examine
only points in partitions that could contain points as good as the best solution.
Figure 2.3: Example of Hierarchical Query Processing
2.4.2 Query Processing over Non-hierarchical Access Structures
In this section, we show that the general framework for CP queries can also be
implemented on top of non-hierarchical structures. Algorithm 2 can be applied to
non-hierarchical access structures where the partitions are convex. A partial list of
structures where the algorithm can be applied is Linear Hashing, Grid-File, EXCELL,
VA-File, VA+-File, and Pyramid Technique [25, 9, 53, 23].
We use a two stage approach detailed in Algorithm 2 to determine which data
objects optimize the query within the constraints. This technique is similar to the
technique identified in [53] for finding NN for a non-hierarchical structure, but we use
the objective function rather than a distance and also consider original problem con-
straints to find lower bounds. We take the objective function, OF, original problem
constraints C, and result set size k as inputs. We initialize the structures to be used
during query processing (lines 1-3). In stage 1, we process each populated partition
in the structure by solving the CP problem that corresponds to the original objec-
tive function and constraints combined with the additional constraints imposed by the
partition boundaries (line 5). Those partitions that do not intersect the constraints
will be bypassed. For those that do have a feasible solution, the lower bound of the
CP problem is recorded. At the end of stage 1, we place unique unpruned partitions
in increasing sorted order based on their lower bound. In stage 2, we process the
unpruned partitions by accessing objects contained by the first partition and com-
puting actual objective values. We maintain a sorted list of the top-k objects we have
found so far, and continue accessing points contained by subsequent partitions until
a partition's lower bound is greater than the k-th best value. By accessing partitions in order of how good they could be, according to the CP problem, we access only those points in cell representations that could contain points as good as the actual solution.
CPQueryNonHierarchical(Objective Function OF, Constraints C, k)
notes:
  b[i]    - the i-th best point found so far
  F(b[i]) - the objective function value of the i-th best point
  L       - worklist of (partition, minObjVal) pairs
Initialize:
 1: b[1]...b[k] ← null
 2: F(b[1])...F(b[k]) ← +∞
 3: L ← ∅
Stage 1:
 4: for each populated partition P
 5:   minObjVal = CPSolve(OF, P, C)
 6:   if minObjVal < +∞
 7:     insert (P, minObjVal) into L according to minObjVal
Stage 2:
 8: for i = 1 to |L|
 9:   if minObjVal of partition L[i] > F(b[k])
10:     break
11:   access objects that partition L[i] contains
12:   for each object o
13:     if OF(o) < F(b[k])
14:       insert o into b
15:       update F(b)

Algorithm 2: I/O Optimal Query Processing over Non-hierarchical Structures.
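As a rough illustration of the two-stage traversal, here is a small Python sketch of Algorithm 2 for a flat, grid-like partitioning such as a VA-file. It makes the same simplifying assumptions as the previous sketch: cells are axis-aligned boxes, a MINDIST-style bound for a weighted Euclidean objective plays the role of CPSolve, and the original constraints are checked only when objects are accessed. The cell layout and names are hypothetical.

import math

def weighted_distance(query, weights, point):
    return math.sqrt(sum(w * (q - x) ** 2 for q, x, w in zip(query, point, weights)))

def cell_lower_bound(query, weights, cell_lo, cell_hi):
    clamped = [min(max(q, lo), hi) for q, lo, hi in zip(query, cell_lo, cell_hi)]
    return weighted_distance(query, weights, clamped)

def cp_query_non_hierarchical(query, weights, feasible, cells, k):
    # cells: list of (cell_lo, cell_hi, objects) for the populated partitions.
    # Stage 1: lower-bound every populated partition and sort by the bound.
    bounded = sorted(((cell_lower_bound(query, weights, lo, hi), objs)
                      for lo, hi, objs in cells), key=lambda t: t[0])
    # Stage 2: scan partitions in bound order; stop once the next bound
    # exceeds the k-th best objective value found so far.
    best = []                              # (value, object), kept sorted, size <= k
    for bound, objs in bounded:
        if len(best) == k and bound > best[-1][0]:
            break
        for o in objs:
            if not feasible(o):            # original problem constraints C
                continue
            val = weighted_distance(query, weights, o)
            if len(best) < k or val < best[-1][0]:
                best.append((val, o))
                best.sort(key=lambda e: e[0])
                del best[k:]
    return best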
2.4.3 Non-Covered Access Structures
A review of access structure surveys yielded few structures that are not covered by the algorithms. These fall into three groups: structures that do not guarantee spatial locality (e.g., space-filling curves, which provide only approximate ordering); structures built over values computed for a specific function, such as distance in M-trees (since the original attribute information is lost, they cannot be applied to general functions over the original attributes); and structures that contain non-convex regions (BANG-File, hB-tree, BV-tree) [25, 9]. Note that M-trees can still answer queries for the functions they are built over, and structures with non-convex leaves can still be optimally traversed down to the level of non-convexity.
2.5 I/O Optimality
In this section, we will show that the proposed query processing technique achieves optimal I/O for each implementation in Section 2.4. We denote the objective function contour which goes through the actual optimal data point in the database as the optimal contour. This contour also goes through all other points in continuous space that yield the same objective function value as the optimal point. The k-th optimal contour would pass through all points in continuous space that yield the same objective function value as the k-th best point in the database. In the following arguments, we use the optimal contour, but it could be replaced with the k-th optimal contour.
Following the traditional definition in [5], we define the optimality as follows.
Definition 3 (Optimality) An algorithm for optimization queries is optimal iff it
retrieves only the pages that intersect the intersection of the constrained region R and
the optimal contour.
For hierarchical tree based query processing, the proof is given in the following.
Lemma 1 The proposed query processing algorithm is I/O optimal for CP queries
under convex hierarchical structures.
Proof. It is easy to see that any partition intersecting the intersection of optimal
contour and feasible region R will be accessed since they are not pruned in any phase
of the algorithm. We need to prove that only these partitions are accessed, i.e., that other partitions will be pruned without accessing their pages.
Figure 2.4 shows different cases of the various position relations between a partition, R, and the optimal contour in 2 dimensions. Point a is an optimal data point in the database. We will use minObjVal(Partition, R) to denote the minimum objective function value that is computed for the partition and constraints R.
A: intersects neither R nor the optimal contour.
B: intersects R but not the optimal contour.
C: intersects the optimal contour but not R.
D: intersects the optimal contour and R, but not the intersection of optimal con-
tour and R.
E: intersects the intersection of the optimal contour and R.
Figure 2.4: Cases of MBRs with respect to R and optimal objective function contour
A and C will be pruned since they are infeasible regions: when the CP problems for these partitions are solved, they are eliminated from consideration because they do not intersect the constrained region and minObjVal(A, R) = minObjVal(C, R) = +∞. B is pruned because minObjVal(B, R) > F(a), and E will be accessed earlier than B in the algorithm because minObjVal(E, R) < minObjVal(B, R). We shall show
that a page in case D is pruned by the algorithm. Let CP_0 be the data partition that contains an optimal data point, CP_1 be the partition that contains CP_0, ..., and CP_k be the root partition that contains CP_0, CP_1, ..., CP_{k-1}. Because the objective function is convex, we have

F(a) ≥ minObjVal(CP_0, R) ≥ minObjVal(CP_1, R) ≥ ... ≥ minObjVal(CP_k, R)

and

minObjVal(D, R) > F(a) ≥ minObjVal(CP_0, R) ≥ minObjVal(CP_1, R) ≥ ... ≥ minObjVal(CP_k, R)
During the search process of our algorithm, CP_k is replaced by CP_{k-1}, CP_{k-1} by CP_{k-2}, and so on, until CP_0 is accessed. If D is to be accessed at some point during the search, then D must be first in the MBR list at some time. This can only occur after CP_0 has been accessed, because minObjVal(D, R) is larger than the constrained function value of any partition containing the optimal data point. If CP_0 is accessed earlier than D, however, the algorithm prunes all partitions which have a constrained function value larger than F(a), including D. This contradicts the assumption that D will be accessed.
The I/O-optimality relates to accessing data from leaf partitions. However, the
algorithm is also optimal in terms of the access structure traversal, that is, we will
never retrieve the objects within an MBR (leaf or non-leaf), unless that MBR could
contain an optimal answer. The proof applies to any partition, whether it contains
data or other partitions.
Similarly, we can prove the I/O optimality of the proposed algorithm over non-hierarchical structures.
Lemma 2 The proposed query processing algorithm is I/O optimal for CP queries
over non-hierarchical convex structures.
Proof. Figure 2.5 shows different possible cases of the partition position. Partitions
A, B, C, D, and E are the same as defined for Figure 2.4.
Figure 2.5: Cases of partitions with respect to R and optimal objective function
contour
Without loss of generality, assume that only these five partitions contain some
data points, and partition E contains an optimal data point, a, for a query F(x, y).
Then an optimal algorithm retrieves only partition E.
Partitions C and A will not be retrieved because they are infeasible and our
algorithm will never retrieve infeasible partitions. Partition B will not be retrieved
because partition E will be retrieved before partition B (because minObjVal(E, R) < minObjVal(B, R)), and the best data point found in E will prune partition B. Thus,
we only need to show that partition D will not be retrieved.
Because partition D does not intersect the intersection of the optimal contour and R but E does, minObjVal(D, R) > minObjVal(E, R). Therefore, partition E must be retrieved before partition D, since the algorithm always retrieves the most promising partition first, and the data point a is found at that time. Suppose partition D is also retrieved later. This means partition D cannot be pruned by data point a, i.e., F(a) ≥ minObjVal(D, R), which also means the virtual solution corresponding to minObjVal(D, R) is contained by the optimal contour (because of the convexity of function F). Therefore, the virtual solution corresponding to minObjVal(D, R) ∈ D ∩ R ∩ optimal contour. Hence, we can find at least one virtual point x′ (without considering the database) in partition D such that x′ is both in R and the optimal contour. This contradicts the assumption that partition D does not intersect the intersection of R and the optimal contour.
2.6 Performance Evaluation
The goals of the experiments are to show i) that the proposed framework is general and can be used to process a variety of queries that optimize some objective, ii) that the generality of the framework does not impose a significant CPU burden over those queries that already have an optimal solution, and iii) that non-relevant data subspace
can be effectively pruned during query processing.
We performed experiments for a variety of optimization query types including
kNN, Weighted kNN, Weighted Constrained kNN, and Constrained Weighted Linear
Optimization. Where possible, we compare CPU times against a non-general optimization technique. We index and perform queries over four real data sets. One of these
is Color Histogram data, a 64-dimensional color image histogram dataset of 100,000
data points. The vectors represent color histograms computed from a commercial
CD-ROM. The second is Satellite Image Texture (Landsat) data, which consists of
100,000 60-dimensional vectors representing texture feature vectors of Landsat images
[40]. These datasets are widely used for performance evaluation of index structures
and similarity search algorithms [39, 26]. We used a clinical trials patient data set
and a stock price data set to perform real world optimization queries.
Weighted kNN Queries Figure 2.6 shows the performance of our query pro-
cessing over R*-trees for the first 8-dimensions of the histogram data for both kNN
and k weighted NN (k-WNN) queries. The index is built over the first 8 dimensions
of the histogram data and the queries are generated to find the kNN or k-WNN to a
random point in the data space over the same 8 dimensions. We execute 100 queries
for each trial, and randomly assign weights between 0 and 1 for the k-WNN queries.
We vary the value of k between 1 and 100. We assume the index structure is in
memory and we measure the number of additional page accesses required to access
points that could be kNN according to the query processing. Because the weights
are 1 for the kNN queries, our algorithm matches the I/O-optimal access offered by
traditional R-tree nearest neighbor branch and bound searching.
The figure shows that we achieve nearly the same I/O performance for weighted
kNN queries using our query processing algorithm over the same index that is ap-
propriate for traditional non-weighted kNN queries. For unconstrained, unweighted
nearest neighbor queries, the optimal contour is a hyper-sphere around the query point
with a radius equal to the distance of the nearest point in the data set. If instead, the
queries are weighted nearest neighbor queries, the optimal contour is a hyper-ellipse.
Figure 2.6: Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram, 100k pts

Intuitively, as the weights of a weighted nearest neighbor query become more skewed, the hyper-volume enclosed by the optimal contour increases, and the opportunity to intersect with access structure partitions increases. Results for the weighted queries
track very closely to the non-weighted queries. This indicates that these weighted
functions do not cause the optimal contour to cross significantly more access struc-
ture partitions than their non-weighted counterparts. For a weighted nearest neighbor
query, we could generate an index based on attribute values transformed based on
the weights and yield an optimal contour that was hyper-spherical. We could then
traverse the access structure using traditional nearest neighbor techniques. However,
this would require generating an index for every potential set of query weights, which
is not practical for applications where weights are not typically known prior to the
query. For similar applications where query weights can vary, and these weights are
not known in advance, we would be better served to process the query over a single
reasonable access structure in an I/O-optimal way than by building specific access
structures for a specific set of weights.
Figure 2.7 shows the average CPU times required per query to traverse the access
structure using convex problem solutions for the weighted kNN queries compared to
the standard nearest neighbor traversal used for unweighted kNN queries.
Figure 2.7: CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram,
100k pts
The figure shows that the generic CP-driven solution takes slightly longer to per-
form than a specialized solution for nearest neighbor queries alone. Part of this
difference is due to the slightly more complicated objective function (weighted ver-
sus unweighted Euclidean distance), part is due to the additional partitions that
the optimal contour crosses, while the remaining part is due to incorporating data
set constraints (unused in this example) into computing minimum partition function
values.
Hierarchical data partitioning access structures are not effective for high-dimensional
data sets. As data set dimensionality increases, the partitions overlap with each other
more. As partitions overlap more, the probability that a point’s objective function
value can prune other partitions decreases. Therefore, the same characteristics that
make high-dimensional R-tree type structures ineffective for traditional queries also
make them ineffective for optimization queries. For this reason, we perform similar
experiments for high-dimensional data using VA-files.
Figure 2.8 shows page access results for kNN and k-WNN queries over a 60-dimensional, 5-bit-per-attribute VA-file built over the Landsat dataset. The experiments assume
the VA-file is in memory and measure the number of objects that need to be accessed
to answer the query.
Similar to the R*-tree case, the results show that the performance of the algorithm
is not significantly affected by re-weighting the objective function. Sequential scan
would result in examining 100,000 objects. If the dataspace were more uniformly
populated, we would expect to see object accesses much closer to k. However, much
of the data in this data set is clustered, and some vector representations are heavily
populated. If a best answer comes from one of these vector representations, we need to look at each object assigned the same vector representation.

Figure 2.8: Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts
It should be noted that while the framework results in optimal access of a given
structure, it can not overcome the inherent weaknesses of that structure. For example,
at high dimensionality, hierarchical data structures are ineffective, and the framework
can not overcome this. A consequence of VA-File access structures is that, without
some other data reorganization, a computation needs to occur for every approximated
object. As distance computations need to be performed for each vector in traditional
VA-file nearest neighbor algorithms, following the VA-file processing model so too
must a CP problem be solved for each vector for general optimization queries. Times
38
for this experiment reflect this fact. It takes longer to perform each query but the
ratio of the general optimization query times to the traditional kNN CPU times is
similar to the R*-tree case, about 1.006 (i.e. very little computational burden is
added to allow query generality).
Constrained Weighted Linear Optimization Queries We also explore Con-
strained Weighted-Linear Optimization queries. Figure 2.9 shows the number of page
accesses required to answer randomly weighted LP minimization queries for the his-
togram data set using the 8-D R*-Tree access structure. We vary k and we vary the
constrained area region fixed at the origin as a 8-D hypercube with the indicated
value. We generated 100 queries for each data point in the form of a summation of
weighted attributes over the 8 dimensions. Weights are uniformly randomly generated
and are between the values of -10 and 10.
The graph shows interesting results. The number of page accesses is not directly
related to the constraint size. The rate of increase in page accesses to accommodate
different values of k does vary with the constraint size. When weights can be negative,
the corner corresponding to the minimum constraint value for the positively weighted
attributes, and the maximum constrained value for the negatively weighted attributes
will yield the optimal result within the constrained region (e.g. corner [0,1,0] would
be an optimal minimization answer for a constrained unit cube and a linear function
with a weight vector [+,-,+]). For this data set, many points are clustered around
the origin, leading to denser packing of access structure partitions near the origin.
Because of the sparser density of partitions around the point that optimizes the query,
we access fewer partitions when this particular problem is less constrained. However,
the radius of the k-th optimal contour increases more in these less dense regions than in more clustered regions, leading to a faster increase in the number of page accesses required when k increases.

Figure 2.9: Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree, 8-D Color Histogram, 100k pts
Note that the Onion technique is not feasible at this dataset dimensionality, and neither it nor the other competing techniques are appropriate for convex constrained areas. For comparative purposes, the sequential scan of this data set is
834 pages.
We should also note what happens when there are fewer than k optimal answers in the data set. This means that there are fewer than k objects in our constrained region. For our query processing, we only need to examine those points that are in partitions that intersect the constrained region. A technique that did not evaluate constraints prior to or during query processing would need to examine every point.
Queries with Real Constraints The previous example showed results with
synthetically generated constraints. We also wanted to explore constrained optimization problems for real applications. We performed Weighted Constrained kNN experiments to emulate a patient matching data exploration task described in Section 2.3.3. To test this, we performed a number of queries over a set of 17000 real patient records. We established constraints on age and gender and queried to find the 10 weighted nearest neighbors in the data set according to 7 measured blood analytes. Using the VA-file structure with 5 bits assigned per attribute and 100 randomly selected, randomly weighted queries, we achieved an average of 11.5 vector accesses to find the k = 10 results. This means that on average the number of vectors within the intersection of the constrained space and the 10-th optimal contour is 11.5. This corresponds to a vector access ratio of 0.0007.
Arbitrary Functions over Different Access Structures In order to show that
the framework can be used to find optimal data points for arbitrary convex functions,
we generated a set of 100 random functions. Unlike nearest neighbor and linear
optimization type queries, the continuous optimal solution is not intuitively obvious
by examining the function. Functions contain both quadratic and linear terms and
were generated over 5 dimensions in the form

Σ_{i=1}^{5} (random() · x_i^2 − random() · x_i),

where x_i is the data attribute value for the i-th dimension. A sample function, where coefficients are rounded to two decimal places, is to minimize

0.91x_1^2 + 0.49x_2^2 + 0.4x_3^2 + 0.55x_4^2 + 0.76x_5^2 − 0.09x_1 − 0.49x_2 − 0.28x_3 − 0.91x_4 − 0.51x_5.
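A small Python sketch of such a generator is given below; it is only an assumption about the exact form, matching the formula above. Because the quadratic coefficients produced by random() are non-negative, every generated objective is convex and therefore falls within the scope of the framework.

import random

def make_random_objective(dims=5, seed=None):
    rng = random.Random(seed)
    quad = [rng.random() for _ in range(dims)]   # coefficients of x_i^2 (non-negative)
    lin = [rng.random() for _ in range(dims)]    # coefficients of x_i
    def objective(x):
        return sum(q * xi * xi - l * xi for q, l, xi in zip(quad, lin, x))
    return objective, quad, lin

# Example: generate one 5-dimensional objective and evaluate it at a point.
f, quad, lin = make_random_objective(seed=7)
print([round(c, 2) for c in quad], [round(c, 2) for c in lin])
print(round(f([0.1, 0.2, 0.3, 0.4, 0.5]), 4))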
We explored the results of our technique over 5 index structures. These include
3 hierarchical structures, including the R-tree, the R*-tree, and the VAM-split bulk
loaded R*-tree. We explored 2 non-hierarchical structures, the grid file and the VA-
file. For each of these index types, we modified code to include an optimization
function that uses CP to traverse the index structure using the algorithm appropriate
for the structure. We added a new set of 50000 uniform random data points within
the 5-dimension hypercube and built each of the index structures over this data set.
We ran the set of random functions over each structure and measured the number
of objects in the partitions that we retrieve. For the hierarchical structures these are
leaf partitions and for the non-hierarchical structures they are grid cells. For each
structure we try to keep the number of partitions relatively close to the others. The number of partitions for the hierarchical structures is between 12000 and 13000, for the grid file 16000, and for the VA-File 32000.
Figure 2.10 shows results over the 100 functions. It shows the minimum, maxi-
mum, and average object access ratio to guarantee discovery of the optimal answer
over the set of functions. Each index yielded the same correct optimal data point for
each function, and these optimal answers were scattered throughout the data space.
Each structure performed well for these optimization queries; the average ratio of data points accessed is below 0.0025 for all the access structures. Even the worst
performing structure over the worst case function still prunes over 99% of the search
space.
Results between hierarchical access structures are as expected. The R*-tree looks
to avoid some of the partition overlaps produced by the R-tree insertion heuristics
and it shows better performance. The VAM-split R*-tree bulk loads the data into the
index and can avoid more partition overlaps than an index built using dynamic insertion. As expected, this yields better performance than the dynamically constructed R*-tree.

Figure 2.10: Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform Random, 50k Points
The VA-File has greater resolution than the grid-file in this particular test, and
therefore achieves better access ratios.
Incorporating Problem Constraints during Query Processing The pro-
posed framework processes queries in conjunction with any set of convex problem
constraints. We compare the proposed framework to alternative methods for han-
dling constraints in optimization query processing. For a constrained NN type query
we could process the query according to traditional NN techniques and when we
find a result, check it against the constraints until we find a point that meets the
constraints. Alternatively, we could perform a query on the constraints themselves,
and then perform distance computations on the result set and select the best one.
Figure 2.11 shows the number of pages accessed using our technique compared to
these two alternatives as the constrained area varies. Constrained kNN queries were
performed over the 8-D R*-tree built for the Color Histogram data set.
Figure 2.11: Pages Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram,
100k Points
In the figure, the R*-tree line shows the number of page accesses required when
we use traditional NN techniques, and then check if the result meets the constraints.
As the problem becomes less constrained, it becomes more likely that a discovered NN will fall within the constraints, and the search can terminate. The Selectivity line is a
lower bound on the number of pages that would be read to obtain the points within
the constraints. Clearly, as the problem becomes more constrained, fewer pages need to be read to find potential answers. The CP line shows the results for our
framework, which processes original problem constraints as part of the search process
and does not further process any partitions that do not intersect constraints.
We demonstrate better results with respect to page accesses except in cases where
the constraints are very small (and we can prune much of the space by performing a
query over the constraints first) or when the NN problem is not constrained (where
our framework reduces to the traditional unconstrained NN problem). When there
are problem constraints, we prune search paths that can not lead to feasible solutions.
Our search drills down to the leaf partition that either contains the optimal answer
within the problem constraints, or yields the best potential answer within constraints.
2.7 Conclusions
In this chapter we present a general framework to model optimization queries.
We present a query processing framework to answer optimization queries in which
the optimization function is convex, query constraints are convex, and the underlying
access structure is made up of convex partitions. We provide specific implementations
of the processing framework for hierarchical data partitioning and non-hierarchical
space partitioning access structures. We prove the I/O optimality of these implemen-
tations. We experimentally show that the framework can be used to answer a number
of different types of optimization queries including nearest neighbor queries, linear
function optimization, and non-linear function optimization queries.
The ability to handle general optimization queries within a single framework comes
at the price of increasing the computational complexity when traversing the access
structure. We show a slight increase in CPU times in order to perform convex pro-
gramming based traversal of access structures in comparison with equivalent non-
weighted and unconstrained versions of the same problem.
Since constraints are handled during the processing of partitions in our framework,
and partitions and search paths that do not intersect with problem constraints are
pruned during the access structure traversal, we do not need to examine points that
do not intersect with the constraints. Furthermore, we only access points in partitions
that could potentially have better answers than the actual optimal database answers.
This results in a significant reduction in required accesses compared to alternative
methods in the cases where the inclusion of constraints makes these alternatives less
effective.
The proposed framework offers I/O-optimal access of whatever access structure is
used for the query. This means that given an access structure we will only read points
from partitions that have minimum objective function values lower than the k-th actual best answer. Because the framework is built to best utilize the access structure,
it captures the benefits as well as the flaws of the underlying structure. Essentially,
it demonstrates how well the particular access structure isolates the optimal contour
within the problem constraints to access structure partitions. If the optimal contour
and problem constraints can be isolated to a few leaf partitions, the access structure
will yield nice results. However, if the optimal contour crosses many partitions, the
performance will not be as good. As such, the framework can be used to measure
page access performance associated with using different indexes and index types to
answer certain classes of optimization queries, in order to determine which struc-
tures can most effectively answer the optimization query type. Database researchers
and administrators can use this technique as a benchmarking tool to evaluate the
performance of a wide range of index structures available in the literature.
To use this framework, one does not need to know the objective function, weights,
or constraints in advance. The system does not need to compute a queried function
for all data objects in order to find the optimal answers. The technique provides
optimization of arbitrary convex functions, and does not incur a significant penalty
in order to provide this generality. This makes the framework appropriate for ap-
plications and domains where a number of different functions are being optimized
or when optimization is being performed over different constrained regions and the
exact query parameters are not known in advance.
CHAPTER 3
Online Index Recommendations for High-Dimensional
Databases using Query Workloads
When solving problems, dig at the roots instead of just hacking at the
leaves. - Anthony J. D’Angelo.
In order to effectively prune search space during query resolution, access struc-
tures that are appropriate for the query must be available to the query processor
at query run-time. Appropriate access structures should match well with query se-
lection criteria and allow for a significant reduction in the data search space. This
chapter describes a framework to determine a meaningful set of appropriate access
structures for a set of queries considering query cost savings estimated from both
query attributes and data characteristics. Since query patterns can change over time,
rendering statically determined access structures ineffective, a method for monitoring
performance and recommending access structure changes is provided.
3.1 Introduction
An increasing number of database applications, such as business data warehouses
and scientific data repositories, deal with high dimensional data sets. As the number
of dimensions/attributes and overall size of data sets increase, it becomes essential to
efficiently retrieve specific queried data from the database in order to effectively utilize
the database. Indexing support is needed to effectively prune out significant portions
of the data set that are not relevant for the queries. Multi-dimensional indexing,
dimensionality reduction, and RDBMS index selection tools all could be applied to the
problem. However, for high-dimensional data sets, each of these potential solutions
has inherent problems.
To illustrate these problems, consider a uniformly distributed data set of 1,000,000
data objects with several hundred attributes. Range queries are consistently executed
over five of the attributes. The query selectivity over each attribute is 0.1, so the
overall query selectivity is 1/10^5 (i.e., the answer set contains about 10 results).
An ideal solution would allow us to read from disk only those pages that contain
matching answers to the query. We could build a multi-dimensional index over the
data set so that we can directly answer any query by only using the index. However,
performance of multi-dimensional index structures is subject to Bellman’s curse of
dimensionality [4] and degrades rapidly as the number of dimensions increases. For
the given example, such an index would perform much worse than sequential scan.
Another possibility would be to build an index over each single dimension. The
effectiveness of this approach is limited to the amount of search space that can be
pruned by a single dimension (in the example the search space would only be pruned
to 100,000 objects).
For data-partitioning indexes such as the R-tree family of indexes, data is placed
in a partition that contains the data point and could overlap with other partitions.
To answer a query, all potentially matching search paths must be explored. As the
dimensionality of the index increases, the overlaps between partitions increase and
at high enough dimensions the entire data space needs to be explored. For space-
partitioning structures, where partitions do not overlap and data points are associated
with cells which contain them (e.g. grid files), the problem is the exponential explosion
of the number of cells. A 100 dimensional index with only a single split per attribute
results in 2^100 cells. The index structure can become intractably large just to identify
the cells. This phenomenon is thoroughly described in [54].
Another possible solution would be to use some dimensionality reduction tech-
nique, index the reduced dimension data space, and transform the query in the
same way that the data was transformed. However, the dimensionality reduction
approaches are mostly based on data statistics, and perform poorly especially when
the data is not highly correlated. They also introduce a significant overhead in the
processing of queries.
Another possible solution is to apply feature selection to keep the most important
attributes of the data according to some criteria and index the reduced dimension-
ality space. However, traditional feature selection techniques are based on selecting
attributes that yield the best classification capabilities. Therefore, they also select
attributes based on data statistics to support classification accuracy rather than focus-
ing on query performance and workload in a database domain. As well, the selected
features may offer little or no data pruning capability given query attributes.
Commercial RDBMS’s have developed index recommendation systems to identify
indexes that will work well for a given workload. These tools are optimized for the
domains for which these systems are primarily employed and the indexes that the
systems provide. They are targeted towards lower dimension transactional databases
and do not produce results optimized for single high-dimensional tables.
Our approach is based on the observation that in many high dimensional database
applications, only a small subset of the overall data dimensions are popular for a
majority of queries and recurring patterns of dimensions queried occur. For example,
Large Hadron Collider (LHC) experiments are expected to generate data with up to
500 attributes at the rate of 20 to 40 per second [45]. However, the search criterion
is expected to consist of 10 to 30 parameters. Another example is High Energy
Physics (HEP) experiments [49] where sub-atomic particles are accelerated to nearly
the speed of light, forcing their collision. Each such collision generates on the order
of 1-10MBs of raw data, which corresponds to 300TBs of data per year consisting
of 100-500 million objects. The queries are predominantly range queries and involve
mostly around 5 dimensions out of a total of 200.
We address the high dimensional database indexing problem by selecting a set of
lower dimensional indexes based on joint consideration of query patterns and data
statistics. This approach is also analogous to dimensionality reduction or feature
selection with the novelty that the reduction is specifically designed for reducing query
response times, rather than maintaining data energy as is the case for traditional
approaches. Our reduction considers both data and access patterns, and results in
multiple and potentially overlapping sets of dimensions, rather than a single set. The
new set of low dimensional indexes is designed to address a large portion of expected
queries and allow effective pruning of the data space to answer those queries.
Query pattern evolution over time presents another challenging problem. Re-
searchers have proposed workload based index recommendation techniques. Their
long term effectiveness is dependent on the stability of the query workload. However,
query access patterns may change over time becoming completely dissimilar from the
patterns on which the index set was originally determined. There are many common
reasons why query patterns change. Pattern change could be the result of periodic
time variation (e.g. different database uses at different times of the month or day), a
change in the focus of user knowledge discovery (e.g. a researcher discovery spawns
new query patterns), a change in the popularity of a search attribute (e.g. current
events cause an increase in queries for certain search attributes), or simply random
variation of query attributes. When the current query patterns are substantially dif-
ferent from the query patterns used to recommend the database indexes, the system
performance will degrade drastically since incoming queries do not benefit from the
existing indexes. To make this approach practical in the presence of query pattern
change, the index set should evolve with the query patterns. For this reason, we
introduce a dynamic mechanism to detect when the access patterns have changed
enough that either the introduction of a new index, the replacement of an existing
index, or the construction of an entirely new index set is beneficial.
Because of the need to proactively monitor query patterns and query performance
quickly, the index selection technique we have developed uses an abstract represen-
tation of the query workload and the data set that can be adjusted to yield faster
analysis. We generate this abstract representation of the query workload by mining
patterns in the workload. The query workload representation associates with each query the frequently occurring attribute sets that have non-empty intersections with that query's attributes. To estimate the query
cost, the data set is represented by a multi-dimensional histogram where each unique
value represents an approximation of data and contains a count of the number of
records that match that approximation. For each possible index for each query, the
estimated cost of using that index for the query is computed.
Initial index selection occurs by traversing the query workload representation and
determining which frequently occurring attribute set results in the greatest benefit
over the entire query set. This process is iterated until some indexing constraint is met
or no further improvement is achieved by adding additional indexes. Analysis speed
and granularity are affected by tuning the resolution of the abstract representations.
The number of potential indexes considered is affected by adjusting data mining
support level. The size of the multi-dimensional histogram affects the accuracy of the
cost estimates associated with using an index for a query.
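The selection loop described above can be summarized as a short greedy procedure. The Python sketch below is only an outline of that loop, assuming an estimated_cost(query, index) function backed by the multi-dimensional histogram described in Section 3.3.3 and a flat sequential-scan cost for queries no selected index helps; the function and parameter names are hypothetical.

def select_indexes(queries, candidates, estimated_cost, scan_cost, max_indexes=None):
    """Greedily pick candidate attribute sets that most reduce total estimated cost."""
    selected = []
    current = [scan_cost] * len(queries)      # best known cost per query (scan to start)
    remaining = list(candidates)
    while remaining and (max_indexes is None or len(selected) < max_indexes):
        best_gain, best_idx = 0, None
        for idx in remaining:
            gain = sum(max(0, current[i] - estimated_cost(q, idx))
                       for i, q in enumerate(queries))
            if gain > best_gain:
                best_gain, best_idx = gain, idx
        if best_idx is None:                  # no candidate improves any query further
            break
        selected.append(best_idx)
        remaining.remove(best_idx)
        for i, q in enumerate(queries):
            current[i] = min(current[i], estimated_cost(q, best_idx))
    return selected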
In order to facilitate online index selection, we propose a control feedback system
with two loops, a fine grain control loop and a coarse control loop. As new queries
arrive, we monitor the ratio of potential performance to actual performance of the
system in terms of cost and based on the parameters set for the control feedback
loops, we make major or minor changes to the recommended index set.
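As a rough sketch of this two-loop idea (the thresholds and the notion of a recent-query window are illustrative assumptions, not values from this work), the monitor below compares the cost the current index set achieves on recent queries against the cost an ideal index set could achieve, and escalates from a minor index change to a full re-selection as the ratio degrades.

def monitor(recent_queries, actual_cost, potential_cost,
            fine_threshold=0.8, coarse_threshold=0.5):
    """Return 'none', 'minor', or 'major' depending on how far the current
    index set has drifted from the best achievable performance."""
    actual = sum(actual_cost(q) for q in recent_queries)
    potential = sum(potential_cost(q) for q in recent_queries)
    ratio = potential / actual if actual else 1.0    # 1.0 means no drift
    if ratio < coarse_threshold:
        return 'major'    # coarse loop: rebuild the recommended index set
    if ratio < fine_threshold:
        return 'minor'    # fine loop: add or replace a single index
    return 'none'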
The contributions of this work can be summarized as follows:
1. Introduction of a flexible index selection technique designed for high-dimensional
data sets that uses an abstract representation of the data set and query work-
load. The resolution of the abstract representation can be tuned to achieve
either a high ratio of index-covered queries for static index selection or fast
index selection to facilitate online index selection.
2. Introduction of a technique using control feedback to monitor when online
query access patterns change and to recommend index set changes for high-
dimensional data sets.
3. Presentation of a novel data quantization technique optimized for query work-
loads.
4. Experimental analysis showing the effects of varying abstract representation
parameters on static and online index selection performance and showing the
effects of varying control feedback parameters on change detection response.
The rest of the chapter is organized as follows. Section 3.2 presents the related
work in this area. Section 3.3 explains our proposed index selection and control
feedback framework. Section 3.4 presents the empirical analysis. We conclude in
Section 3.5.
3.2 Related Work
High-dimensional indexing, feature selection, and DBMS index selection tools are
possible alternatives for addressing the problem of answering queries over a subspace
of high-dimensional data sets. As described below, each of these methods provides
less than ideal solutions for the problem of fast high-dimensional data access. Our
work differs from the related index selection work in that we provide an index selection
framework that can be tuned for speed or accuracy. Our technique is optimized
to take advantage of multi-dimensional pruning offered by multi-dimensional index
structures. It takes into consideration both data and query characteristics and can
be applied to perform real-time index recommendations for evolving query patterns.
3.2.1 High-Dimensional Indexing
A number of techniques have been introduced to address the high-dimensional
indexing problem, such as the X-tree [6], and the GC-tree [28]. While these index
structures have been shown to increase the range of effective dimensionality, they still
suffer performance degradation at higher index dimensionality.
3.2.2 Feature Selection
Feature selection techniques [7, 36, 29] are a subset of dimensionality reduction
targeted at finding a set of untransformed attributes that best represent the overall
data set. These techniques are also focused on maximizing data energy or classifica-
tion accuracy rather than query response. As a result, selected features may have no
overlap with queried attributes.
3.2.3 Index Selection
The index selection problem has been identified as a variation of the Knapsack
Problem and several papers proposed designs for index recommendations [34, 55, 3,
24, 17, 13] based on optimization rules. These earlier designs could not take advantage
of modern database systems’ query optimizer. Currently, almost every commercial
Relational Database Management System (RDBMS) provides the users with an index
recommendation tool based on a query workload and using the query optimizer to
obtain cost estimates. A query workload is a set of SQL data manipulation state-
ments. The query workload should be a good representative of the types of queries
an application supports.
Microsoft SQL Server’s AutoAdmin tool [15, 16, 1] selects a set of indexes for use
with a specific data set given a query workload. In the AutoAdmin algorithm, an
iterative process is utilized to find an optimal configuration. First, single dimension
candidate indexes are chosen. Then a candidate index selection step evaluates the
queries in a given query workload and eliminates from consideration those candidate
indexes which would provide no useful benefit. Remaining candidate indexes are eval-
uated in terms of estimated performance improvement and index cost. The process
is iterated for increasingly wider multi-column indexes until a maximum index width
threshold is reached or an iteration yields no improvement in performance over the
last iteration.
Costs are estimated using the query optimizer which is limited to considering those
physical designs offered by the DBMS. In the case of SQL Server, single and multi-level
B+-trees are evaluated. These index structures can not achieve the same level of result
pruning that can be offered by an index technique that indexes multiple dimensions
simultaneously (such as R-tree or grid file). As a result the indexes suggested by the
tool often do not capture the query performance that could be achieved for multi-
dimensional queries.
MAESTRO (METU Automated indEx Selection Tool)[19] was developed on top
of Oracle’s DBMS to assist the database administrator in designing a complete set of
primary and secondary indexes by considering the index maintenance costs based on
the valid SQL statements and their usage statistics automatically derived using SQL
Trace Facility during a regular database session. The SQL statements are classified by
their execution plan and their weights are accumulated. The cost function computed
by the query optimizer is used to calculate the benefit of using the index.
For DB2, IBM has developed the DB2Adviser [52] which recommends indexes
with a method similar to AutoAdmin with the difference that only one call to the
query optimizer is needed since the enumeration algorithm is inside the optimizer
itself.
These commercial index selection tools are coupled to physical design options
provided by their respective query optimizers and therefore, do not reflect the pruning
that could be achieved by indexing multiple dimensions together.
3.2.4 Automatic Index Selection
The idea of having a database that can tune itself by automatically creating new
indexes as the queries arrive has been proposed [35, 18]. In [35] a cost model is used
to identify beneficial indexes and decide when to create or drop an index at runtime.
[18] proposes an agent-based database architecture to deal with automatic index
creation. Microsoft Research has proposed a physical design alerter [11] to identify
when a modification to the physical design could result in improved performance.
3.3 Approach
3.3.1 Problem Statement
In this section we define the problem of index selection for a multi-dimensional
space using a query workload.
A query workload W consists of a set of queries that select objects within a
specified subspace in the data domain. More formally, we define a workload as follows:
Definition 4 A workload W is a tuple W = (D, DS, Q), where D is the domain,
DS ⊆ D is a finite subset (the data set), and Q (the query set) is a set of subsets of
DS.
In our case, the domain is ℝ^d, where d is the dimensionality of the dataset, the instance DS is a set of n tuples t = {(a_1, a_2, ..., a_d) | a_i ∈ ℝ}, and the query set Q is a set of range queries.
Finding the answers to a query, when no index is present, reduces to scanning all
the points in the dataset and testing whether the query conditions are met. In this
scenario we can define the cost of answering the query as the time it takes to scan the
dataset, i.e. the time to retrieve the data pages from disk. The assumption is that
the time spent doing I/O dominates the time required to perform the simple bound comparisons. In the case that an index is present, the cost of answering the query can be lower. The index can identify a smaller set of potential matching objects, and
only those data pages containing these objects need to be retrieved from disk. The
degree to which an index prunes the potential answer set for a query determines its
effectiveness for the query.
Our problem can be defined as finding a set of indexes I, given a multi-dimensional
dataset DS, a query workload W, an optional indexing constraint C, an optional
analysis time constraint t_a, that provides the best estimated cost over W. In the
context of this problem, an index is considered to be a set of attributes that can be used to prune the data space simultaneously with respect to each attribute. Therefore,
attribute order has no impact on the amount of pruning possible.
The overall goal of this work is to develop a flexible index selection framework
that can be tuned to achieve effective static and online index selection for high-
dimensional data under different analysis constraints.
For static index selection, when no constraints are specified, we are aiming to
recommend the set of indexes that yields the lowest estimated cost over the workload, covering every query that can benefit from an index. In the case when a
constraint is specified, either as the minimum number of indexes or a time constraint,
we want to recommend a set of indexes within the constraint, from which the queries
can benefit the most. When there is a time-constraint, we need to automatically
adjust the analysis parameters to increase the speed of analysis.
For online index selection, we are striving to develop a system that can recommend
an evolving set of indexes for incoming queries over time, such that the benefit of
index set changes outweighs the cost of making those changes. Therefore, we want an
online index selection system that differentiates between low-cost index set changes
and higher cost index set changes and also can make decisions about index set changes
based on different cost-benefit thresholds.
3.3.2 Approach Overview
In order to measure the benefit of using a potential index over a set of queries, we
need to be able to estimate the cost of executing the queries, with and without the
index. Typically, a cost model is embedded into the query optimizer to decide on the
query plan, whether the query should be answered using a sequential scan or using
an existing index. Instead of using the query optimizer to estimate query cost, we
conservatively estimate the number of matches associated with using a given index
by using a multi-dimensional histogram abstract representation of the dataset. The
histogram captures data correlations between only those attributes that could be rep-
resented in a selected index. The cost associated with an index is calculated based on
the number of estimated matches derived from the histogram and the dimensionality
of the index. Increasing the size of the multi-dimensional histogram enhances the
accuracy of the estimate at the cost of abstract representation size.
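The following is a minimal sketch of this conservative match estimate, assuming the histogram H is a mapping from per-attribute bucket-number tuples to row counts (built as described in Section 3.3.3) and that a range query has already been mapped to the set of bucket numbers it overlaps on each attribute of the candidate index. Buckets the query only partially overlaps are counted in full, which keeps the estimate conservative; all names here are illustrative.

def estimate_matches(hist, attr_order, query_buckets):
    """hist: {bucket-number tuple: row count}; attr_order: attributes in tuple order;
    query_buckets: {attribute: set of bucket numbers the query range overlaps}."""
    # Query attributes not covered by the candidate index are simply ignored,
    # since the index cannot prune on them.
    positions = {a: attr_order.index(a) for a in query_buckets if a in attr_order}
    total = 0
    for key, count in hist.items():
        if all(key[pos] in query_buckets[a] for a, pos in positions.items()):
            total += count       # every row in this bucket might match the query
    return total

# The dissertation combines this estimated match count with the dimensionality
# of the candidate index to obtain the cost figure used during selection; the
# exact combination is not reproduced here.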
While maintaining the original query information for later use to determine esti-
mated query cost, we apply one abstraction to the query workload to convert each
query into the set of attributes referenced in the query. We perform frequent itemset
mining over this abstraction and only consider those sets of attributes that meet a
certain support to be potential indexes. By varying the support, we affect the speed
of index selection and the ratio of queries that are covered by potential indexes. We
further prune the analysis space using association rule mining by eliminating those
subsets above a certain confidence threshold. Lowering the confidence threshold im-
proves analysis time by eliminating some lower dimensional indexes from considera-
tion but can result in recommending indexes that cover a strict superset of the queried
attributes.
Our technique differs from existing tools in the method we use to determine the
potential set of indexes to evaluate and in the quantization-based technique we use
to estimate query costs. All of the commercial index wizards work in design time.
The Database Administrator (DBA) has to decide when to run this wizard and over
which workload. The assumption is that the workload is going to remain static over
time and in case it does change, the DBA would collect the new workload and run the
wizard again. The flexibility afforded by the abstract representation we use allows
it to be used either for infrequent index selection considering a broader analysis space or for frequent, online index selection.
In the following two sections we present our proposed solution for index selection
which is used for static index selection and as a building block for the online index
selection.
Figure 3.1: Index Selection Flowchart
3.3.3 Proposed Solution for Index Selection
The goal of the index selection is to minimize the cost of the queries in the workload
given certain constraints. Given a query workload, a dataset, the indexing constraints,
and several analysis parameters, our framework produces a set of suggested indexes
as an output. Figure 3.1 shows a flow diagram of the index selection framework.
Table 3.1 provides a list of the notations used in the descriptions.
We identify three major components in the index selection framework: the ini-
tialization of the abstract representations, the query cost computation, and the index
selection loop. In the following subsections we describe these components and the
dataflow between them.
Initialize Abstract Representations
The initialization step uses a query workload and the dataset to produce a set of
Potential Indexes P, a Query Set Q, and a Multi-dimensional Histogram H, according
to the support, confidence, and histogram size specified by the user. The description
of the outputs and how they are generated is given below.
• Potential Index Set P
Symbol           Description
P                Potential Index Set: the set of attribute sets under consideration as suggested indexes
Q                Query Set: a representation of the query workload; for each query, it holds the attribute sets in P that intersect with the query attributes, the query ranges, and the estimated query costs
H                Multi-Dimensional Histogram, used as a workload-optimized abstract representation of the data set
S                Suggested Indexes: the set of attribute sets currently selected as recommended indexes
i                attribute set currently under analysis
support          the minimum ratio of occurrence of an attribute set in a query workload to be included in P
confidence       the maximum ratio of occurrence of an attribute subset to the occurrence of a set before the subset is pruned from P
histogram size   the number of bits used to represent a quantized data object

Table 3.1: Index Selection Notation List
The potential index set P is a collection of attribute sets that could be beneficial
as an index for the queries in the input query workload. This set is computed using
traditional data mining techniques. Considering the attributes involved in each query
from the input query workload to be a single transaction, P consists of the sets of
attributes that occur together in a query at a ratio greater than the input support.
Formally, the support of a set of attributes A is defined as:

S_A = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & \text{if } A \subseteq Q_i \\ 0 & \text{otherwise} \end{cases}

where Q_i is the set of attributes in the i-th query and n is the number of queries.
For instance, if the input support is 10%, and attributes 1 and 2 are queried
together in greater than 10 percent of the queries, then a representation of the set
of attributes {1,2} will be included as a potential index. Note that any subset of an attribute set that meets the support requirement necessarily meets the support as well, so all such subsets are also included as potential indexes (in the example above, both the sets {1} and {2} will be included). As
the input support is decreased, the number of potential indexes increases. Note that
our particular system is built independently from a query optimizer, but the sets of
attributes appearing in the predicates from a query optimizer log could just as easily
be substituted for the query workload in this step.
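A compact Python sketch of this step, not drawn from the actual implementation, treats each query as the set of attributes it references and enumerates frequent attribute sets Apriori-style; support follows the definition above, and max_width is an illustrative cap on candidate index width.

def support(attr_set, workload):
    return sum(1 for q in workload if attr_set <= q) / len(workload)

def potential_indexes(workload, min_support, max_width=4):
    frequent = []
    attrs = sorted({a for q in workload for a in q})
    level = [frozenset([a]) for a in attrs
             if support(frozenset([a]), workload) >= min_support]
    while level and len(next(iter(level))) <= max_width:
        frequent.extend(level)
        # Candidate (k+1)-sets formed from unions of frequent k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c, workload) >= min_support]
    return frequent

# Example: attributes 1 and 2 are queried together often enough to form {1, 2}.
workload = [frozenset({1, 2}), frozenset({1, 2, 5}), frozenset({3}), frozenset({1, 2})]
print(sorted(potential_indexes(workload, min_support=0.5),
             key=lambda s: (len(s), sorted(s))))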
If a set occurs nearly as often as one of its subsets, an index built over the subset
will likely not provide much benefit over the query workload if an index is built over
the attributes in the set. Such an index will only be more effective in pruning data
space for those queries that involve only the subset’s attributes. In order to enhance
analysis speed with limited effect on accuracy, the input confidence is used to prune
analysis space. Confidence is the ratio of a set’s occurrence to the occurrence of a
subset.
While data mining the frequent attribute sets in the query workload in determin-
ing P, we also maintain the association rules for disjoint subsets and compute the
confidence of these association rules. The confidence of an association rule is defined
as the ratio that the antecedent (Left Hand Side of the rule) and consequent (Right
Hand Side of the rule) appear together in a query given that the antecedent appears
in the query. Formally, confidence of an association rule {set of attributes A}→{set
of attributes B}, where A and B are disjoint, is defined as:
C_{A \rightarrow B} = \frac{\sum_{i=1}^{n} \begin{cases} 1 & \text{if } (A \cup B) \subseteq Q_i \\ 0 & \text{otherwise} \end{cases}}{\sum_{i=1}^{n} \begin{cases} 1 & \text{if } A \subseteq Q_i \\ 0 & \text{otherwise} \end{cases}}

where Q_i is the set of attributes in the i-th query and n is the number of queries.
In our example, if every time attribute 1 appears, attribute 2 also appears then
the confidence of {1}→{2} = 1.0. If attribute 2 appears without attribute 1 as many
times as it appears with attribute 1, then the confidence {2}→{1} = 0.5. If we have
set the confidence input to 0.6, then we will prune the attribute set {1} from P, but
we will keep attribute set {2}.
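The confidence-based pruning can be sketched directly from these definitions. The toy workload below is constructed to mirror the worked example in the text, and the 0.6 threshold is the one used there; the code is an illustration, not the implementation used in the system.

    workload = [{1, 2}, {1, 2}, {2}, {2}]   # attribute sets per query

    def confidence(a, b, workload):
        """Confidence of the association rule a -> b over the query workload."""
        both = sum(1 for q in workload if (a | b) <= q)
        only_a = sum(1 for q in workload if a <= q)
        return both / only_a if only_a else 0.0

    a, b = frozenset({1}), frozenset({2})
    min_confidence = 0.6
    print(confidence(a, b, workload))   # 1.0: attribute 2 appears whenever 1 does
    print(confidence(b, a, workload))   # 0.5: attribute 2 appears alone half the time
    # {1} is pruned from P (confidence 1.0 >= 0.6); {2} is kept (0.5 < 0.6).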
We can also set the confidence level based on attribute set cardinality. Since
the cost of including extra attributes that are not useful for pruning increases with
increased indexed dimensionality, we want to be more conservative with respect to
pruning attribute subsets. The confidence can contain more than one value depending
on set cardinality.
While the apriori algorithm was appropriate for the relatively low attribute query
sets in our domain, a more efficient algorithm such as the FP-Tree [30] could be applied
if the attribute sets associated with queries are too large for the apriori technique to
be efficient. While it is desirable to avoid examining a high-dimensional index set
as a potential index, another possible solution in the case where a large number of
attributes are frequent together would be to partition a large closed frequent itemset
into disjoint subsets for further examination. Techniques such as CLOSET [44] could
be used to arrive at the initial closed frequent itemsets.
• Query Set Q
The query set Q, the abstract representation of the query workload, is initialized by associating the potential indexes that could be beneficial for each query with
that query. These are the indexes in the potential index set P that share at least
one common attribute with the query. At the end of this step, each query has an
identified set of possible indexes for that query.
• Multi-Dimensional Histogram H
An abstract representation of the data set is created in order to estimate the query
cost associated with using each query’s possible indexes to answer that query. This
representation is in the form of a multi-dimensional histogram H. A single bucket
represents a unique bit representation across all the attributes represented in the
histogram. The input histogram size dictates the number of bits used to represent
each unique bucket in the histogram. These bits are designated to represent only
the single attributes that met the input support in the input query workload. If a
single attribute does not meet the support, then it can not be part of an attribute
set appearing in P. There is no reason to sacrifice data representation resolution for
attributes that will not be evaluated. The number of bits that each of the represented
attributes gets is proportional to the log of that attribute’s support. This gives more
resolution to those attributes that occur more frequently in the query workload.
Data for an attribute that has been assigned b bits is divided into 2^b buckets.
In order to handle data sets with uneven data distribution, we define the ranges of
each bucket so that each bucket contains roughly the same number of points. The
histogram is built by converting each record in the data set to its representation
in bucket numbers. As we process data rows, we only aggregate the count of rows
with each unique bucket representation because we are just interested in estimating
query cost. Note that the multi-dimensional histogram is based on a scalar quantizer
designed on data and access patterns, as opposed to just data in the traditional case.
A higher accuracy in representation is achieved by using more bits to quantize the
attributes that are more frequently queried.
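A compressed sketch of this construction is given below: bits are allocated in proportion to the log of each frequent attribute's support, bucket boundaries are chosen equi-depth, and rows are aggregated by their encoded bucket tuple. The bit-budget rounding, the helper names, and the sample data are simplifications of my own, not the exact procedure used in the implementation.

    import math
    from collections import Counter

    def assign_bits(supports, total_bits):
        """Give each frequent attribute a bit budget proportional to log(support)."""
        weights = {a: math.log(1 + s) for a, s in supports.items()}
        scale = total_bits / sum(weights.values())
        return {a: max(1, int(round(w * scale))) for a, w in weights.items()}

    def equi_depth_edges(values, n_buckets):
        """Bucket boundaries chosen so each bucket holds roughly the same count."""
        ordered = sorted(values)
        return [ordered[len(ordered) * k // n_buckets] for k in range(1, n_buckets)]

    def bucket_of(v, edges):
        return sum(v >= e for e in edges)

    def build_histogram(rows, supports, total_bits):
        bits = assign_bits(supports, total_bits)
        edges = {a: equi_depth_edges([r[a] for r in rows], 2 ** b)
                 for a, b in bits.items()}
        hist = Counter(tuple(bucket_of(r[a], edges[a]) for a in sorted(bits))
                       for r in rows)
        return hist, bits    # hist maps each occurring bucket tuple to its row count

    rows = [{1: 2, 2: 5, 3: 5}, {1: 4, 2: 8, 3: 3}, {1: 6, 2: 7, 3: 1}, {1: 8, 2: 1, 3: 4}]
    hist, bits = build_histogram(rows, supports={1: 0.6, 2: 0.3, 3: 0.3}, total_bits=4)
    print(bits, dict(hist))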
For illustration, Table 3.2 shows a simple multi-dimensional histogram example.
This histogram covers 3 attributes and uses 1 bit to quantize attributes 2 and 3, and
2 bits to quantize attribute 1, assuming it is queried more frequently than the other
attributes. In this example, for attributes 2 and 3 values from 1-5 quantize to 0, and
values from 6-10 quantize to 1. For attribute 1, values 1 and 2 quantize to 00, 3 and
4 quantize to 01, 5-7 quantize to 10, and 8 and 9 quantize to 11. The .’s in the ’value’
column denote attribute boundaries (i.e. attribute 1 has 2 bits assigned to it).
Note that we do not maintain any entries in the histogram for bit representa-
tions that have no occurrences. So we can not have more histogram entries than
records and will not suffer from exponentially increasing the number of potential
multi-dimensional histogram buckets for high-dimensional histograms.
Sample Dataset                        Histogram
A1   A2   A3   Encoding               Value     Count
2    5    5    0000                   00.0.0    2
4    8    3    0110                   00.0.1    1
1    4    3    0000                   01.0.0    1
6    7    1    1010                   01.1.0    1
3    2    2    0100                   01.1.1    1
2    2    6    0001                   10.1.0    2
5    6    5    1010                   11.0.0    1
8    1    4    1100                   11.0.1    1
3    8    7    0111
9    3    8    1101

Table 3.2: Histogram Example
Query Cost Calculation
Once generated, the abstract representations of the query set Q and the multi-
dimensional histogram H are used to estimate the cost of answering each query using
all possible indexes for the query. For a given query-index pair, we aggregate the
number of matches we find in the multi-dimensional histogram looking only at the
attributes in the query that also occur in the index (bits associated with other at-
tributes are considered to be don’t cares in the query matching logic). To estimate
the query cost, we then apply a cost function based on the number of matches we
obtain using the index and the dimensionality of the index. At the end of this step,
our abstract query set representation has estimated costs for each index that could
improve the query cost. For each query in the query set representation, we also keep
a current cost field, which we initialize to the cost of performing the query using
sequential scan. At this point, we also initialize an empty set of suggested indexes S.
• Cost Function
A cost function is used to estimate the cost associated with using a certain index
for a query. The cost function can be varied to accurately reflect a cost model for the
database system. For example, one could apply a cost function that amortized the
cost of loading an index over a certain number of queries or use a function tailored
to the type of index that is used. Many cost functions have been proposed over the
years. For an R-Tree, which is the index type used for this work, the expected number
of data page accesses is estimated in [22] by:
A_{nn,mm,FBF} = \left( \sqrt[d]{\frac{1}{C_{eff}}} + 1 \right)^d
where d is the dimensionality of the dataset and C_{eff} is the number of data objects per disk page. However, this formula assumes the number of points N approaches
infinity and does not consider the effects of high dimensionality or correlations.
A more recently proposed cost model is given in [8] where the expected number
of page accesses is determined as:

A_{r,em,ui}(r) = \left( 2r \cdot \sqrt[d]{\frac{N}{C_{eff}}} + 1 - \frac{1}{C_{eff}} \right)^d

where r is the radius of the range query, d is the dataset dimensionality, N is the number of data objects, and C_{eff} is the capacity of a data page.
While these published cost estimates can be effective for estimating the number of page accesses associated with using a multi-dimensional index structure under certain conditions, they have certain characteristics that make them less than ideal for the given situation. Each of the cost estimate formulas requires a range radius. Therefore
query in one or more of the query dimensions. These cost estimates also assume that
data distribution is independent between attributes, and that the data is uniformly
distributed throughout the data space.
In order to overcome these limitations, we apply a cost estimate that is based on
the actual matches that occur over the multi-dimension histogram over the attributes
that form a potential index. The cost model for R-trees we use in this work is given by

d^{(d/2)} \cdot m

where d is the dimensionality of the index and m is the number of matches returned for query matching attributes in the multi-dimensional histogram. Using actual matches
eliminates the need for a range radius. It also ties the cost estimate to the actual
data characteristics (i.e. incorporates both data correlation between attributes and
data distribution, while the published models will produce results that are dependent
only on the range radius for a given index structure). The cost estimate provided
is conservative in that it will provide a result that is at least as great as the actual
number of matches in the database.
By evaluating the number of matches over the set of attributes that match the
query, the multi-dimensional subspace pruning that can be achieved using different
index possibilities is taken into account. There is additional cost associated with
higher dimensionality indexes due to the greater number of overlaps of the hyperspaces
within the index structure, and additional cost of traversing the higher dimension
structure. A penalty is imposed on a potential index by the dimensionality term.
Given equal ability to prune the space, a lower dimensional index will translate into
a lower cost.
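The following sketch estimates the cost of answering one query with one candidate index by counting histogram matches over the attributes shared by the query and the index (all other positions are treated as don't cares) and applying the d^(d/2) * m penalty above. The histogram representation (a bucket tuple per row, aggregated to a count) and the helper names are illustrative assumptions, not the system's actual data structures.

    def estimate_cost(query, index_attrs, histogram, attr_order):
        """query: {attr: bucket}; index_attrs: attrs covered by the candidate index;
        histogram: {bucket tuple ordered by attr_order: row count}."""
        shared = [a for a in attr_order if a in query and a in index_attrs]
        if not shared:
            return float("inf")              # the index cannot prune this query
        pos = {a: attr_order.index(a) for a in shared}
        m = sum(count for key, count in histogram.items()
                if all(key[pos[a]] == query[a] for a in shared))
        d = len(index_attrs)
        return d ** (d / 2) * m              # dimensionality penalty times matches

    # The histogram of Table 3.2, written as bucket tuples over attributes (1, 2, 3).
    attr_order = (1, 2, 3)
    hist = {(0, 0, 0): 2, (0, 0, 1): 1, (1, 0, 0): 1, (1, 1, 0): 1, (1, 1, 1): 1,
            (2, 1, 0): 2, (3, 0, 0): 1, (3, 0, 1): 1}
    query = {1: 0, 2: 0}                     # buckets the query constrains
    print(estimate_cost(query, {1, 2}, hist, attr_order))   # 2-D index over {1, 2}
    print(estimate_cost(query, {1}, hist, attr_order))      # 1-D index over {1}

With equal pruning (both indexes return three histogram matches here), the lower-dimensional index receives the lower estimated cost, which is the behavior described above.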
The cost function could be more complicated in order to more accurately model
query costs. It could model query cost with greater accuracy, for example by crediting
complete attribute coverage for coverage queries. It could also reflect the appropriate
index structures used in the database system, such as B+-trees. We used this partic-
ular cost model because the index type was appropriate for our data and query sets,
and we assumed that we would retrieve data from disk for all query matches.
Index Selection Loop
After initializing the index selection data structures and updating estimated query
costs for each potentially useful index for a query, we use a greedy algorithm that takes
into account the indexes already selected to iteratively select indexes that would be
appropriate for the given query workload and data set. For each index in the potential
index set P, we traverse the queries in query set Q that could be improved by that
index and accumulate the improvement associated with using that index for that
query. The improvement for a given query-index pair is the difference between the
cost for using the index and the query’s current cost. If the index does not provide
any positive benefit for the query, no improvement is accumulated. The potential
index i that yields the highest improvement over the query set Q is considered to
be the best index. Index i is removed from potential index set P and is added to
suggested index set S. For the queries that benefit from i, the current query cost is
replaced by the improved cost.
After each i is selected, a check is made to determine if the index selection loop
should continue. The input indexing constraint provides one of the loop stop criteria.
The indexing constraint could be any constraint such as the number of indexes, total
index size, or total number of dimensions indexed. If no potential index yields further
improvement, or the indexing constraints have been met, then the loop exits. The
set of suggested indexes S contains the results of the index selection algorithm.
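The selection loop itself can be condensed to a few lines once the per-query, per-index cost estimates exist. In this sketch the indexing constraint is simply a maximum number of indexes, and the data structures are plain dictionaries rather than the pruned abstract representations described above; names and sample values are illustrative.

    def select_indexes(costs, seq_scan_cost, max_indexes):
        """costs: {query id: {candidate index (frozenset of attrs): estimated cost}};
        seq_scan_cost: {query id: cost of answering the query by sequential scan}."""
        current = dict(seq_scan_cost)            # query id -> current best cost
        candidates = {idx for per_query in costs.values() for idx in per_query}
        suggested = []
        while candidates and len(suggested) < max_indexes:
            def benefit(idx):
                return sum(max(current[q] - c[idx], 0)
                           for q, c in costs.items() if idx in c)
            best = max(candidates, key=benefit)
            if benefit(best) <= 0:
                break                            # no remaining candidate helps any query
            suggested.append(best)
            candidates.remove(best)
            for q, c in costs.items():           # lock in the improved costs
                if best in c:
                    current[q] = min(current[q], c[best])
        return suggested

    costs = {"q1": {frozenset({1, 2}): 5, frozenset({1}): 40},
             "q2": {frozenset({1}): 30}}
    print(select_indexes(costs, {"q1": 100, "q2": 100}, max_indexes=2))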
At the end of a loop iteration, when possible, we prune the complexity of the
abstract representations in order to make the analysis more efficient. This includes
actions such as eliminating potential indexes that do not provide better cost estimates
than the current cost for any query and pruning from consideration those queries
whose best index is already a member of the set of suggested indexes. The overall
speed of this algorithm is coupled with the number of potential indexes analyzed,
so the analysis time can be reduced by increasing the support or decreasing the
confidence.
Different strategies can be used in selecting a best index. The strategy provided
assumes an indexing constraint based on the number of indexes and therefore uses
the total benefit derived from the index as the measure of index ’goodness’. If the
indexing constraint is based on total index size, then benefit per index size unit
may be a more appropriate measure. However, this may result in recommending a
lower-dimensioned index and later in the algorithm a higher-dimensioned index that
always performs better. The recommendation set can be pruned in order to avoid
recommending an index that is non-useful in the context of the complete solution.
3.3.4 Proposed Solution for Online Index Selection
The online index selection is motivated by the fact that query patterns can change
over time. By monitoring the query workload and detecting when there is a change in
the query pattern that generated the existing set of indexes, we are able to maintain
good performance as query patterns evolve. In our approach, we use control feedback
to monitor the performance of the current set of indexes for incoming queries and
determine when adjustments should be made to the index set. In a typical control
feedback system, the output of a system is monitored and based on some function
involving the input and output, the input to the system is readjusted through a
control feedback loop. Our situation is analogous but more complex than the typical
electrical circuit control feedback system in several ways:
1. Our system input is a set of indexes and a set of incoming queries rather than
a simple input, such as an electrical signal.
2. The system output must be some parameter that we can measure and use to
make decisions about changing the input. Query performance is the obvious
parameter to monitor. However, because lower query performance could be related to factors other than the index set, our decision making control
function must necessarily be more complex than a basic control system.
3. We do not have a predictable function to relate system input and output because
of the non-determinism associated with new incoming queries. For example, we
may have a set of attributes that appears in queries frequently enough that our
system indicates that it is beneficial to create an index over those attributes,
but there is no guarantee that those attributes will ever be queried again.
Control feedback systems can fail to be effective with respect to response time.
The control system can be too slow to respond to changes, or it can respond too
quickly. If the system is too slow, then it fails to cause the output to change based
on input changes in a timely manner. If it responds too quickly, then the output
overshoots the target and oscillates around the desired output before reaching it.
Both situations are undesirable and should be designed out of the system.
Figure 3.2 represents our implementation of dynamic index selection. Our system
input is a set of indexes and a set of incoming queries. Our system simulates and
estimates costs for the execution of incoming queries. System output is the ratio of
potential system performance to actual system performance in terms of database page
accesses to answer the most recent queries. We implement two control feedback loops.
One is for fine grain control and is used to recommend minor, inexpensive changes
to the index set. The other loop is for coarse control and is used to avoid very
poor system performance by recommending major index set changes. Each control
feedback loop has decision logic associated with it.
Figure 3.2: Dynamic Index Analysis Framework
System Input
The system input is made up of new incoming queries and the current set of indexes
I, which is initialized to be the suggested indexes S from the output of the initial
index selection algorithm. For clarity, a notation list for the online index selection is
included as Table 3.3.

Symbol    Description
I         Current set of attribute sets used as indexes
I_new     Hypothetical set of attribute sets used as indexes
w         window size
W         abstract representation of the last w queries
q         the current query under analysis
P         Current Potential Indexes, the set of attribute sets in consideration to be indexes before q arrives
P_new     New Potential Indexes, the set of attribute sets in consideration to be indexes after q arrives
i_q       the attribute set estimated to be the best index for query q

Table 3.3: Notation List for Online Index Selection
System
The system simulates query execution over a number of incoming queries. The
abstract representation of the last w queries is stored as W, where w is an adjustable window size parameter. W is used to estimate performance of a hypothetical set of indexes I_new against the current index set I. This representation is similar to the one
kept for query set Q in the static index selection. In this case, when a new query
q arrives, we determine which of the current indexes in I most efficiently answers
this query and replace the oldest query in W with the abstract representation of q.
We also incrementally compute the attribute sets that meet the input support and
confidence over the last w queries. This information is used in the control feedback
loop decision logic. The system also keeps track of the current potential indexes P,
and the current multi-dimensional histogram H.
System Output
In order to monitor the performance of the system, we compare the query per-
formance using the current set of indexes I to the performance using a hypothetical
set of indexes I_new. The query performance using I is the summation of the costs of queries using the best index from I for the given query. Consider the possible new indexes P_new to be the set of attribute sets that currently meet the input support and confidence over the last w queries. The hypothetical cost is calculated differently based on the comparison of P and P_new, and the identified best index i_q from P or P_new for the new incoming query:
1. P = P_new and i_q is in I. In this case we bypass the control loops since we could do no better for the system by changing possible indexes.

2. P = P_new and i_q is not in I. We recompute a new set of suggested indexes I_new over the last w queries. The hypothetical cost is the cost over the last w queries using I_new.

3. P ≠ P_new and i_q is in I. In this case we bypass the control loops since we could do no better for the system by changing possible indexes.

4. P ≠ P_new and i_q is not in I. We traverse the last w queries and determine those queries that could benefit from using a new index from P_new. We compute the hypothetical cost of these queries to be the real number of matches from the database. Hypothetical cost for other queries is the same as the real cost.
The ratio of the hypothetical cost, which indicates potential performance, to the
actual performance is used in the control loop decision logic.
Fine Grain Control Loop
The fine grain control loop is used to recommend low cost, minor changes to the index set. This loop is entered in case 2 as described above when the ratio of hypothetical performance to actual performance is below some input minor change threshold. Then the indexes are changed to I_new, and appropriate changes are made to update
the system data structures. Increasing the input minor change threshold causes the
frequency of minor changes to also increase.
Coarse Control Loop
The coarse control loop is used to recommend more costly changes, but ones with greater impact on future performance of the index set. This loop is entered in case 4 as described above when the ratio of hypothetical performance to actual performance is below some input major change threshold. Then the static index selection is performed over the last w queries, abstract representations are recomputed, and a new set of suggested indexes I_new is generated. Appropriate changes are made to update the system data structures to the new situation. Increasing the input major change threshold increases the frequency of major changes.
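The decision logic of the two loops reduces to a small amount of control code. The sketch below follows the four cases and the two thresholds described above, with the expensive steps (recomputing I_new, rerunning the static selection) left as comments; function and argument names are placeholders of my own.

    def feedback_decision(P, P_new, best_index, I, hypothetical_cost, actual_cost,
                          minor_threshold=0.95, major_threshold=0.90):
        """Return which kind of index set change, if any, the feedback loops request."""
        if best_index in I:
            return "no-change"                  # cases 1 and 3: bypass both loops
        ratio = hypothetical_cost / actual_cost
        if P == P_new:                          # case 2: fine grain loop
            # minor change: switch to the recomputed suggested indexes I_new
            return "minor-change" if ratio < minor_threshold else "no-change"
        # case 4: coarse loop; major change: rerun static selection over the window
        return "major-change" if ratio < major_threshold else "no-change"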
3.3.5 System Enhancements
In the following subsections, we present two system enhancements that provide
further robustness and scalability to our framework.
Self Stabilization
Control feedback systems can be either too slow or too responsive in reacting
to input changes. In our application, a system that is slow to respond results in
recommending useful indexes long after they first could have a positive effect. It
could also fail to recommend potentially useful indexes if thresholds are set so that
the system is insensitive to change. A system that is too responsive can result in
system instability, where the system continuously adds and drops indexes.
The system performance and rate of index change can be monitored and used
in order to tune the control feedback system itself. If the actual number of query
results is much lower than the estimated cost using the recommended index set over
a window of queries, this indicates a non-responsive system. When this condition is
detected, the system can be made more responsive by cutting the window size. This
increases the probability that P_new will be different from P, and we can re-perform the analysis applying a reduced support. This gives us a more complete P with respect
to answering a greater portion of queries.
If the frequency of index change is too high with little or no improvement in query
performance, an oversensitive system or unstable query pattern is indicated. We can
reduce the sensitivity by increasing the window size or increasing the support level
during recomputation of new recommended indexes.
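As a sketch, these self-stabilization heuristics amount to adjusting the window size and support in opposite directions depending on which failure mode is observed; the trigger conditions and the halving and doubling factors below are illustrative choices of my own, not tuned values from the experiments.

    def stabilize(window_size, support, unresponsive, oscillating):
        """Tune the feedback system itself based on observed behavior."""
        if unresponsive:        # estimated costs far above the true result counts
            window_size = max(10, window_size // 2)
            support = max(0.01, support / 2)
        elif oscillating:       # frequent index changes with little or no gain
            window_size = window_size * 2
            support = min(0.5, support * 2)
        return window_size, support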
Partitioning the Algorithm
The system can also be applied to environments where the database is parti-
tioned horizontally or vertically at different locations. A solution at one extreme
is to maintain a centralized index recommendation system. This would maintain
the abstract representation and collect global query patterns over the entire system.
Recommended indexes would be determined based on global trends. This approach
would allow for the creation of indexes that maximize pruning over the global set of
dimensions. However, it would not optimize query performance at the site level. A
single set of indexes would be recommended for the entire system.
At the other extreme would be to perform the index recommendation at each
site. In this case, a different abstract representation would be built based on the
data at each specific site given requests to the data at that site. Indexes would
be recommended based on the query traffic to the data at that site. This allows
tight control of the indexes to reflect the data at each site. However, index-based
pruning based on attributes and data at multiple locations would not be possible.
This approach also requires traffic to each site that contains potentially matching
data.
A hybrid approach can provide the benefits of index recommendation based on
global data and query patterns while optimizing the index set at different requesting
locations. A global set of indexes can be centrally recommended based on a global
abstract representation of the data set with low support thresholds (the centralized
location will store a larger set of globally good indexes). The query patterns emanat-
ing from each site can be mined in order to find which of those global indexes would
be appropriate to maintain at the local site, eliminating some traffic.
3.4 Empirical Analysis
3.4.1 Experimental Setup
Data Sets
Several data sets were used during the performance of experiments. The variation
in data sets is intended to show the applicability of our algorithm to a wide range of
data sets and to measure the effect that data correlation has on results. Data sets
used include:
• random - a set of 100,000 records consisting of 100 dimensions of uniformly
distributed integers between 0 and 999. The data is not correlated.
• stocks - a set of 6500 records consisting of 360 dimensions of daily stock market
prices. This data is extremely correlated.
• mlb - a set of 33619 records of major league pitching statistics from between the
years of 1900 and 2004 consisting of 29 dimensions of data. Some dimensions
are correlated with each other, while others are not at all correlated.
Analysis Parameters
The effect of varying several analysis input parameters including support, multi-
dimensional histogram size, and online indexing control feedback decision thresholds
was analyzed. Unless otherwise specified, the confidence parameter for the experi-
ments is 1.0.
Query Workloads
It is desirable to explore the general behavior of database interactions without
inadvertently capturing the coupling associated with using a specific query history
on the same database. Therefore, query workload files were generated by merging
synthetic query histories and query histories from real-world applications with dif-
ferent data sets. Histories and data were merged by taking a random record from
a data set and the numerical identifier of the attributes involved in the synthetic or
historical query in order to generate a point query. So, if a historical query involved
the 3rd and 5th attribute, and the n-th record was randomly selected from the data set, a SELECT type query is generated from the data set where the 3rd attribute is equal to the value of the 3rd attribute of the n-th record and the 5th attribute is equal to the value of the 5th attribute of the n-th record. This gives a query workload that
reflects the attribute correlations within queries, and has a variable query selectivity.
Unless otherwise stated each query in experiments is a point query with respect to
the attributes covered by the query. Attributes that are not covered by the query can
be any value and still match the query.
These query histories form the basis for generating the query workloads used in
our experiments:
1. synthetic - 500 randomly generated queries. The distribution of the queries over
the first 200 queries is 20% involve attributes {1,2,3,4} together, 20% {5,6,7},
20%, {8,9}, and the remaining queries involve between 1 and 5 attributes that
could be any attribute. Over the last 300 queries, the distribution shifts to
20% covering attributes {11,12,13,14}, 20% {15,16,17}, 20% {18,19}, and the
remaining 40% are between 1 to 5 attributes that could be any attribute.
2. clinical - 659 queries executed from a clinical application. The query distribution
file has 64 distinct attributes.
3. hr - 35,860 queries executed from a human resources application. The query
distribution file has 54 attributes. Due to the size of this query set, some initial
portion of the queries are used for some experiments.
3.4.2 Experimental Results
Index Selection with Relaxed Constraints
Figure 3.3 shows how accurately the proposed static index selection technique can
perform with relaxed analysis constraints. For this scenario, a low support value,
0.01, is used. There are no constraints placed on the number of selected indexes.
And 256 bits are used for the multi-dimension histogram. Figure 3.3 shows the cost
of performing a sequential scan over 100 queries using the indicated data sets, the
estimated cost of using recommended indexes, and the true number of answers for
the set of queries.

Figure 3.3: Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin, Naive, and the Proposed Index Selection Technique using Relaxed Constraints

For comparative purposes, costs for indexes proposed by AutoAd-
min and a naive algorithm are also provided. The naive algorithm uses indexes for
the first n most frequent itemsets in the query patterns, where n is the number of
indexes suggested by the proposed algorithm. Note that the cost for sequential scan
is discounted to account for the fact that sequential access is significantly cheaper
than random access. A factor of 10 is applied as the ratio of random access cost
to sequential access cost. While the actual ratio of a given environment is variable,
regardless of any reasonable factor used, the graph will show the cost of using our
indexes much closer to the ideal number of page accesses than the cost of sequential
scan.
Table 3.4 compares the proposed index selection algorithm with relaxed con-
straints against SQL Server’s index selection tool, AutoAdmin [15], using the data
sets and query workloads indicated.
Data Set/Workload   Tool        Analysis Time(s)   % Queries Improved   Number of Indexes
stock/clinical      AutoAdmin   450                100                  23
stock/clinical      Proposed    110                100                  18
stock/hr            AutoAdmin   338                100                  20
stock/hr            Proposed    160                100                  16
mlb/clinical        AutoAdmin   15                 0                    0
mlb/clinical        Proposed    522                87                   16

Table 3.4: Comparison of proposed index selection algorithm with AutoAdmin in terms of analysis time and % queries improved
For the stock dataset using both the clinical and hr workloads, both algorithms
suggest indexes which will improve all of the queries. Since the selectivity of these
queries is low (the queries return a low number of matches using any index that
contains a queried attribute), the amount of the query improvement will be very
similar using either recommended index set. The proposed algorithm executes in
less time and generates indexes which are more representative of the query patterns
in the workload and allow for greater subspace pruning. For example, the most
common query in the clinical workload is to query over both attributes 11 and 44
together. SQL Server’s index recommendations are all single-dimension indexes for
all attributes that appear in the workload. However, our first index recommendation
is a 2-dimension index built over attributes 11 and 44.
For the mlb dataset, SQL Server quickly recommended no indexes. Our index
selection takes longer in this instance, but finds indexes that improve 87% of the
queries. These are all the queries that have selectivities low enough that an index
can be beneficial. It should be noted that the indexes selected by AutoAdmin are
appropriate based on the cost models for the underlying index structure used in
SQL Server. Since the underlying index type for SQL Server is B+-trees, which do
not index multiple dimensions concurrently, the single dimension indexes that are
recommended are the least costly possible indexes to use for the query set. For this
particular experiment, a single dimensional index is not able to prune the subspace
enough to make the application of that index worthwhile compared to sequential scan.
Effect of Support and Confidence
Table 3.5 presents results on analysis complexity and expected query improvement
as support and confidence are varied. The results are shown for the stock data
set over the clinical query workload. Results show the total number of query-index
pairs analyzed over the query set in the index selection loop and the total estimated
improvement in query performance in terms of data objects accessed over the query
set as the index selection parameters vary. As confidence decreases, we maintain
fewer potential indexes in P and need to analyze fewer attribute sets per index. This
decreases analysis time but shows very little effect on overall performance.
Using a confidence level of 0% is equivalent to using the maximal itemsets of at-
tributes that meet support criteria as recommended indexes. For this example, the
strategy yields nearly identical estimated cost although only 34-44% of the query/in-
dex pairs need to be evaluated.
Baseline Online Indexing Results
A number of experiments were performed to demonstrate the effectiveness and
characteristics of the adaptive system. In each of the experiments that show the
performance of the adaptive system over a sequence of queries, an initial set of indexes
is recommended based on the first 100 queries given the stated analysis parameters.
              Query/Index Pairs              Total Improvement
Support (%)   Confidence: 100   50    0      Confidence: 100     50      0
2             229               229   90     57710               57710   57603
4             198               198   85     55018               55018   55001
6             185               185   73     46924               46924   46907
8             178               178   66     42533               42533   42516
10            170               170   58     37527               37527   37510

Table 3.5: Comparison of Analysis Complexity and Query Performance as support and confidence vary, stock dataset, clinical workload
New queries are evaluated, and depending on the parameters of the adaptive system,
changes are made to the index set. These index set changes are considered to take place before the next query. The estimated costs of the last 100 queries, given the indexes available at the time of each query, are accumulated and presented. This demonstrates
the evolving suitability of the current index set to incoming queries.
Figures 3.4 through 3.6 show a baseline comparison between using the query
pattern change detection and modifying the indexes and making no change to the
initial index set. A number of data set and query history combinations are examined.
The baseline parameters used are 5% support, 128 bits for each multi-dimension
histogram entry, a window size of 100, an indexing constraint of 10 indexes, a major
change threshold of 0.9 and a minor change threshold of 0.95.
Figure 3.4 shows the comparison for the random data set using the synthetic query
workload. At query 200, this workload drastically changes, and this is evident in the
query cost for the static system. The performance gets much worse and does not get
better for the static system. For the online system, the performance degrades a little
when the query patterns are changing and then improves again once the indexes have changed to match the new query patterns.

Figure 3.4: Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, synthetic Query Workload
The synthetic query workload was generated specifically to show the effect of a
changing query pattern. Figure 3.5 shows a comparison of performance for the random
data set using the clinical query workload. This real query workload also changes
substantially before reverting back to query patterns similar to those on which the
static index selection was performed. Performance for the static system degrades
substantially when the patterns change until the point that they change back. The
adaptive system is better able to absorb the change.
Figure 3.5: Baseline Comparative Cost of Adaptive versus Static Indexing, random
Dataset, clinical Query Workload
Figure 3.6: Baseline Comparative Cost of Adaptive versus Static Indexing, stock
Dataset, hr Query Workload
Figure 3.6 shows the comparison for the stock data set using the first 2000 queries
of the hr query workload. The adaptive system shows consistently lower cost than
the static system and in places significantly lower cost for this real query workload.
The effect of range queries was also explored. The random data set was examined
using the synthetic workload with ranges instead of points. As the ranges for a given attribute increased from a point value to 30% selectivity, the overall cost saved by the top 3 proposed indexes (which were the index sets {1,2,3,4}, {5,6,7}, and {8,9} for all ranges) decreased. This is due to the increase in the size of the answer
set. When the range for each attribute reached 30%, no index was recommended
because the result set became large enough that sequential scan was a more effective
solution.
Online Index Selection versus Parametric Changes
Changing online index selection parameters changes the adaptive index selection
in the following ways:
Support - decreasing the support level has the effect of increasing the potential
number of times the index set will be recalculated. It also makes the recalculation of
the recommended indexes themselves more costly and less appropriate for an online
setting. Figure 3.7 shows the effect of changing support on the online indexing results
for the random data set and synthetic query workload, while Figure 3.8 shows the
effect on the stock data set using the first 600 queries of hr query workload. These
graphs were generated using the baseline parameters and varying support levels. As
expected, lower support levels translate to lower initial costs because more of the ini-
tial query set are covered by some beneficial index. For the synthetic query workload,
when the patterns changed, but do not change again (e.g. queries 100-200 in Figure
3.7), lower support levels translated to better performance. However, for the hr query
workload, the 10% support threshold yields better performance than the 6% and 8%
thresholds for some query windows. The frequent itemsets change more frequently
for lower support levels, and these changes dictate when decisions occur. These runs
made decisions at points that turned out to be poor decisions, such as eliminating
an index that ended up being valuable. Little improvement in cost performance is
achieved between 4% support and 2% support. Over-sensitive control feedback can
degrade actual performance, independent of the extra overhead that oversensitive control causes.

Figure 3.7: Comparative Cost of Online Indexing as Support Changes, random Dataset, synthetic Query Workload
Online Indexing Control Feedback Decision Thresholds - increasing the
thresholds decreases the response time for effecting change when query pattern change
does occur. It also has the effect of increasing the number of times the costly reanalysis
occurs. In a real system, one would need to balance the cost of continued poor
performance against the cost of making an index set change. Figure 3.9 shows the
effect of varying the major change threshold in the coarse control loop for the random
data set and synthetic query workload.

Figure 3.8: Comparative Cost of Online Indexing as Support Changes, stock Dataset, hr Query Workload

Figure 3.10 shows the effect of changing
the major change threshold for the stock data set and hr query workload. These
graphs were generated using the baseline parameters (except that the random graph
uses a support of 10%), and only varying the major change threshold in the coarse
control loop. The major change threshold is varied between 0 and 1. Here, a value
of 0 translates to never making a change, and a value of 1 means making a change
whenever improvement is possible. The graphs show no real benefit once the major change threshold (and therefore the frequency of major changes) increases beyond 0.6.
Figure 3.9 shows that a value of 1 or 0.9 are best in terms of response time when a
change occurs, but a value of 0.3 shows the best performance once the query patterns
have stabilized. This indicates that this value should be carefully tuned based on the
expected frequency of query pattern changes.
Multi-dimensional Histogram Size - increasing the number of bits used to
represent a bucket in the multi-dimensional histogram improves analysis accuracy at
the cost of histogram generation time and space. Experiments demonstrated that
if the representation quality of the histogram was sufficient, very little benefit was
achieved through greater resolution. Figure 3.11 shows an example of the effect of
varying the multi-dimensional histogram size. Using a 32 bit size for each unique
histogram bucket yields higher costs than by using greater histogram resolution. One
reason for this is that artificially higher costs are calculated because of the lower
histogram resolution. As well, some beneficial index changes are not recommended
because of these overly conservative cost estimates.
Figure 3.9: Comparative Cost of Online Indexing as Major Change Threshold
Changes, random Dataset, synthetic Query Workload
Figure 3.10: Comparative Cost of Online Indexing as Major Change Threshold
Changes, stock Dataset, hr Query Workload
Figure 3.11: Comparative Cost of Online Indexing as Multi-Dimensional Histogram
Size Changes, random Dataset, synthetic Query Workload
3.5 Conclusions
A flexible technique for index selection is introduced that can be tuned to achieve
different levels of constraints and analysis complexity. A low constraint, more complex
analysis can lead to more accurate index selection over stable query patterns. A more
constrained, less complex analysis is more appropriate to adapt index selection to
account for evolving query patterns. The technique uses a generated multi-dimension
histogram to estimate cost, and as a result is not coupled to the idiosyncrasies of
a query optimizer, which may not be able to take advantage of knowledge about
correlations between attributes. Indexes are recommended in order to take advantage
of multi-dimensional subspace pruning when it is beneficial to do so.
These experiments have shown great opportunity for improved performance using
adaptive indexing over real query patterns. A control feedback technique is introduced
for measuring performance and indicating when the database system could benefit
from an index change. By changing the threshold parameters in the control feedback
loop, the system can be tuned to favor analysis time or pattern change recognition.
The foundation provided here will be used to explore this tradeoff and to develop
an improved utility for real-world applications. The proposed technique affords the
opportunity to adjust indexes to new query patterns. A limitation of the proposed
approach is that if index set changes are not responsive enough to query pattern
changes then the control feedback may not effect positive system changes. However,
this can be addressed by adjusting control sensitivity or by changing control sensitivity
over time as more knowledge is gathered about the query patterns.
From initial experimental results, it seems that the best application for this ap-
proach is to apply the more time consuming no-constraint analysis in order to de-
termine an initial index set and then apply a lightweight and low control sensitivity
analysis for the online query pattern change detection in order to avoid or make the
user aware of situations where the index set is not at all effective for the new incoming
queries.
In this thesis, the frequency of index set change has been affected through the use
of the online indexing control feedback thresholds. Alternatively, the frequency of
performance monitoring could be adjusted to achieve similar results and to appropri-
ately tune the sensitivity of the system. This monitoring frequency could be in terms
of either a number of incoming queries or elapsed time.
The proposed online change detection system utilized two control feedback loops in
order to differentiate between inexpensive and more time consuming system changes.
In practice, the fine grain control threshold was not triggered unless we contrived a
situation such that it would be triggered. The kinds of low cost changes that this
threshold would trigger are not the kinds of changes that make enough impact to
be that much better than the existing index set. This would change if the indexing
constraints were very low, and one potential index is now more valuable than a
suggested index.
Index creation is quite time-consuming. It is not feasible to perform real-time
analysis of incoming queries and generate new indexes when the patterns change.
Potential indexes could be generated prior to receiving new queries, and when indi-
cated by online analysis, moved to active status. This could mean moving an index
from local storage to main memory, or from remote storage to local storage depending
on the size of the index. Additionally, the analysis could prompt a server to create a
potential index as the analysis becomes aware that such an index is useful, and once
it is created, it could be called on by the local machine.
CHAPTER 4
Multi-Attribute Bitmap Indexes
The significant problems we have cannot be solved at the same level of
thinking with which we created them. - Albert Einstein (attributed).
An access structure built over multiple attributes affords the opportunity to elim-
inate more of a potential result set within a single traversal of the structure than a
single attribute structure. However, this added pruning power comes with a cost.
Query processing becomes more complex as the number of potential ways in which
a query can be executed increases. The multi-attribute access structure may not
be effective for general queries as a multi-attribute index can only prune results for
those attributes that match selection criteria of the query. This chapter presents
multi-attribute bitmap indexes. It describes the changes required when modifying a
traditionally single attribute access structure to handle multiple attributes in terms
of enhancing the query processing model and adding cost-based index selection.
4.1 Introduction
Bitmap indexes are widely used in data warehouses and scientific applications to
efficiently query large-scale data repositories. They are successfully implemented in
commercial Database Management Systems, e.g., Oracle [2], IBM [46], Informix [33],
Sybase [20], and have found many other applications such as visualization of scientific
data [56]. Bitmap indexes can provide very efficient performance for point and range
queries thanks to fast bit-wise operations supported by hardware.
In their simplest and most popular form, equality encoded bitmaps also known
as projection indexes, are constructed by generating a set of bit vectors where a ’1’
in the i
th
position of a bit vector indicates that the i
th
tuple maps to the attribute-
value associated with that particular bit vector. In general, the mapping can be used
to encode a specific value, category, or range of values for a given attribute. Each
attribute is encoded independently.
Queries are then executed by performing the appropriate bit operations over the
bit vectors needed to answer the query. A range query over a value-encoded bitmap
can be answered by ORing together the bit vectors that correspond to the range
for each attribute, and then ANDing together these intermediate results between
attributes. The resultant bit vector has set bits for the tuples that fall within the
range queried.
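A toy sketch of this execution model is shown below, with bitmaps held as plain Python integers and no compression; the data values and the query range are made up for illustration.

    rows = [(3, 7), (1, 2), (2, 7), (3, 3), (2, 9)]      # (attribute A, attribute B)

    bitmaps = {}                                          # (attr position, value) -> bit vector
    for pos, row in enumerate(rows):
        for attr, value in enumerate(row):
            bitmaps[(attr, value)] = bitmaps.get((attr, value), 0) | (1 << pos)

    # Query: A in [2, 3] AND B in [3, 9]
    a_side = bitmaps.get((0, 2), 0) | bitmaps.get((0, 3), 0)    # OR within attribute A
    b_side = 0
    for v in range(3, 10):                                      # OR within attribute B
        b_side |= bitmaps.get((1, v), 0)
    answer = a_side & b_side                                    # AND across attributes
    print([i for i in range(len(rows)) if answer >> i & 1])     # matching row positions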
One disadvantage of bitmap indexes is the amount of space they require. Com-
pression is used to reduce the size of the overall bitmap index. General compression
techniques are not desirable as they require bitmaps to be decompressed before ex-
ecuting the query. Run length encoders, on the contrary, allow the query execution
over the compressed bitmaps. The two most popular are Byte-Aligned Bitmap Code
(BBC) [2] and Word Aligned Hybrid (WAH) [57]. The CPU time required to an-
swer queries is related to the total length of the compressed bitmaps used in the query
computation.
These traditional bitmap indexes are analogous to single dimension indexes in
traditional indexing domains. However, just as traditional multi-attribute indexes
can be more effective in pruning data search space and reducing query time, multi-
attribute bitmap indexes could be applied to more efficiently answer queries. This
cost savings comes from two sources. First, the number of bitwise operations is
potentially reduced as more than one attribute is evaluated simultaneously. Second, the bitvectors built over two or more attributes are sparser, which allows for greater compression. The resulting smaller bitmaps make the processing of the bitmaps faster,
which translates into faster query execution time.
As an example, consider a 2-dimensional query over a large dataset. In order to
avoid sequential scan over the data we build an index access structure that can direct
the search by pruning the space. If single dimensional indexes, such as B+-Trees or
projection indexes, are available over the two attributes queried, the query would find
candidates for each attribute and then combine the candidate sets to obtain the final
answer. In the case of B+-Trees, the set intersection operation is used to combine
the sets. In the case of projection indexes, the AND operation is used to combine the
bitmaps. However, if a 2-dimensional index such as an R-tree or Grid File exists over
the two queried attributes, then the data space can be pruned simultaneously over
both dimensions. This results in a smaller result set if the two dimensions indexed are
not highly correlated. Also a set intersection operation is not required. Similarly, a
multi-dimensional bitmap over a combination of attributes and values identifies those
records that match the query combination and avoids a further bit operation in order
to answer the query criteria.
In this thesis, the use of Multi-Attribute Bitmaps is proposed to improve the
performance of queries in comparison with traditional single attribute indexes. The
goals of this research are to:
• Evaluate the query performance of a multi-attribute bitmap enhanced data man-
agement system compared to the current bitmap index query resolution model.
• Support ad hoc queries where information about expected queries is not avail-
able and support the experimental analysis provided herein. We propose an
expected cost-improvement based technique to evaluate potential attribute set
combinations.
• Given a selection query over a set of attributes and a set of single and multiple
attribute bitmaps available for those attributes, determine the bitmap query
execution steps that result in the best query performance with minimal over-
head.
The proposed approaches to attribute combination and query processing are based
on measuring the expected query cost savings of options. Although the techniques
incorporate heuristics in order to control analysis search space and query execution
overhead, experiments show significant opportunity for query execution performance
improvement.
4.2 Background
Bitmap Compression. Bitmap tables need to be compressed to be effective
on large databases. The two most popular compression techniques for bitmaps are
the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) code
[59]. Unlike traditional run length encoding, these schemes mix run length encoding
and direct storage. BBC stores the compressed data in Bytes while WAH stores it in
Words. WAH is simpler because it only has two types of words: literal words and fill
words. The most significant bit indicates the type of word we are dealing with. Let
w denote the number of bits in a word, the lower (w−1) bits of a literal word contain
the bit values from the bitmap. If the word is a fill, then the second most significant
bit is the fill bit, and the remaining (w −2) bits store the fill length. WAH imposes
the word-alignment requirement on the fills. This requirement is key to ensure that
logical operations only access words.

133 bits        1,20*0,4*1,78*0,30*1
31-bit groups   1,20*0,4*1,6*0   62*0   10*0,21*1   9*1
groups in hex   400003C0  00000000  00000000  001FFFFF  000001FF
WAH (hex)       400003C0  80000002  001FFFFF  000001FF

Figure 4.1: A WAH bit vector.
Figure 4.1 shows a WAH bit vector representing 133 bits. In this example, we
assume 32 bit words. Under this assumption, each literal word stores 31 bits from the
bitmap, and each fill word represents a multiple of 31 bits. The second line in Figure
4.1 shows how the bitmap is divided into 31-bit groups, and the third line shows the
hexadecimal representation of the groups.
The last line shows the values of WAH words. Since the first and third groups do
not consist of runs of a single bit value spanning a multiple of 31 bits, they are represented as
literal words (a 0 followed by the actual bit representation of the group). The fourth
group is less than 31 bits and thus is stored as a literal. The second group contains a
multiple of 31 0’s and therefore is represented as a fill word (a 1, followed by the 0 fill
value, followed by the number of fills in the last 30 bits). The first three words are
regular words, two literal words and one fill word. The fill word 80000002 indicates a
0-fill two words long (containing 62 consecutive zero bits). The fourth word is the active word; it stores the last few bits that could not be stored in a regular word. Another word with the value nine, not shown, is needed to store the number of
useful bits in the active word. Logical operations are performed over the compressed
bitmaps resulting in another compressed bitmap.

B1          400003C0 80000002 001FFFFF 000001FF
B2          00000100 C0000003 00001F08
B1 AND B2   00000100 80000002 001FFFFF 00000108

Figure 4.2: Query Execution Example over WAH compressed bit vectors.
Bit operations over the compressed WAH bitmap file have been shown to be faster
than BBC (2-20 times) [57] while BBC gives better compression ratio. In this paper
we use WAH to compress the bitmaps due to the impact on query execution times.
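For concreteness, the following is a minimal sketch of WAH encoding with 32-bit words. It stores a short final group right-aligned as the active word and omits the separate counter word for its useful bits, and run detection is done naively over an uncompressed bit list; it is an illustration of the scheme described above, not the implementation used in the experiments.

    WORD_BITS = 32
    GROUP_BITS = WORD_BITS - 1              # each literal word holds 31 bitmap bits

    def wah_encode(bits):
        """Compress a list of 0/1 values into a list of 32-bit WAH words."""
        words = []
        i = 0
        while i < len(bits):
            group = bits[i:i + GROUP_BITS]
            i += GROUP_BITS
            if len(group) == GROUP_BITS and len(set(group)) == 1:
                fill_bit = group[0]
                run = 1                     # count identical 31-bit groups
                while i + GROUP_BITS <= len(bits) and set(bits[i:i + GROUP_BITS]) == {fill_bit}:
                    run += 1
                    i += GROUP_BITS
                # fill word: MSB = 1, next bit = fill value, low 30 bits = run length
                words.append((1 << 31) | (fill_bit << 30) | run)
            else:
                value = 0                   # literal word: MSB = 0, 31 data bits
                for b in group:
                    value = (value << 1) | b
                words.append(value)         # a short final group becomes the active word
        return words

    # The bit vector of Figure 4.1: 1, 20 zeros, 4 ones, 78 zeros, 30 ones.
    bits = [1] + [0] * 20 + [1] * 4 + [0] * 78 + [1] * 30
    print([hex(w) for w in wah_encode(bits)])
    # expected: ['0x400003c0', '0x80000002', '0x1fffff', '0x1ff'], matching Figure 4.1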
Relationship Between Bitmap Index Size and Bitmap Query Execution
Time. Using WAH compression, queries are executed by performing logical opera-
tions over the compressed bitmaps which result in another compressed bitmap. As
an example, consider the two WAH compressed bitvectors in Figure 4.2.
First, the first word of each bitvector is decoded, and it is determined that they
are both literal words. Therefore, the logical AND operation is performed over the
two words and the resultant word is stored as the first word in the result bitmap.
When the next word is read from both bitmaps, we have two fill words. The fill word
from B1 is a zero-fill word compressing 2 words, and the fill word from B2 is a one-fill
word compressing 3 words. In this case we operate over 2 words simultaneously by
ANDing the fills and appending a fill word into the result bitmap that compresses
two words (the minimum between the word counts). Since we still have one word
left from B2, we only read a word from B1 this time, and AND it with the all 1
word from B2. Finally, the last two words are read from the two bitmaps and are
ANDed together as they are literal words. A full description of the query execution
process using WAH can be found in [59]. Query execution time is proportional to
the size of the bitmap involved in answering the query [56]. To illustrate the effect
of bit density and consequently bitmap size on query times over compressed bitmaps
we perform the following experiment. We generated 6 uniformly random bitmaps
representing 2 million entries for a number of different bit densities and ANDed the
bitmaps together. Figure 4.3 shows average query time against the total bitmap
length of the bitmaps used as operands in the series of operations.
The expected size of a random bitmap with N bits compressed using WAH is
given by the formula [59]:
m_R(d) \approx \frac{N}{w-1} \left( 1 - (1-d)^{2w-2} - d^{2w-2} \right)
where w is the word size, and d is the bit density.
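Plugging a few densities into this formula (here with w = 32 and N = 2 million rows) illustrates how quickly the expected compressed size shrinks as the bit density drops; the numbers produced are analytic values from the formula, not measurements.

    def expected_wah_words(N, d, w=32):
        """Expected number of WAH words for a random bitmap of N bits and density d."""
        return N / (w - 1) * (1 - (1 - d) ** (2 * w - 2) - d ** (2 * w - 2))

    for d in (0.001, 0.01, 0.1, 0.5):
        print(d, round(expected_wah_words(2_000_000, d)))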
The lower the bit density, the more quickly and more effectively intermediate
results can be compressed as the number of AND operations increases. Figure 4.3 shows
a clear relationship between query times and bitmap length. The different slopes of
the curve are due to changes in frequency of the type of bit operations performed
over the word size chunks of the bitmap operands. At low bit densities, there is a
greater portion of fill word to fill word operations. As the bit density increases, more
fill word to literal word operations occur. At high bit densities, literal word to literal
word operations become more frequent. As these inter-word operations have different costs, the overall query time to bitmap length relationship is not linear. However it is clear that query execution time is related to overall bitmap length.

Figure 4.3: Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps, 2M entries
Relationship Between Number of Bitmap Operations and Bitmap Query
Execution Time. Figure 4.4 shows AND query times over WAH-compressed bitmaps
as the number of bitmaps ANDed varies. Again the bitmaps are randomly generated
with the reported bit densities for 2 million entries and the number of bitmaps in-
volved in an AND query is varied. There is a clear relationship between query times and
the number of bitmap operations.
Figure 4.4: Query Execution Times vs Number of Attributes queried, Point Queries, 2M entries
Since query performance is related to the total bitmap length and number of
bitmap operations, a logical question is "can we modify the representation of the data in such a way that we can decrease these factors?" One could combine attributes and
generate bitmaps over multiple attributes in order to accomplish this. This is because
a combination over multiple attributes has a lower bit density and is potentially more
compressible, and already has an operation that may be necessary for query execution
built into its construction. However, several research questions arise in this scenario
such as which attributes should be combined, how can the additional indexes be
utilized, and what is the additional burden added by the richer index sets.
Technical Motivation for Multi-Attribute Bitmaps. Figure 4.5 shows a
simple example combining two single attributes with cardinality 2, into one 2-attribute
bitmap with cardinality 4. In the table, the combined column labeled b1.b2 means
that the value of attribute 1 maps to b1 and the value for attribute 2 maps to b2.
Such a representation can have the effect of reducing the size of the bitmaps that are
used to answer a query and can also reduce the number of operations. As attributes
are combined, bitcolumns become sparser, and the opportunity for compressing that
column increases, potentially decreasing the size of a bitcolumn. Additionally, if a bit
operation is already incorporated in the multi-attribute bitmap construction, then it
does not need to be evaluated as part of the query execution.
One could think of multi-attribute bitmaps as the AND bitmap of two or more
bitmaps from different attributes. One could argue that multi-attribute bitmaps are
reducing the dimensionality of the dataset but increasing the cardinality of the at-
tributes involved in the new dataset. Although bitmap indexes have historically been
discouraged for high-cardinality attributes, the advances in compression techniques
and query processing over compressed bitmaps have made bitmaps an effective solu-
tion for attributes with cardinalities on the order of several hundreds [58].
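As a minimal sketch of this construction, the snippet below builds the combined bitcolumns as the AND of one bitcolumn from each constituent attribute. It uses an uncompressed java.util.BitSet purely as a stand-in for the WAH-compressed bitmaps, and the class, method, and key format ("v1.v2", as in Figure 4.5) are illustrative assumptions rather than the actual implementation.

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: each 2-attribute bitcolumn is the AND of the corresponding
    // single-attribute bitcolumns, so that AND is built into the index construction.
    class MultiAttributeBuilder {
        static Map<String, BitSet> combine(Map<Integer, BitSet> att1,
                                           Map<Integer, BitSet> att2) {
            Map<String, BitSet> combined = new HashMap<>();
            for (Map.Entry<Integer, BitSet> e1 : att1.entrySet()) {
                for (Map.Entry<Integer, BitSet> e2 : att2.entrySet()) {
                    BitSet cb = (BitSet) e1.getValue().clone();
                    cb.and(e2.getValue());
                    combined.put(e1.getKey() + "." + e2.getKey(), cb);
                }
            }
            return combined;
        }
    }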
As an illustrative example of enhancing compression from combining attributes,
consider a bitmap index built over two attributes, each with cardinality 10, where
the data is uniformly randomly distributed independently for each attribute for 5000
tuples. Using WAH compression, there are very few instances where there are long
enough runs of 0’s to achieve any compression (since out of 10 random tuples, we
would expect 1 set bit for a value). In a test matching this scenario, very little compression was achieved, and the bitmaps representing each of the 20 attribute-value pairs were between 640 and 648 bytes long.

              Attribute 1    Attribute 2    Combined
    Tuple     b1   b2        b1   b2        b1.b1   b1.b2   b2.b1   b2.b2
    t1        1    0         0    1         0       1       0       0
    t2        0    1         1    0         0       0       1       0
    t3        1    0         1    0         1       0       0       0
    t4        1    0         0    1         0       1       0       0
    t5        0    1         0    1         0       0       0       1
    t6        0    1         1    0         0       0       1       0

Figure 4.5: Simple multiple attribute bitmap example showing the combination of two single attribute bitmaps.
As an alternative, we can generate a bitmap index for the combination of the two
attributes. Now the cardinality of the combined attributes is 100, and the expected
number of set bits in a run of 100 tuples for a single value is 1. We can expect much
greater compression in this situation. In the given scenario, the generated sizes of the 100 2-D bitmaps were between 220 and 400 bytes.
Now consider a range query such that attribute 1 must be between values 1 and
2, and attribute 2 must be equal to 3. The operation sequence using traditional bitmap processing to answer the query is ((ATT1=1) OR (ATT1=2)) AND (ATT2=3). The
time required is dependent on the total length of the bitmaps involved. Each of
these bitmaps as well as the resultant intermediate bitmap from the OR operation is
648 bytes, for a total of 2592 bytes. For the multi-attribute case, the query can be
answered by (ATT1=1,ATT2=3) OR (ATT1=2,ATT2=3). These bitmaps are 268
and 260 bytes respectively, for a total of 528 bytes. In this scenario, we can reduce
both the size of the bitmaps involved in the query execution and also the number of
bitmaps involved.
This example demonstrates the opportunity to improve query execution by rep-
resenting multiple attribute values from separate attributes in a bitmap. In general,
point queries would always benefit from the multi-attribute bitmaps as long as all the
combined attributes are queried. However, if only some of the attributes were queried
or a large range from one or more attributes were queried, then the single attribute
bitmaps would perform better than the multi-attribute bitmaps. This is the reason
why we propose to supplement the single attribute bitmaps with the multi-attribute
bitmaps and not to replace the single attribute bitmaps. Multi-attribute bitmaps
would only be used in the queries that result in an improved query execution time.
Supplementing single attribute bitmaps with additional bitmaps that improve
query execution time has also been done in [50], where lower resolution bitmaps were
used in addition to the single attribute bitmaps or high resolution bitmaps. One could
think of this approach as storing the OR of several bitmaps from the same attribute
together.
4.3 Proposed Approach
In this section we present our proposed solution for the problems presented in the
previous section. The primary overall tasks can be summarized as multi-attribute
bitmap index selection and query execution. In index selection, we determine which
attributes are appropriate or the most appropriate to combine together. This can be
performed with varying degrees of knowledge about a potential query set. Once can-
didate attribute combinations are determined, the combined bitmaps are generated.
In query execution, the candidate pool of bitmaps is explored to determine a plan for query execution.
4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc
Queries
In many cases the attributes that should be combined together as multi-attribute
bitmaps are obvious given the context of the data and queries. Whenever multiple
attributes are frequently queried together over a single bitmap value, that combination
of attributes is a candidate for multi-attribute indexing. As an example, consider a
database system behind a car search web form. If the form has categorical entries
for color and model, then a multi-attribute bitmap built over these two attributes
can directly filter those data objects that meet both search criteria without further
processing.
In order to support situations where no contextual information is available about
queries and in order to allow for comparative performance evaluation, we provide a
cost improvement based candidate evaluation system. The goal behind bitmap index
selection is to find sets of attributes that, if combined, can yield a large benefit in terms
of query execution time. The analogy within the traditional multi-attribute index
selection domain would be finding attributes that are not individually effective in
pruning data space, but can effectively prune the data space when combined together.
An effective index selection technique for bitmap indexes should be cognizant of
the relationships between attributes. In addition, the selectivity estimated for a given attribute combination should accurately reflect the true data characteristics.
Since query execution time is dependent both on the total size of the operands
involved in a bitmap computation and also the number of bitmap operations, we use
these measures to determine the potential benefit for a given attribute combination.
Algorithm 3 shows how candidate multi-attribute indexes are evaluated and selected
in terms of potential benefit.
Determining Index Selection Candidates

SelectCombinationCandidates (Ordered set of attributes A, dLimit, cardLimit)
notes:
    attSets - a list of attribute sets and their corresponding goodness measures
    selected - the set of recommended attribute sets to build indexes over
 1: attSets = {[{a_i}, 0] | a_i ∈ A}
 2: currentDim = 1
 3: while (currentDim < dLimit AND newSetsAdded)
 4:     unset newSetsAdded
 5:     for each as_i ∈ attSets, s.t. |as_i| = currentDim
 6:         for each attribute a_j, j = (maxAtt(as_i) + 1) to |A|
 7:             if (as_i.card * a_j.card < cardLimit)
 8:                 as_j = [as_i ∪ {a_j}, detGoodness(as_i, a_j)]
 9:                 if goodness(as_j) > goodness(as_i)
10:                     add as_j to attSets; set newSetsAdded
11:     currentDim++
12: sort attSets by goodness
13: ca = ∅
14: while ca ≠ A
15:     attribute set as = next set from attSets
16:     if as ∩ ca = ∅
17:         ca = ca ∪ as.attributes
18:         include as in selected

Algorithm 3: Selecting Attribute Combination Candidates.
The algorithm is somewhat analogous to the Apriori method for frequent itemset mining in that a given candidate set is only evaluated if its subset is considered bene-
ficial. The algorithm takes several parameters as input. These include an array, A, of
the attributes under analysis. The parameter dLimit provides a maximum attribute
set combination size. The maximum cardinality allowed for a given prospective com-
bination can be controlled by the cardLimit parameter.
The algorithm evaluates increasingly larger attribute sets (loop starting at line 3).
Each unique combination is evaluated (lines 5-6). Initially all size 1 attribute sets are
combined with each other size 1 attribute set such that each unique size 2 attribute
set is evaluated. In the next iteration of the outer loop, each potentially beneficial
size 2 set is evaluated with relevant size 1 attributes (note that by evaluating only
those singleton sets of attributes numbered larger than the largest in the current set,
we can evaluate all unique set instances). If the number of bitmaps generated by
this combination is excessive (line 7), the set will not be evaluated. The card value
associated with an attribute set is the number of value combinations possible for the
attribute set, and reflects the number of bitmaps that would have to be generated to
index based on the set. A goodness measure is evaluated for the proposed combination
(line 8). If the goodness of the combination exceeds the goodness of its ancestor, then
the combination set is added to the list of attribute sets for further evaluation during
the next loop iteration. The term newSetsAdded in line 3 is a boolean condition
indicating if any attribute set was added during the last loop iteration.
Once the loop terminates, the sets in attSets are sorted by their goodness mea-
sures (line 12). Then sets of attributes that are disjoint from the covered attributes,
ca, are greedily selected from the sorted sets and added to the recommended bitmap
indexes.
Estimating Performance Gain for a Potential Index

detGoodness (attribute set as_i, attribute a_j)
 1: set probQT and probSR
 2: goodness = 0
 3: for each bitcolumn bc_k in as_i
 4:     for each bitcolumn bc_j in a_j
 5:         bitcolumn cb = AND(bc_k, bc_j)
 6:         savings = bc_k.length + bc_j.length + bc_k.creationLength − cb.length
 7:         if savings > 0
 8:             goodness += savings * cb.matches
 9: goodness *= (probQT * probSR)^(|as_i| + 1)
10: return goodness

Algorithm 4: Determine goodness of including an attribute in a multi-attribute bitmap.
The goodness of a particular attribute set is computed by Algorithm 4. The
variables probQT and probSR describe global query characteristics and are used to estimate diminishing benefits as index dimensionality increases. The probQT parameter represents the probability that an attribute is queried. As with a traditional multi-attribute index, a 2-attribute index is not as useful for a single attribute query.
The probSR parameter is the probability that the query range for an attribute will
be small enough such that a multi-attribute index over the query criteria will still be
beneficial.
Each unique value combination of the entire attribute set (combinations generated
in lines 4-5) is evaluated in terms of bitmap length savings over a corresponding
single attribute bitmap solution. The goodness of an attribute set is measured as the
weighted average of the bitlength savings of each value combination (line 8). The
savings of a particular combination is measured as the sum of the compressed lengths
of the bitmap operands and the compressed length of the bitmaps needed to create
the first operand, minus the compressed length of the combined bitmap. For size 1
attribute sets, the bitcolumn creation length is 0. However, for a multiple attribute
bitmap this value is the length of all the constituent bitmaps used to create it. The
total savings reflects the difference in total bitmap operand length to answer the
combination under analysis.
If the savings for a particular combination is negative, then its negative savings
is not accumulated (line 7). This is because this option will not be selected during
the query execution for this combination and will not incur an additional penalty.
Each combination is weighted by its frequency of occurrence. This strategy assumes
that the query space is similar in distribution to the data space. The final goodness
measure for an attribute set is also modified based on the probability that benefit
will actually be realized for the query (line 9).
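The sketch below mirrors Algorithm 4 over uncompressed BitSets to make the accounting explicit. It is an illustrative rendering, not the system's implementation: compressedLength() is an assumed helper standing in for the WAH-compressed size of a bitcolumn, and a single creation length is used for all columns of the existing attribute set as a simplification.

    import java.util.BitSet;
    import java.util.List;

    // Sketch of Algorithm 4: weight the bitmap-length savings of each value
    // combination by its number of matches, then damp the total by the probability
    // that the benefit will actually be realized for a query.
    class GoodnessEstimator {
        static double detGoodness(List<BitSet> asCols, long asCreationLength,
                                  List<BitSet> aCols, double probQT, double probSR,
                                  int asSize) {
            double goodness = 0;
            for (BitSet bck : asCols) {
                for (BitSet bcj : aCols) {
                    BitSet cb = (BitSet) bck.clone();
                    cb.and(bcj);                               // the combined bitcolumn
                    long savings = compressedLength(bck) + compressedLength(bcj)
                                 + asCreationLength - compressedLength(cb);
                    if (savings > 0) goodness += savings * cb.cardinality();
                }
            }
            return goodness * Math.pow(probQT * probSR, asSize + 1);
        }

        // Assumed helper: would return the WAH-compressed length of a bitcolumn;
        // the placeholder below simply reports the uncompressed size in bytes.
        static long compressedLength(BitSet b) { return b.size() / 8; }
    }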
This technique finds attributes for which their combinations will lead to a reduc-
tion in the number of operations as well as a reduction in the constituent bitmap
lengths. Both of these factors can provide a performance gain depending on the
query criteria. Our approach takes into account the characteristics of the data by
weighting and measuring the effectiveness of specific inter-attribute combinations.
The index selection solution space is controlled by limiting parameters and through
greedy selection of potential solutions. Although the indexes recommended are not
guaranteed to be optimal, they can be determined efficiently. Later experiments show
that the reduction in the number of total operations has a more substantial effect on
query execution time than reduction of bitmap length (i.e., seemingly less-than-ideal attribute combinations can still provide performance gains).
The behavior of the algorithm can be adjusted with slight alterations. For example, if index space is critical, we can order the pairs in line 12 of the SelectCombinationCandidates algorithm (Algorithm 3) by goodness over combined attribute size. If queries are equally likely to occur anywhere in the data space, then the weighting in line 8 of the detGoodness algorithm (Algorithm 4) can be removed.
Once bitmap combinations are recommended, each combined bitmap can be gen-
erated by assigning an appropriate identifier to the bitmap and ANDing together the
applicable single attribute bitmaps. Once multi-attribute bitmaps are available in the
system, we need a mechanism to decide when single attribute and/or multi-attribute
bitmaps would be used. In other words, we need to decide on a query execution plan
for each query.
4.3.2 Query Execution with Multi-Attribute Bitmaps
One issue associated with using multiple representations for data indexes is that
the query execution becomes more complex. In order to take advantage of the most
useful representation, the potential execution plans must be evaluated and the op-
tion that is deemed better should be performed. To be effective, the complexity
added by the additional representations cannot be more costly than the benefits
attained by incorporating them. We take a number of measures in order to limit the
overhead associated with selecting the best query execution option. We keep track
of those attribute sets that have bitmap representations (either single attribute or
multi-attribute). We use a unique mapping for bitmap identification. We explore
potential query execution options using inexpensive operations.
Global Index Identification
In order to identify the potential index search space, we maintain a data structure
of those attributes and combined attributes for which we have bitmap indexes con-
structed. If we have available a single attribute bitmap for the first attribute, we will
include a set identified as [1] in this structure. If we have available a multi-attribute
bitmap built over attributes 1,2, and 3 then we include the set [1,2,3] in the struc-
ture. Upon receiving a query, we compare the set representation of the attributes
in the query to the sets in the index identification structure. Those sets that have
non-empty intersections with the query set are candidates for use in query execution.
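A sketch of this check is shown below; the names are hypothetical, but the logic is exactly the set-intersection test described above.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Set;

    // Sketch: the candidate indexes for a query are those whose attribute sets
    // have a non-empty intersection with the query's attribute set.
    class IndexCandidates {
        static List<Set<Integer>> candidates(List<Set<Integer>> indexedSets,
                                             Set<Integer> queryAtts) {
            List<Set<Integer>> result = new ArrayList<>();
            for (Set<Integer> s : indexedSets) {
                if (!Collections.disjoint(s, queryAtts)) {
                    result.add(s);
                }
            }
            return result;
        }
    }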
Bitmap Naming
We need to be able to easily translate a query to the set of bitmaps that match the
criteria of a query. In order to do this we devise a one-to-one mapping that directly
translates query criteria to bitmap coverage. The key is made up of the attributes
covered by the bitmap and the values for those attributes. For example, a multi-
attribute bitmap over attributes 9 and 10 where the value for attribute 9 is 5 and the
value for attribute 10 is 3 would get "[9,10]5.3" as its key. The attributes are always
stored in numerical order so that there is only one possible mapping for various set
orders, and the values are given in the same order as the attributes. Finding relevant
bitmaps is a matter of translating the query into its key to access the bitmaps.
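A sketch of this mapping is given below. It is a hypothetical helper that assumes attribute numbers and their values are supplied in ascending attribute order, matching the convention described above.

    // Sketch: build the unique key for a (multi-)attribute bitcolumn, e.g.
    // attributes {9,10} with values {5,3} -> "[9,10]5.3"; attribute {1} with value 2 -> "[1]2".
    class BitmapKeys {
        static String bitmapKey(int[] atts, int[] vals) {
            StringBuilder key = new StringBuilder("[");
            for (int i = 0; i < atts.length; i++) {
                key.append(atts[i]).append(i < atts.length - 1 ? "," : "]");
            }
            for (int i = 0; i < vals.length; i++) {
                if (i > 0) key.append(".");
                key.append(vals[i]);
            }
            return key.toString();
        }
    }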
Query Execution
During query execution we need a lightweight method to check which potential
query execution will provide the fastest answer to the query. The first step is to
compare the set of attributes in the query to the sets of attributes for which we have
constructed bitmap indexes. Those indexes which cover an attribute involved in the
query are candidates to be used as part of the solution for the query. Once candidate
indexes are determined, we compute the total length of the bitmaps required to answer
the query.
A recursive function is used to compute the total length of all matching bitmaps
in order to handle arbitrary sized attribute sets. The complete unique hash key is
generated in the base case of the recursive method and the length of the matching
bitmaps is obtained and returned to the caller. For the recursive case, the
unique key is built up and passed to the recursive call, and the callee length results
are tabulated.
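The sketch below illustrates one way such a recursive tabulation could look. All names are hypothetical; it reuses the bitmapKey helper sketched earlier and assumes a map from bitmap key to compressed bitmap length.

    import java.util.List;
    import java.util.Map;

    // Sketch: recursively enumerate every value combination matching the query for
    // this index's attributes and sum the lengths of the corresponding bitmaps.
    class MatchingLength {
        static long matchingLength(int[] atts, List<int[]> queriedValues, int depth,
                                   int[] chosen, Map<String, Long> lengthByKey) {
            if (depth == atts.length) {                      // base case: key is complete
                Long len = lengthByKey.get(BitmapKeys.bitmapKey(atts, chosen));
                return len == null ? 0 : len;
            }
            long total = 0;
            for (int v : queriedValues.get(depth)) {         // recursive case: extend the key
                chosen[depth] = v;
                total += matchingLength(atts, queriedValues, depth + 1, chosen, lengthByKey);
            }
            return total;
        }
    }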
We just need to sum the lengths of bitmaps that match the query criteria, which
we can compute using the recursive function. For example, if the total length of
matching bitmaps for the multi-attribute bitmap over attributes [1,2] is less than the
total length for both attributes [1] and [2], then the multi-attribute bitmap should be
used to answer the query. However, if the multi-attribute bitmaps have greater total
length, then the single attribute bitmaps should be used.
The logic applied to determine the query execution plan needs to be lightweight
so as to not add too much overhead to handle the increased complexity of bitmap
representation. So rather than incorporating logic that guarantees an optimal minimal
total bitmap length, we approximate this behavior using heuristics. We order the
potential attribute sets in descending order based on the amount of length saved by
using the index. As a baseline, all potential single attribute indexes get a value of
0 for the amount of length saved. Multi-attribute indexes get a value which is the
difference between the length of the single attribute indexes that would be required
to answer the query and the length of the multi-attribute index to answer the query.
Then we process the query by greedily selecting non-covered attribute sets from this
order until the query is answered. Those multi-attribute indexes that have positive
benefit compared to single attribute indexes will have positive value, and will be
processed before any single attribute index. Those that have negative values would actually slow down the query and will not be processed before their constituent single attribute indexes.
Algorithm 5 provides the algorithm for executing selection queries when the entire
bitmap index has both single attribute and multi-attribute bitmap indexes available to
process the query. It involves two major steps. In the first step, the potential indexes
are prioritized based on how promising they are with respect to saving total bitmap
length during query execution. The second step performs the bit operations required
to answer the query. The algorithm assumes that the bitmap indexes available are
identified and that we can access any bitcolumn through a unique hashkey.
In lines 1-5, we compute the total size of the bitmaps that match the query
considering those attributes that intersect between the query and the index attribute
set currently under analysis. For example, consider a query over values 1 and 2 for
attribute 1 and over value 3 for attribute 2, where we have bitmaps available for
the attribute sets [1], [2], and [1,2] combined. Table 4.1 shows the query matching
keys for each of the potential indexes. At the end of this step, we will have a total query matching bitmap length for each available index.
SelectionQueryExecution (Available Indexes AI, BitColumn Hashtable BI, query q)
notes:
    AI - contains identification of sets of indexed attributes, with length and saved fields and a query matching keys list
    BI - provides access to a bitcolumn using its unique key
    B1 - a bitcolumn of all 1's
// Prioritize Query Plan
 1: for each attribute set as in AI
 2:     if as.attributes ∩ q.attributes ≠ ∅
 3:         generate query matching keys for as
 4:         attach query matching keys to as
 5:         sum query matching bitmap lengths
 6: for each attribute set as in AI
 7:     if as.numAtts == 1
 8:         as.saved = 0
 9:     else
10:         for each size 1 attribute subset a in as
11:             as.saved += a.length
12:         as.saved -= as.length
13: sort AI by saved field
// Execute Query
14: attribute set ca = null
15: queryResult = B1
16: while ca ≠ q.attributes
17:     as = next attribute set from AI
18:     if as.attributes ∩ ca = ∅
19:         ca = ca ∪ as.attributes
20:         subresult = BI.get(as.keys[0])
21:         for i = 1 to |as.keys| − 1
22:             subresult = OR(subresult, BI.get(as.keys[i]))
23:         queryResult = AND(queryResult, subresult)
24: return queryResult

Algorithm 5: Query Execution for Selection Queries when Combined Attribute Bitmaps are Available.
Note that the computation
of bitmap length in line 5 is not actually a simple sum. The bit density of each
matching bitmap is estimated based on its length. Since the matching bitmaps will
not overlap, these bit densities are summed to arrive at an effective bit density of the
resultant OR bitmap. The length of the resultant bitmap is then derived using this effective bit density. If the bit density is within the range of about 0.1 to 0.9, its value cannot be reliably determined from the compressed length. In these cases, we conservatively estimate its value as 0.5.
This does not adversely affect query operation since the multi-attribute option would
not be selected anyway.
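One way to realize this estimate is sketched below, reusing the expectedWahWords helper sketched earlier: the WAH size formula is inverted numerically on the low-density branch, and lengths that fall in the saturated region are mapped to the conservative value of 0.5. The thresholds and the bisection approach are assumptions for illustration, not the system's actual procedure.

    // Sketch: estimate a bit density from a compressed length. The expected size is
    // increasing in d on the low-density branch, so bisection recovers d; lengths near
    // the saturation plateau (density roughly 0.1-0.9) cannot be inverted reliably.
    class DensityEstimate {
        static double estimateDensity(double words, long n, int w) {
            double lo = 0.0, hi = 0.1;
            if (words >= WahModel.expectedWahWords(n, w, hi)) {
                return 0.5;                                  // saturated: fall back conservatively
            }
            for (int i = 0; i < 50; i++) {                   // bisection on [lo, hi]
                double mid = (lo + hi) / 2;
                if (WahModel.expectedWahWords(n, w, mid) < words) lo = mid; else hi = mid;
            }
            return (lo + hi) / 2;
        }
    }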
Available Index Matching Keys
[1] [1]1,[1]2
[2] [2]3
[1, 2] [1, 2]1.3 , [1, 2]2.3
Table 4.1: Sample Query Matching Keys for Query Attributes 1 and 2.
In lines 6-12, we use this initial information in order to compute the total bitmap
length saved by using a multi-attribute bitmap in relation to using the relevant single
attribute alternatives. For a multiple attribute index, we sum up the query matching
bitmap length associated with each single attribute subset of the multi-attribute
attribute set. The difference between this length and the total length of the query
matching bitmaps for the multi-attribute set is the amount of bitmap length saved by
using the multi-attribute set. All single attribute indexes are assigned a saved value
of 0. Those multi-attribute indexes that reflect a cost savings will get a positive value
for saved, while those that are more costly will get a negative value. In line 13, we
sort the potential indexes in terms of this saved value.
In lines 14 and 15, we initialize a set that indicates the attributes covered by the
query so far and initialize the query result to all 1’s. Then we start processing the
query in order of how promising the prospective indexes were and continue until all query attributes are covered (line 16). We get the next best index and check if its attributes have already been covered (lines 17-18). If not, we will process a subquery using the index. We update the covered attribute set and OR together all bitcolumns that match the query criteria relevant to the index. We AND together this
result to build the subqueries into the overall query result. When we have covered all
query attributes, the query is complete and we return the result.
4.3.3 Index Updates
Bitmap indexes are typically used in write-once read-mostly databases such as
data warehouses. Appends of new data are often done in batches and the whole index
is rebuilt from scratch once the update is completed. Bitmap updates are expensive
as all columns in the bitmap index need to be accessed. In [12], we propose the use of
a padword to avoid accessing all bitmaps to insert a new tuple but only to update the
bitmaps corresponding to the inserted values. In other words, the number of bitmaps
that need to be updated when a new tuple is appended in a table with A attributes
is A, as opposed to $\sum_{i=1}^{A} c_i$, where $c_i$ is the number of bitmap values (or cardinality) for attribute $A_i$. This approach, which clearly outperforms the alternative, is suitable
when the update rate is not very high and we need to maintain the index up-to-date
in an online manner.
Updating multi-attribute bitmaps is even more efficient than updating traditional
bitmaps. If multi-attributes bitmaps are also padded we only need to access the
bitmaps corresponding to the combined values (attributes) that are being inserted.
For example, consider the previous table with A attributes where multi-attribute bitmaps are built over pairs of attributes: the cost of updating the multi-attribute bitmaps is half the cost of updating the traditional bitmaps, as only A/2 bitmaps need to be accessed.
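As a concrete illustration using the synthetic dataset described in Section 4.4 (the numbers below simply apply the counts above and are not separately measured): with A = 12 attributes of cardinalities 2, 5, 10, and 25 (three attributes each), appending one tuple touches Σ c_i = 3(2 + 5 + 10 + 25) = 126 bitmaps without padwords, A = 12 bitmaps with padded single attribute bitmaps, and only A/2 = 6 bitmaps when a suite of padded 2-attribute bitmaps is updated instead.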
4.4 Experiments
4.4.1 Experimental Setup
Experiments were executed on a 1.87GHz HP Pavilion PC with 2GB RAM running the Windows Vista operating system. The bitmap index selection, bitmap creation,
and query execution algorithms were implemented in Java.
Data
Several data sets are used during experiments. They are characterized as follows:
• synthetic - 12 attributes, 1M records, attributes 1-3 have cardinality 2, 4-6 have
cardinality 5, 7-9 have cardinality 10, and 10-12 have cardinality 25. Attributes
1, 4, 7, and 10 follow a uniform random distribution; attributes 2, 5, 8, and 11 follow a Zipf distribution with s parameter set to 1; and attributes 3, 6, 9, and 12 follow a Zipf distribution with s parameter set to 2.
• uniform - uniformly distributed random data set. It has 12 attributes, each
with cardinality 10, 1M records.
• census - a discretized version of the USCensus1990raw data set. The raw dataset
is from the (U.S. Department of Commerce) Census Bureau website. The dis-
cretized dataset comes from the UCI Machine Learning Repository. We used the
first 1 million instances and 44 attributes. The cardinality of these attributes
varies from 2 to 16. The data set can be obtained from [47].
• HEP - a bin representation of High-Energy Physics data. It has 12 attributes.
The cardinality per dimension varies between 2 and 12. There are a total of
122 bitcolumns for this data set. This data set is not uniformly distributed.
This data set is made up of more than 2 million records.
• Landsat - a bin representation of satellite imagery data. The data set has 60 attributes, each with cardinality 10. It contains over 275,000 records.
Queries
Query sets were generated by selecting random points from the data and con-
structing point queries from those points. These points remain in the data set, so
that it is guaranteed that there will be at least one match for each query. For those
cases where range queries are used, the range is expanded around a selected point so
that the range is guaranteed to contain at least one match.
Metrics
Our goal is to compare performance of a system reflecting the current state of the
art, where only single attribute bitmaps are available, against our proposed approach
in which both single and multi-attribute bitmaps are available. We present compar-
isons in terms of query execution time, number of bitmaps involved in answering a
query, and total bitmap size accessed to answer a query. Query execution assumes
the bitmap indexes are in main memory. Further performance enhancement would
be possible during query execution if the required bitmaps need to be read from disk.
4.4.2 Experimental Results
Index Selection
We executed the index selection algorithm over the synthetic data set. We set the
cardLimit parameter to 100 and set the query probability parameters to 1 (i.e. no
degradation at higher number of attributes). The attribute sets that were selected
to combine together were [1,4,7], [2,5,8], and [3,6,9]. This solution makes sense in
the context of the performance metric of weighted benefit. We would expect more
expected benefit for combinations for which the number of value combinations ap-
proaches the cardinality limit, because a sparser bitmap can be more effectively com-
pressed. We would also expect more expected benefit from combining attributes that
are independent. Therefore set [1,4,7] is selected first, because it’s value combination
cardinality is 100, and the values are independent. The other combinations have
greater inter-attribute dependency because they follow Zipf distributions. Attributes
10, 11, and 12 do not combine with any others in this scenario because there are
no free attributes that would be within cardinality constraints. The average query execution time over 100 point queries using only single attribute bitmaps was 41.47ms, while it was 17.39ms using the recommended indexes. The time for the recommended index set includes an average overhead of 0.25ms to explore the richer set of options
and select the best query execution plan.
After executing index selection with parameters probQT and probSR set to 0.5,
we instead get 2-attribute recommendations. The proposed set of indexes is [7,8],
[1,10], [2,11], [4,9], [5,6], and [3,12]. Each of these has a value combination cardinality greater than or equal to 50, and the selection order is dependent on cardinality and then inter-attribute independence.

              Range Length
    Query     A1      A2
    Q1        1       1
    Q2        1       2
    Q3        1       5
    Q4        2       5
    Q5        2       5
    Q6        5       5

Table 4.2: Query Definitions for 2 Attributes with cardinality 10 and uniform random distribution.
Query Execution
Controlled Query Types. We ran 6 query types over bitmaps built over 2
attributes with uniform random distribution and cardinality 10. The queries are
detailed in Table 4.2 and vary in the length of the range. A range of 1 means that
the query for that attribute is a point query. Q1, for example, is a point query on
both attributes, and Q3 selects one value from the first attribute and five values from
the second attribute. Note that in this case, 5 is the worst case scenario. Any range
bigger than five could take the complement of the non-queried range to answer the
query.
Figures 4.6, 4.7, and 4.8 show the query execution time, the number of bitmaps involved in answering each query type, and the total size of the bitmaps involved in answering the query, respectively. In these figures, the 'Only 1-Dim' label refers to the case where
only Single Attribute Bitmaps are available, the ‘Only M-Dim’ label refers to the
case where the queries are forced to use Multi-Attribute Bitmaps, and the ‘Proposed’
label refers to the case where the query execution plan is selected considering both
indexes available.
For queries Q1-Q4, the multi-attribute bitmaps are always selected, and they are
significantly faster than the single-attribute options. For queries Q5-Q6, if single and
multi-attribute bitmaps are available, the single attribute option is always selected. However, the best standalone option is always slightly faster than the proposed technique due to the overhead of
evaluating and selecting the best option. For these queries, the overhead was on the
order of 0.2ms. The proposed technique usually selects the option that involves the
shortest original bitmaps required to answer the query. The exception here is Q5.
The single dimension index is used because it involves fewer bitmap operations and
involves less total bitmap operand length over the set of operations that need to take
place.
Figure 4.6: Query Execution Time for each query type.
Figure 4.7: Number of bitmaps involved in answering each query type.
Query Execution for Point Queries. Figure 4.9 demonstrates the potential
query speedup associated with our proposed query processing technique when multi-
attribute bitmaps are available to answer queries. The single-D bar shows the average
query execution time required to answer 100 point queries using traditional bitmap
index query execution, where points are randomly selected from the data set (there
is at least one query match). For the multi-attribute test, indexes were selected and
generated with a maximum set size of 2. So a suite of bitcolumns representing 2-
attribute combinations is available for query execution as well as the single attribute
bitmaps. Each single attribute is represented in exactly one attribute pair. The
results show significant speedup in query execution for point queries. Point queries
over the uniform, landsat, and hep datasets experience speedups of factors of 5.01,
3.23, and 3.17 respectively.
Figure 4.8: Total size of the bitmaps involved in answering each query type.
Similar results were achieved for the census data set. Indexes were selected without
a dimensionality limit and with a cardinality limit of 100. The query execution time
using only single attribute bitmaps was 83.01 ms while the query execution time was
only 24.92 ms when both single attribute and multi-attribute bitmaps were available.
Query Execution Over Range Queries. Figure 4.10 shows the impact of
using the single-attribute bitmaps instead of the proposed model as the queries trend
from being pure point queries to gradually more range-like queries. Average query
execution times over 100 queries are shown. The queries are constructed by making
the matching range of an attribute either a single value or half of the attribute cardinality. Whether an attribute is a point or a range is randomly determined. The x-axis shows the probability that a given pair of attributes in the query represents a range (e.g., 0.1 means that there is a 90% probability that the 2 query attributes are both over single values). The graph shows significant speedup for point queries.

Figure 4.9: Query Execution Times, Point Queries, Traditional versus Multi-Attribute Query Processing

Although the
speedup ratio decreases as the queries become more range-like, the multi-attribute
alternative maintains a performance advantage until nearly all the query attributes
are over ranges. This is because there are still some cases where using the multi-
attribute bitmap is useful.
Index Size
Figure 4.11 shows bitmap index sizes for each data set. The lower portion of the
bar shows the total size of the bitmaps for all of the single attribute indexes. The
upper portion of the bar shows the size of the suite of 2-attribute bitmaps (each at-
tribute represented exactly once in the suite). The entire bar represents the space
required for both the single and 2-attribute bitmaps.

Figure 4.10: Query Execution Times as Query Ranges Vary, Landsat Data

Even though the multi-attribute bitmaps represent many more columns than the single attribute bitmaps (e.g., the single attribute landsat bitmaps comprise 600 bitcolumns, while the 2-attribute counterpart
is represented by 3000 bitmaps (30 attribute pairs, 100 bitmaps per pair)), the com-
pression associated with the combined bitmaps means that the space overhead for
the multi-attribute representations is not as severe as may be expected.
Figure 4.11: Single and Multi-Attribute Bitmap Index Size Comparison
Bitmap Operations versus Bitmap Lengths
Since query execution enhancement is derived from two sources, reduction in
bitmap operations and reduction in bitmap length, it is valuable to know the rel-
ative impacts of each one. To this end, we tested query execution after performing
index selection using different strategies.
Table 4.3 shows the query speedup for point queries using different strategies to
determine the attributes that should be combined together. The speedup is measured
over 100 random point queries taken from HEP data. For each listed strategy, a suite
of size 2 attribute sets is determined differently. Randomly selected attributes simply
combines random pairs of attributes. The other 2 strategies use Algorithm 3 to
generate goodness factors for each combination based on bitmap length saved and
weighted by result density. The strategy using the best goodness values greedily selects non-intersecting pairs starting from the top of the list, while the other technique greedily selects non-overlapping pairs from the bottom. The results show that significant speedup can be achieved simply by combining reasonable attributes together, and that query execution can be further enhanced by careful selection of the indexes.

      Attribute Combination Strategy                          Query Speedup
    1 Algorithm 3, Selecting Best Combinations                     3.15
    2 Randomly Selected Attributes                                 2.79
    3 Algorithm 3, Selecting Worst Beneficial Combinations         2.67

Table 4.3: Query Speedup, HEP Data, Using Different Attribute Combination Strategies
4.5 Conclusion
In this chapter, we propose the use of multi-attribute bitmap indexes to speed up
the query execution time of point and range queries over traditional bitmap indexes.
We introduce an index selection technique in order to determine which attributes
are appropriate to combine as multi-attribute bitmaps. The technique is driven by a
performance metric that is tied to actual query execution time. It takes into account
data distribution and attribute correlations. It can also be tuned to avoid undesirable
conditions such as an excessive number of bitmaps or excessive bitmap dimensionality.
We introduce a query execution model when both traditional and multi-attribute
bitmaps are available to answer a query. The query execution takes advantage of the
multi-attribute bitmaps when it is beneficial to do so while introducing very little
overhead for cases when such bitmaps are not useful. Utilizing our index selection
and query execution techniques, we were able to consistently achieve up to 5 times
query execution speedup for point queries while introducing less than 1% overhead
for those queries where multi-attribute indexing was not useful.
CHAPTER 5
Conclusions
Data retrieval for a given application is characterized by the query types, query
patterns over time, and the distribution and characteristics of the data itself. In order
to maximize query performance, the query processing and access structures should
be tailored to match the characteristics of the application.
The optimization query framework presented herein describes an I/O-optimal
query processing technique to process any query within the broad class of queries
that can be stated as a convex optimization query. For such applications, the query
processing follows an access structure traversal path that matches the function being
optimized while respecting additional problem constraints.
While resolving queries, the possible query execution plans evaluated by a data
management system are limited by available indexes. The online high-dimensional in-
dex recommendation system provides a cost-based technique to identify access struc-
tures that will be beneficial for query processing over a query workload. It considers
both the query patterns and the data distribution to determine the benefit of different
alternatives. Since workloads can shift over time, a method to determine when the
current index set is no longer effective is also provided.
The work on multi-attribute bitmap indexes presents both query processing and index selection techniques for the traditionally single-attribute bitmap index. Query processing selects
a query execution plan to reduce the overall number of operations. The index selection
task finds attribute combinations that lead to the greatest expected query execution
improvement.
The primary end goal of the research presented in this thesis is to improve query
execution time. Both the run-time query processing and design-time access methods
are targeted as potential sources of query time improvement. By ensuring that the
run-time traversal and processing of available access structures matches the characteristics of a given query and that the access structures available match the overall
query workload and data patterns, improved query execution time is achieved.
BIBLIOGRAPHY
[1] S. Agrawal, S. Chaudhuri, L. Kollár, A. P. Marathe, V. R. Narasayya, and
M. Syamala. Database tuning advisor for microsoft sql server 2005. In VLDB,
pages 1110–1121, 2004.
[2] G. Antoshenkov. Byte-aligned bitmap compression. In Data Compression Con-
ference, Nashua, NH, 1995. Oracle Corp.
[3] E. Barucci, R. Pinzani, and R. Sprugnoli. Optimal selection of secondary indexes.
IEEE Transactions on Software Engineering, 1990.
[4] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University
Press, 1961.
[5] S. Berchtold, C. B¨ohm, D. A. Keim, and H.-P. Kriegel. A cost model for nearest
neighbor search in high-dimensional data space. PODS, pages 78–86, 1997.
[6] S. Berchtold, D. Keim, and H. Kriegel. The x-tree: An index structure for high-
dimensional data. In Proceedings of the International Conference on Very Large
Data Bases, pages 28–39, 1996.
[7] A. Blum and P. Langley. Selection of relevant features and examples in machine
learning. AI, 1997.
[8] C. Bohm. A cost model for query processing in high dimensional data spaces.
ACM Trans. Database Syst., 25(2):129–178, 2000.
[9] C. Bohm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces:
Index structures for improving the performance of multimedia databases. ACM
Comput. Surv., 33(3):322–373, 2001.
[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University
Press, New York, NY. USA, 2004.
[11] N. Bruno and S. Chaudhuri. To tune or not to tune? a lightweight physical
design alerter. In VLDB, pages 499–510, 2006.
[12] G. Canahuate, M. Gibas, and H. Ferhatosmanoglu. Update conscious bitmap
indices. In SSDBM, 2007.
[13] A. Capara, M. Fischetti, and D. Maio. Exact and approximate algorithms for
the index selection problem in physical database design. IEEE Transactions on
Knowledge and Data Engineering, 1995.
[14] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and R. J. Smith. The
onion technique: indexing for linear optimization queries. ACM SIGMOD, 2000.
[15] S. Chaudhuri and V. Narasayya. Autoadmin ’what-if’ index analysis utility. In
Proceedings ACM SIG-MOD Conference, pages 367–378, 1998.
[16] S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool
for microsoft SQL server. In The VLDB Journal, pages 146–155, 1997.
[17] S. Choenni, H. Blanken, and T. Chang. On the selection of secondary indexes
in relational databases. Data and Knowledge Engineering, 1993.
[18] R. L. D. C. Costa and S. Lifschitz. Index self-tuning with agent-based databases.
In XXVIII Latin-American Conference on Informatics (CLIE), Montevideo,
Uruguay, 2002.
[19] A. Dogac, A. Y. Erisik, and A. Ikinci. An automated index selection tool for
oracle7: Maestro 7. Technical Report LBNL/PUB-3161, TUBITAK Software
Research and Development Center, 1994.
[20] H. Edelstein. Faster data warehouses. Information Week, December 1995.
[21] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middle-
ware. Journal of Computer and System Sciences, 2003.
[22] C. Faloutsos, T. Sellis, and N. Roussopoulos. Analysis of object oriented spatial
access methods. In SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD in-
ternational conference on Management of data, pages 426–439, New York, NY,
USA, 1987. ACM Press.
[23] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. E. Abbadi. Vector approx-
imation based indexing for non-uniform high dimensional data sets. In CIKM
’00.
[24] M. Frank, E. Omiecinski, and S. Navathe. Adaptive and automated index selec-
tion in rdbms. In International Conference on Extending Database Technologhy
(EDBT), Vienna, Austria, 1992.
[25] V. Gaede and O. Gunther. Multidimensional access methods. ACM Comput.
Surv., 30(2):170–231, 1998.
[26] A. Gionis, P. Indyk, and R. Motwani. Similarity searching in high dimensions
via hashing. In Proceedings of the Int. Conf. on Very Large Data Bases, pages
518–529, Edinburgh, Scotland, UK, September 1999.
[27] M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. Kluwer (Non-
convex Optimization and its Applications series), Dordrecht, 2005. In press.
[28] C.-W. C. Guang-Ho Cha. The gc-tree: a high-dimensional index structure
for similarity search in image databases. IEEE Transactions on MultiMedia,
4(2):235–247, June 2002.
[29] I. Guyon and A. Elissef. An introduction to variable and feature selection. Jour-
nal of Machine Learning Research, 2003.
[30] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate gen-
eration. In W. Chen, J. Naughton, and P. A. Bernstein, editors, 2000 ACM
SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, 05
2000.
[31] G. R. Hjaltason and H. Samet. Ranking in spatial databases. In Symposium on
Large Spatial Databases, pages 83–95, 1995.
[32] V. Hristidis, N. Koudas, and Y. Papakonstantinou. PREFER: A system for the
efficient execution of multi-parametric ranked queries. In SIGMOD Conference,
2001.
[33] I. Inc. Informix decision support indexing for the enterprise data warehouse.
http://www.informix.com/informix/corpinfo/- zines/whiteidx.htm.
[34] M. Ip, L. Saxton, and V. Raghavan. On the selection of an optimal set of indexes.
IEEE Transactions on Software Engineering, 1983.
[35] S. Kai-Uwe, E. Schallehn, and I. Geist. Autonomous query-driven index tun-
ing. In International Database Engineering & Applications Symposium, Coimbra,
Portugal, 2004.
[36] R. Kohavi and G. John. Wrappers for feature subset selection. AI, 1997.
[37] F. Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor
queries. In SIGMOD, 2000.
[38] M. S. Lobo. Robust and Convex Optimization with Applications in Finance. PhD
thesis, The Department of Electrical Engineering, Stanford University, March
2000.
[39] B. S. Manjunath. Airphoto dataset. http://vivaldi.ece.ucsb.edu/Manjunath/research.htm,
May 2000.
[40] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of
image data. IEEE Transactions on Pattern Analysis and Machine Intelligence,
18(8):837–842, August 1996.
[41] R. P. Mount. The Office of Science Data-Management Challenge. DOE Office
of Advanced Scientific Computing Research, March–May 2004.
[42] Y. Nesterov and A. Nemirovsky. A general approach to polynomial-time algo-
rithms design for convex programming. Technical report, Centr. Econ. & Math.
Inst., USSR Acad. Sci., Moscow, USSR, 1988.
[43] Y. Nesterov and A. Nemirovsky. Interior-Point Polynomial Algorithms in Convex
Programming. SIAM, Philadelphia, PA 19101, USA, 1994.
[44] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent
closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining
and Knowledge Discovery, pages 21–30, 2000.
[45] S. Ponce, P. M. Vila, and R. Hersch. Indexing and selection of data items in
huge data sets by constructing and accessing tag collections. In Proceedings of
the 19th IEEE Symposium on Mass Storage Systems and Tenth Goddard Conf.
on Mass Storage Systems and Technologies, 2002.
[46] M. Ramakrishna. In Indexing Goes a New Direction., volume 2, page 70, 1999.
[47] U. M. L. Repository. The us census1990 data set.
http://kdd.ics.uci.edu/databases/census1990/USCensus1990-desc.html.
[48] N. Roussopoulos, S. Kelly, and F. Vincent. Nearest neighbor queries. In SIG-
MOD, 1995.
[49] A. Shoshani, L. Bernardo, H. Nordberg, D. Rotem, and A. Sim. Multi-
dimensional indexing and query coordination for tertiary storage management.
In Proceedings of the 11th International Conference on Scientific and Statistical
Data(SSDBM), 1999.
[50] R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes for scientific data.
ACM Trans. Database Syst., 32(3):16, 2007.
[51] M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with prob-
abilistic guarantees. VLDB, 2004.
[52] G. Valentin, M. Zuliani, D. Zilio, G. Lohman, and A. Skelley. Db2 advisor: An
optimizer smart enough to recommend its own indexes. In Proceedings Interna-
tional Conference Data Engineering, 2000.
[53] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance
study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.
[54] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance
study for similarity-search methods in high-dimensional spaces. In Proceedings
of the 24th International Conference on Very Large Databases, pages 194–205,
1998.
[55] K. Whang. Index selection in relational databases. In International Conference
on Foundations on Data Organization (FODO), Kyoto, Japan, 1985.
[56] K. Wu, W. Koegler, J. Chen, and A. Shoshani. Using bitmap index for interactive
exploration of large datasets. In Proceedings of SSDBM, 2003.
[57] K. Wu, E. Otoo, and A. Shoshani. Compressing bitmap indexes for faster search
operations. In SSDBM, 2002.
[58] K. Wu, E. Otoo, and A. Shoshani. On the performance of bitmap indices for high
cardinality attributes. Lawrence Berkeley National Laboratory. Paper LBNL-
54673., March 2004.
[59] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indexes with efficient
compression. ACM Transactions on Database Systems, 2006.
[60] Z. Zhang, S. Hwang, K. C.-C. Chang, M. Wang, C. Lang, and Y. Chang. Boolean
+ ranking: Querying a database by k-constrained optimization. In SIGMOD,
2006.
ABSTRACT

The proliferation of massive data sets across many domains and the need to gain meaningful insights from these data sets highlight the need for advanced data retrieval techniques. Because I/O cost dominates the time required to answer a query, sequentially scanning the data and evaluating each data object against query criteria is not an effective option for large data sets. An effective solution should require reading as small a subset of the data as possible and should be able to address general query types. Access structures built over single attributes also may not be effective because 1) they may not yield performance that is comparable to that achievable by an access structure that prunes results over multiple attributes simultaneously and 2) they may not be appropriate for queries with results dependent on functions involving multiple attributes. Indexing a large number of dimensions is also not effective, because either too many subspaces must be explored or the index structure becomes too sparse at high dimensionalities. The key is to find solutions that allow for much of the search space to be pruned while avoiding this ‘curse of dimensionality’. This thesis pursues query performance enhancement using two primary means 1) processing the query effectively based on the characteristics of the query itself and 2) physically organizing access to data based on query patterns and data characteristics. Query performance enhancements are described in the context of several novel applications

ii

including 1) Optimization Queries, which presents an I/O-optimal technique to answer queries when the objective is to maximize or minimize some function over the data attributes, 2) High-Dimensional Index Selection, which offers a cost-based approach to recommend a set of low dimensional indexes to effectively address a set of queries, and 3) Multi-Attribute Bitmap Indexes, which describes extensions to a traditionally single-attribute query processing and access structure framework that enables improved query performance.

iii

ACKNOWLEDGMENTS

First of all, I would like to thank my wife, Moni Wood, for her love and support. She has made many sacrifices and expended much effort to help make this dream a reality. I would not have been to complete this dissertation without her. I also would not have been able to accomplish this without the help of my family. I thank my mother Marilyn, my sister Susan and my grandmother Margaret for their support. I thank my late father Murray for providing an example to which I have always aspired. Moni’s family was also understanding and supportive of the travails of Ph.D research. I want to express my appreciation to my adviser, Hakan Ferhatosmanoglu, for his guidance and encouragement over the years. He has been generous with his time, effort, and advice during my graduate studies. I am thankful for the technical discussions we have had and the opportunities he has given me to work on interesting projects. Many people in the department have helped me to get to this point. I specifically want to thank my dissertation committee members, Dr. Atanas Rountev and Dr. Hui Fang, for their efforts and valuable comments on my research. I wish to thank Dr. Bruce Weide for his help and support during my studies, Dr. Rountev for my early exposure to research and paper writing, and Dr. Jeff Daniels for giving me the opportunity to work on an interdisciplinary geo-physics project. iv

Many friends and members of the Database Research Group provided valuable guidance, discussions, and critiques over the past several years. I want to especially thank Guadalupe Canahuate for her help and friendship during this time. Also thanks to her family for providing enough Dominican coffee to get through all the paper deadlines. My Ph.D. research was supported by NSF grant NSF IIS -0546713.

v

VITA

May 10, 1969 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - Omaha, Nebraska 1993 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Electrical Engineering, Ohio State University, Columbus Ohio 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.S. Computer Science and Engineering, Ohio State University, Columbus Ohio 1989-1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Electrical Engineering Co-op, British Petroleum, Various Locations 1994-2002 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Project Manager, Acquisition Logistics Engineering, Worthington, Ohio 2003-present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Assistant, Office Of Research, Department of Computer Science and Engineering, Department of Geology Ohio State University, Columbus, Ohio

PUBLICATIONS
Research Publications Michael Gibas, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Online Index Recommendations for High-Dimensional Databases Using Query Workloads. IEEE Transactions on Knowledge and Data Engineering (TDKE) Journal, February 2008 (Vol. 20, no.2), pp. 246-260

vi

Indexing Incomplete Databases. Developing Indices for High Dimensional Databases Using Query Access Patterns. International Conference on Scientific and Statistical Database Management (SSDBM). 14 pp. Germany. OSU-CISRC-4/06-TR45. ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’04). Hakan Ferhatosmanoglu. Encyclopedia of Geographical Information Science. LexisNexis. September. 2006/March. Michael Gibas. International Conference on Extending Database Technology (EDBT). Hakan Ferhatosmanoglu. 2005/October Atanas Rountev. Austria. Banff. pp. Michael Gibas. Vienna. 15 pp. Guadalupe Canahuate. Dayton. Hakan Ferhatosmanoglu. Ohio State Technical Report. 33rd International Conference on Very Large Data Bases (VLDB 07). Fast Semantic-based Similarity Searches in Text Repositories. Ohio State Technical Report. (E-GIS) Michael Gibas. 2007. Indexes for Databases with Missing Data. Munich. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE’04). Canada. A General Framework for Modeling and Processing Optimization Queries. OSUCISRC-10/07-TR74. Masters Thesis. Ning Zheng. June 2004 Atanas Rountev. pages 14-16. Tan Apaydin. Michael Gibas. Update Conscious Bitmap Indexes. Hakan Ferhatosmanoglu. pages 1-11. Scott Kagan. Hakan Ferhatosmanoglu. and Michael Gibas. Static and Dynamic Analysis of Call Chains in Java. 2007/July Guadalupe Canahuate. pp. July 2004 Fatih Altiparmak. and Michael Gibas. and Hakan Ferhatosmanoglu. Michael Gibas. Hakan Ferhatosmanoglu. Conference on Using Metadata to Manage Unstructured Text. OH. Scott Kagan. Guadalupe Canahuate. 884-901 Michael Gibas. Michael Gibas. June 2004 FIELDS OF STUDY vii . Relationship Preserving Feature Selection for Unlabelled Time-Series.Michael Gibas. 10691080 Guadalupe Canahuate. High-Dimensional Indexing. Evaluating the Imprecision of Static Analysis.

Major Field: Computer Science and Engineering viii .

TABLE OF CONTENTS

Abstract
Acknowledgments
Vita
List of Figures

Chapters:

1. Introduction
   1.1 Overall Technical Motivation
   1.2 Novel Database Management Applications

2. Modeling and Processing Optimization Queries
   2.1 Introduction
   2.2 Related Work
   2.3 Query Model Overview
      2.3.1 Background Definitions
      2.3.2 Common Query Types Cast into Optimization Query Model
      2.3.3 Optimization Query Applications
      2.3.4 Approach Overview
   2.4 Query Processing Framework
      2.4.1 Query Processing over Hierarchical Access Structures
      2.4.2 Query Processing over Non-hierarchical Access Structures
      2.4.3 Non-Covered Access Structures
   2.5 I/O Optimality
   2.6 Performance Evaluation
   2.7 Conclusions

3. Online Index Recommendations for High-Dimensional Databases using Query Workloads
   3.1 Introduction
   3.2 Related Work
      3.2.1 High-Dimensional Indexing
      3.2.2 Feature Selection
      3.2.3 Index Selection
      3.2.4 Automatic Index Selection
   3.3 Approach
      3.3.1 Problem Statement
      3.3.2 Approach Overview
      3.3.3 Proposed Solution for Index Selection
      3.3.4 Proposed Solution for Online Index Selection
      3.3.5 System Enhancements
   3.4 Empirical Analysis
      3.4.1 Experimental Setup
      3.4.2 Experimental Results
   3.5 Conclusions

4. Multi-Attribute Bitmap Indexes
   4.1 Introduction
   4.2 Background
   4.3 Proposed Approach
      4.3.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries
      4.3.2 Query Execution with Multi-Attribute Bitmaps
      4.3.3 Index Updates
   4.4 Experiments
      4.4.1 Experimental Setup
      4.4.2 Experimental Results
   4.5 Conclusions

5. Conclusion

Bibliography

LIST OF FIGURES

2.1 Sample Optimization Objective Function and Constraints. F(θ, A): (x − 1)² + (y − 2)², o(min), g1(α, A): −y + 3 ≤ 0, g2(α, A): y − 6 ≤ 0, h1(β, A): x − 5 = 0, k = 1
2.2 Optimization Query System Functional Block Diagram
2.3 Example of Hierarchical Query Processing
2.4 Cases of MBRs with respect to R and optimal objective function contour
2.5 Cases of partitions with respect to R and optimal objective function contour
2.6 Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram, 100k pts
2.7 CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram, 100k pts
2.8 Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts
2.9 Page Accesses, NN Query, 5-D Uniform Random, 100k Points
2.10 Accesses vs. CPU Time, Optimizing Random Function, R*-Tree, 8-D Color Histogram, 100k pts
2.11 Page Accesses vs. Index Type, Random Weighted Constrained k-LP Queries, 50k Points
3.1 Index Selection Flowchart
3.2 Dynamic Index Analysis Framework
3.3 Costs in Data Object Accesses for Ideal, Sequential Scan, AutoAdmin, Naive, and the Proposed Index Selection Technique using Relaxed Constraints
3.4 Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, synthetic Query Workload
3.5 Baseline Comparative Cost of Adaptive versus Static Indexing, stock Dataset, synthetic Query Workload
3.6 Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, hr Query Workload
3.7 Comparative Cost of Online Indexing as Support Changes, random Dataset, synthetic Query Workload
3.8 Comparative Cost of Online Indexing as Support Changes, stock Dataset, hr Query Workload
3.9 Comparative Cost of Online Indexing as Major Change Threshold Changes, random Dataset, synthetic Query Workload
3.10 Comparative Cost of Online Indexing as Major Change Threshold Changes, stock Dataset, hr Query Workload
3.11 Comparative Cost of Online Indexing as Multi-Dimensional Histogram Size Changes, random Dataset, clinical Query Workload
4.1 A WAH bit vector, 2M entries
4.2 Query Execution Example over WAH compressed bit vectors, ANDing 6 bitmaps
4.3 Query Execution Times vs Total Bitmap Length, Point Queries
4.4 Query Execution Times vs Number of Attributes queried, Point Queries
4.5 Simple multiple attribute bitmap example showing the combination of two single attribute bitmaps
4.6 Number of bitmaps involved in answering each query type
4.7 Total size of the bitmaps involved in answering each query type
4.8 Query Execution Time for each query type, Traditional versus Multi-Attribute Query Processing
4.9 Query Execution Times, Point Queries, Landsat Data
4.10 Query Execution Times as Query Ranges Vary
4.11 Single and Multi-Attribute Bitmap Index Size Comparison

CHAPTER 1

Introduction

If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search. I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor.
- Nikola Tesla, 1931

In order to make better use of the massive quantities of data being generated by and for modern applications, it is necessary to find information that is in some respect interesting within the data. Two primary challenges of this task are specifying the meaning of interesting within the context of an application and searching the data set to find the interesting data efficiently. In the case of our proverbial haystack, we can think of each piece of straw as a data object and needles as data objects that are interesting. While we could guarantee that we find the most interesting needle by examining each and every piece of straw, we would be much better served by limiting our search to a small region within the haystack.

1.1 Overall Technical Motivation

Analogous to our physical haystack example, the query response time during data exploration tasks using a database system is affected both by the physical organization of the data and by the way in which the data is accessed. As such, query performance can be enhanced by traversing the data more effectively during query resolution or by organizing access to the data so that it is more suitable for the query. Physically reorganizing data or data access structures is time consuming and therefore should be performed for a set of queries rather than by trying to optimize an organization for a single query. Additionally, queries can vary widely on a per query basis. Therefore, it is highly desirable to process generic queries at run-time in a way that is effective for each particular query.

This thesis targets query performance enhancement through two sources: 1) runtime query processing, such that the search paths and intermediate operations during query resolution are appropriate for a given generic query, and 2) data organization, such that the access structures available during query resolution are appropriate for the overall query patterns and data distribution and allow for significant data space pruning during query resolution. Query performance enhancement is explored using three introduced applications, motivated and described in the next section.

1.2 Novel Database Management Applications

Optimization Queries. Exploration of image, scientific, and business data usually requires iterative analyses that involve sequences of queries with varying parameters. Besides traditional queries, similarity-based analyses and complex mathematical and statistical operators are used to query such data [41]. Many of these data exploration queries can be cast as optimization queries where a linear or non-linear function over object attributes is minimized or maximized (our needle). Depending on the application, the object attributes queried, function parametric values, and data constraints can vary on a per-query basis. Because scientific data exploration and business decision support systems rely heavily on function optimization tasks, and because such tasks encompass such disparate query types with varying parameters, these application domains can benefit from a general model-based optimization framework. Such a framework should meet the following requirements: (i) have the expressive power to support solution of each query type, (ii) take advantage of query constraints to prune search space, and (iii) maintain efficient performance when the scoring criteria or optimization function changes.

A general optimization model and query processing framework are proposed for optimization queries, an important and significant subset of data exploration tasks. The model covers those queries where some function is being minimized or maximized over object attributes, possibly within a set of constraints. The query processing framework can be used in those cases where the function being optimized and the set of constraints is convex. For access structures where the underlying partitions are convex, the query processing framework is used to access the most promising partitions and data objects and is proven to be I/O-optimal.

High-Dimensional Index Selection. With respect to the physical organization of the data, access structures or indexes that cover query attributes can be used to prune much of the data space and greatly reduce the required number of page accesses to answer the query. However, due to the oft-cited curse of dimensionality, indexes built over many dimensions actually perform worse than sequential scan. This problem can be addressed by using sets of lower dimensional indexes that effectively answer a large portion of the incoming queries. This approach can work well if the characteristics of future queries are well known, but in general this is not the case, and an index set built based on historical queries may be completely ineffective for new queries.

In the case of physical organization, query performance is affected by creating a number of low dimension indexes using the historical and incoming query workloads. Historical query patterns are data mined to find those attribute sets that are frequently queried together. Potential indexes are evaluated to derive an initial index set recommendation that is effective in reducing query cost for a large portion of queries such that the set of indexes is within some indexing constraint. A set of indexes is obtained that in effect reduces the dimensionality of the query search space and is effective in answering queries. A control feedback mechanism is incorporated so that the performance of newly arriving queries can be monitored, and if the current index set is no longer effective for the current query patterns and data statistics, the index set can be changed. This flexible framework can be tuned to maximize indexing accuracy or indexing speed, which allows it to be used for the initial static index recommendation and also for online dynamic index recommendation.

Multi-Attribute Bitmap Indexes. The nature of the database management system used and the way in which data is physically accessed also affects query execution times. Bitmap indexes have been shown to be effective as a data management system for many domains such as data warehousing and scientific applications. This owes to the fact that they are a column-based storage structure, that only the data needed to address a query is ever retrieved from disk, and that their base operations are performed using hardware supported fast bitwise operations. Static bitmap indexes have traditionally been built over each attribute individually. The query model for selection queries is simple and involves ORing together bitcolumns that match query criteria within an attribute and ANDing inter-attribute intermediate results. As a result, the bitmap index structure is limited to those domains that it is naturally suited for without change. As well, query resolution does not take advantage of potential multi-attribute pruning and possible reductions in the number of binary operations. Multi-attribute bitmap indexes are introduced to extend the functionality and applicability of traditional bitmap indexes. Methods to determine attribute combination candidates are provided. A query processing model where multi-attribute bitmaps are available for query resolution is presented. For many queries, query execution times are improved by reducing the size of bitmaps used in query execution and by reducing the number of binary operations required to resolve the query, without excessive overhead for queries that do not benefit from multi-attribute pruning.
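To make the selection-query model just described concrete (OR the bit columns that satisfy the predicate within each attribute, then AND the per-attribute intermediate results), the following minimal Python sketch represents each uncompressed bit vector as a plain integer. The binning scheme, the data, and the function names are illustrative assumptions and are not the implementation evaluated in this thesis.

```python
def build_bitmaps(rows, num_bins):
    """One bit vector (a Python int) per (attribute, bin); bit i is set when row i falls in that bin."""
    num_attrs = len(rows[0])
    bitmaps = [[0] * num_bins for _ in range(num_attrs)]
    for i, row in enumerate(rows):
        for attr, bin_id in enumerate(row):
            bitmaps[attr][bin_id] |= 1 << i
    return bitmaps

def select(bitmaps, query):
    """query maps an attribute index to the bin ids that satisfy its predicate.
    OR within an attribute, AND across attributes."""
    result = None
    for attr, bins in query.items():
        attr_bits = 0
        for b in bins:                                   # OR the matching bins of this attribute
            attr_bits |= bitmaps[attr][b]
        result = attr_bits if result is None else result & attr_bits   # AND across attributes
    return result                                        # set bits identify the qualifying rows

rows = [(0, 2), (1, 2), (0, 1), (1, 0)]                  # two attributes, already binned
bm = build_bitmaps(rows, num_bins=3)
hits = select(bm, {0: [0], 1: [1, 2]})                   # attribute 0 = 0 AND attribute 1 in {1, 2}
print([i for i in range(len(rows)) if (hits >> i) & 1])  # -> [0, 2]
```

A production bitmap index would store compressed vectors (for example WAH encoded) rather than raw integers, but the OR-within-attribute and AND-across-attribute structure of query execution is the same.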

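Similarly, for the index-selection application described earlier in this section, a simple way to mine the historical workload for attribute sets that are frequently queried together is a direct support count over attribute combinations, as sketched below. The workload format, the size cap, and the support threshold are assumptions made for illustration; the actual mining and index-evaluation procedure is developed later in the thesis.

```python
from collections import Counter
from itertools import combinations

def frequent_attribute_sets(workload, max_size=3, min_support=0.5):
    """workload: list of sets, each holding the attributes referenced by one query.
    Returns the attribute combinations whose support meets min_support."""
    counts = Counter()
    for attrs in workload:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(attrs), size):
                counts[combo] += 1
    n = len(workload)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

workload = [{"price", "volume"}, {"price", "volume", "date"}, {"date"}, {"price", "volume"}]
print(frequent_attribute_sets(workload))
# ('price',), ('volume',) and ('price', 'volume') have support 0.75; ('date',) has support 0.5
```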
. While the answer to a given query can be found by computing the function for all data objects and finding those objects that yield the most extreme values. minimizing the I/O required to answer a query is a primary strategy in query optimization. the one that heralds new discoveries..CHAPTER 2 Modeling and Processing Optimization Queries The most exciting phrase to hear in science. this approach may not be feasible for large data sets. Scientific discovery and business intelligence activities are frequently exploring for data objects that are in some way special or unexpected. while ensuring that. Given the time expense of I/O. Such data exploration can be performed by iteratively finding those data objects that minimize or maximize some function over the data attributes. The optimization query framework presented in this chapter provides a method to generically handle an important broad class of queries. given an access structure.Isaac Asimov One challenge associated with data management is determining a cost-effective way to process a given query.’ . 6 . is not ‘Eureka!’ (I found it!) but ‘That’s funny . the query is I/O-optimally processed over that structure.

1 Introduction Motivation and Goal Optimization queries form an important part of realworld database system utilization. Such a framework should meet the following requirements: (i) have the expressive power to support existing query types and the power to formulate new query types. There has been a rich set of literature on processing specific types of queries. These queries can all be considered to be a type of optimization query with different objective functions. (ii) take advantage of query constraints to proactively prune search space. Besides traditional queries. the object attributes queried. Many of these data exploration queries can be cast as optimization queries where a linear or non-linear function over object attributes is minimized or maximized. and data constraints can vary on a per-query basis. function parametric values. and because such tasks encompass such disparate query types with varying parameters. However. similar-ity-based analyses and complex mathematical and statistical operators are used to query such data [41].2. such as the query processing algorithms for Nearest Neighbors (NN) [48] and its variants [37]. scientific. Because scientific data exploration and business decision support systems rely heavily on function optimization tasks. 7 . and (iii) maintain efficient performance when the scoring criteria or optimization function changes. these application domains can benefit from a general model-based optimization framework. top-k queries [21. Additionally. especially for information retrieval and data analysis. and business data usually requires iterative analyses that involve sequences of queries with varying parameters. they either only deal with specific function forms or require strict properties for the query functions. Exploration of image. etc. 51].
Despite an extensive list of techniques designed specifically for different types of optimization queries, a unified query technique for a general optimization model has not been addressed in the literature. Furthermore, application of specialized techniques requires that the user know of, possess, and apply the appropriate tools for the specific query type.

We propose a generic optimization model to define a wide variety of query types that includes simple aggregates, range and similarity queries, and complex user-defined functional analysis. Since the queries we are considering do not have a specific objective function but only have a "model" for the objective (and constraints) with parameters that vary for different users/queries/iterations, we refer to them as model-based optimization queries. In conjunction with a technique to model optimization query types, we also propose a query processing technique that optimally accesses data and space partitioning structures to answer the query. By applying this single optimization query model with I/O-optimal query processing, we can provide the user with a flexible and powerful method to define and execute arbitrary optimization queries efficiently. Example optimization query applications are provided in Section 2.3.

Contributions. The primary contributions of this work are listed as follows.

• We propose a general framework to execute convex model-based optimization queries (using linear and non-linear functions) and for which additional constraints over the attributes can be specified without compromising query accuracy or performance.

• We define query processing algorithms for convex model-based optimization queries utilizing data/space partitioning index structures without changing the index structure properties and the ability to efficiently address the query types for which the structures were designed.

• We prove the I/O optimality for these types of queries for data and space partitioning techniques when the objective function is convex and the feasible region defined by the constraints is convex (these queries are defined as CP queries in this paper).

• We introduce a generic model and query processing framework that optimally addresses existing optimization query types studied in the research literature as well as new types of queries with important applications.

2.2 Related Work

There are few published techniques that cover some instances of model-based optimization queries. One is the "Onion technique" [14], which deals with an unconstrained and linear query model. An example linear model query selects the best school by ranking the schools according to some linear function of the attributes of each school. The solution followed in the Onion technique is based on constructing convex hulls over the data set. The computation of convex hulls over the whole data set is expensive and, in the case of high-dimensional data, infeasible in design time. It is infeasible for high dimensional data because the computation of convex hulls is exponential with respect to the number of dimensions and is extremely slow. It is also not applicable to constrained linear queries because convex hulls need to be recomputed for each query with a different set of constraints.

The Prefer technique [32] is used for unconstrained and linear top-k query types. The access structure is a sorted list of records for an arbitrary linear function. Query processing is performed for some new linear preference function by scanning the sorted list until a record is reached such that no record below it could match the new query. This avoids the costly build time associated with the Onion technique and requires no additional storage space, but still does not incorporate constraints. While the Onion and Prefer techniques organize data to efficiently answer a specific type of query (LP), our technique organizes data retrieval to answer more general queries.

Boolean + Ranking [60] offers a technique to query a database to optimize boolean constrained ranking functions. However, it is limited to boolean rather than general convex constraints, addresses only single dimension access structures (i.e., can not optimize on multiple dimensions simultaneously), and therefore largely uses heuristics to determine a cost effective search strategy, but does not provide a guarantee on minimum I/O. In contrast with the other techniques listed, our proposed solution is proven to be I/O-optimal for both hierarchical and space partitioned access structures with convex partition boundaries, and covers a broader spectrum of optimization queries.

2.3 Query Model Overview

2.3.1 Background Definitions

Let Re(a1, a2, ..., an) be a relation with attributes a1, a2, ..., an. Without loss of generality, assume all attributes have the same domain. Denote by A a subset of attributes of Re. Let gi(α, A), 1 ≤ i ≤ u, hj(β, A), 1 ≤ j ≤ v, and F(θ, A) be certain functions over attributes in A, where u and v are positive integers, θ = (θ1, θ2, ..., θm) is a vector of parameters for function F, and α and β are vector parameters for the constraint functions.

Definition 1 (Model-Based Optimization Query) Given a relation Re(a1, a2, ..., an), a model-based optimization query Q is given by (i) an objective function F(θ, A) with optimization objective o (min or max), (ii) a set of constraints (possibly empty) specified by inequality constraints gi(α, A) ≤ 0, 1 ≤ i ≤ u (if not empty) and equality constraints hj(β, A) = 0, 1 ≤ j ≤ v (if not empty), and (iii) user/query adjustable objective parameters θ, constraint parameters α and β, and answer set size integer k. The answer of a model-based optimization query is a set of tuples with maximum cardinality k satisfying all the constraints such that there exists no other tuple with smaller (if minimization) or larger (if maximization) objective function value F satisfying all constraints as well.

This definition is very general, and almost any type of query can be considered as a special case of model-based optimization query. For instance, NN queries over an attribute set A can be considered as model-based optimization queries with F(θ, A) as the distance function (e.g., Euclidean) and the optimization objective is minimization. Similarly, top-k queries, weighted NN queries and linear/non-linear optimization queries can all be considered as specific model-based optimization queries. Without loss of generality, we will consider minimization as the objective optimization throughout the rest of this chapter.

Figure 2.1 shows a sample objective function with both inequality and equality constraints. The objective function finds the nearest neighbor to point a (1, 2). The constraints limit the feasible region to the line passing through x = 5, where 3 ≤ y ≤ 6. The points b and d represent the best and worst point within the constraints for the objective function respectively. Point c could be the actual point in the database that meets the constraints and minimizes the objective function.

Figure 2.1: Sample Optimization Objective Function and Constraints. F(θ, A): (x − 1)² + (y − 2)², o(min), g1(α, A): −y + 3 ≤ 0, g2(α, A): y − 6 ≤ 0, h1(β, A): x − 5 = 0, k = 1

Definition 2 (Convex Optimization (CP) Query) A model-based optimization query Q is a Convex Optimization (CP) query if (i) F(θ, A) is convex, (ii) gi(α, A), 1 ≤ i ≤ u (if not empty) are convex, and (iii) hj(β, A), 1 ≤ j ≤ v (if not empty) are linear or affine.

The conditions (i), (ii) and (iii) form a well known type of problem in the Operations Research literature: Convex Optimization problems. Notice that the definition of a CP query does not have any assumptions on the specific form of the objective function. The only assumptions are that the queries can be formulated into convex functions and the feasible regions defined by the constraints of the queries are convex (e.g., polyhedron regions). Therefore, users can ask any form of queries with any coefficients as long as these assumptions hold. A function is convex if it is second differentiable and its Hessian matrix is positive definite. More intuitively, if one travels in a straight line from inside a convex region to outside the region, it is not possible to re-enter the convex region. Therefore, the set of CP problems is a superset of all least square problems, linear programming problems (LP) and convex quadratic programming problems (QP) [27].

2.3.2 Common Query Types Cast into Optimization Query Model

Although appearing to be restricted in the functions F, gi and hj, CP queries cover a wide variety of linear and non-linear queries, including NN queries, top-k queries, linear optimization queries, and angular similarity queries. The formulations of some common query types as CP optimization queries are listed and discussed below.

Euclidean Weighted Nearest Neighbor Queries - The Weighted Nearest Neighbor query asking the NN for a point (a01, a02, ..., a0n) can be considered as an optimization query asking for a point (a1, a2, ..., an) such that an objective function is minimized. For Euclidean WNN the objective function is

WNN(a1, a2, ..., an) = w1(a1 − a01)² + w2(a2 − a02)² + ... + wn(an − a0n)²,

where w1, w2, ..., wn > 0 are the weights and can be different for each query. Traditional Nearest Neighbor queries are the subset of weighted nearest neighbor queries where the weights are all equal to 1.

Linear Optimization Queries - The objective function of a Linear Optimization query is

L(a1, a2, ..., an) = c1a1 + c2a2 + ... + cnan,

where (c1, c2, ..., cn) ∈ Rn are coefficients and can be different for different queries, and the constraints form a polyhedron. Since linear functions are convex and polyhedrons are convex, linear optimization queries are also special cases of CP queries with a parametric function form.

Top-k Queries - The objective functions of top-k queries are score (or ranking) functions over the set of attributes. The common score functions are sum and average, but could be any arbitrary function over the attributes. Clearly, sum and average are special cases of linear optimization queries. If the ranking function in question is convex, the top-k query is a CP query.

Range Queries - A range query asks for all data (a1, a2, ..., an) s.t. li ≤ ai ≤ ui, for some (or all) dimensions i, where [li, ui] is the range for dimension i. Range queries can be formed as special cases of CP queries with constant objective functions applying the range limits as constraints: construct an objective function as f(a1, a2, ..., an) = C, where C is any constant, and constraints gi = li − ai ≤ 0, gn+i = ai − ui ≤ 0, i = 1, ..., n. Any points that are within the constraints will get the objective value C. Points that are not within the constraints are pruned by those constraints. Those points that remain all have an objective value of C and, because they all have the same maximum (minimum) objective value, all form the solution set of the range query. Also note that our model can not only deal with the 'traditional' hyper-rectangular range queries as described above, but also 'irregular' range queries such as l ≤ a1 + a2 ≤ u and l ≤ 3a1 − 2a2 ≤ u.

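Returning to the running example of Figure 2.1 (minimize (x − 1)² + (y − 2)² subject to −y + 3 ≤ 0, y − 6 ≤ 0 and x − 5 = 0), the relaxed, continuous-space problem can be handed to any off-the-shelf convex solver. The sketch below uses the cvxpy package purely as an illustration; the chapter does not prescribe a particular solver implementation.

```python
import cvxpy as cp

# Relaxed (continuous-space) version of the Figure 2.1 query:
# minimize F(x, y) = (x - 1)^2 + (y - 2)^2
# subject to -y + 3 <= 0, y - 6 <= 0, x - 5 = 0.
x, y = cp.Variable(), cp.Variable()
objective = cp.Minimize(cp.square(x - 1) + cp.square(y - 2))
constraints = [-y + 3 <= 0, y - 6 <= 0, x - 5 == 0]
problem = cp.Problem(objective, constraints)
problem.solve()

print(problem.value)     # best objective value achievable anywhere in the feasible region
print(x.value, y.value)  # the optimizer, here (5, 3), corresponding to the best feasible point b
```

The same formulation, with partition boundaries added as extra constraints, is what the query processor solves repeatedly while traversing an access structure.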
In such a case. Some hard exclusion criteria is known and some attributes are more important than others with respect to estimating participation probability and suitability. kNN with Adaptive Scoring . we will change the nearest neighbor scoring weights dependent on the current population set. and solved using our framework.3 Optimization Query Applications Any problem that can be rewritten as a convex optimization problem can be cast to our model. In a case where we want half of the participants to be male and 15 .2.3.In some situations we may have an objective function that changes based on the characteristics of a population set we are trying to fill. We provide a short list of potential optimization query applications that could be relevant during scientific data exploration or business decision support where the problem being optimized is continuously evolving. Weighted Constrained Nearest Neighbors . As an example. This type of query would be applicable in situations where attribute distances vary in importance and some constraints to the result set are known. Weighted Constrained Nearest Neighbors (WCNN) find the closest weighted distance object that exists within some constrained space.Weighted nearest neighbor queries give the ability to assign different weights for different attribute distances for nearest neighbor queries. A potential application is clinical trial patient matching where an administrator is looking to find the best candidate patients to fill a clinical trial. this generic framework can be used to solve the problem by efficiently accessing available access structures. In cases where the optimization problem or objective function parameters are not known in advance. again consider the patient clinical trial matching application where we want to maintain a certain statistical distribution over the trial participants.

the data attributes may be subject to lower and upper bounds. functional attributes. For example. Queries over Changing Attributes . constraints. 2. All of the discussed applications can be stated as an objective function with a objective of maximization or minimization. Weights. and optimization functions themselves can all change on a per-query basis. We can solve a continuous CP problem over a convex partition of space by optimizing the function of interest within the 16 . This demonstrates the power and flexibility that the user has to define data exploration queries and the examples represent a small subset of the query types that are possible. The solution can optionally be subject to some convex criteria over the data attributes. The goal in CP is optimize (minimize or maximize) some convex function over data attributes. In the examples listed above.4 Approach Overview We propose the use of Convex Optimization (CP) in order to traverse access structures to find optimal data objects. we can adjust weights of the objective optimization function to increase the likelihood that future trial candidates will match the currently underrepresented gender. each query or each member of a set of queries can be rewritten as an optimization query in our model. a series of optimization queries may search a stock database for maximum stock gains over different time intervals.The attributes involved in optimization queries can vary based on the iteration of the query.3. A database system that can effectively handle the potential variations in optimization queries will benefit data exploration tasks. The attributes involved in each query will be different. Additionally.half female.

the partition is found to be infeasible for the problem. Most access structures are built over convex partitions.constraints of that space. Particularly common partitions are Minimum Bounding Rectangles (MBRs). Given the function. and a set of partitions. 17 . partition constraints. Rectangular partitions can be represented by lower and upper bounds over data attributes and can be directly addressed by the CP problem bounds. Given a function. The solution of these CP problems allows us to prune partitions and search paths that can not contain an optimal answer during query processing. If no solution exists (problem constraints and partition constraints do not intersect). The partition and its children can be eliminated from further consideration. model constraints. The access structure partitions to search and evaluate during query processing are determined by solving CP problems that incorporate the objective function. problem constraints. If the problem under analysis has constraints in addition to the constraints imposed by the partition boundaries. the partition the yields the best CP result with respect to the function under analysis is the one that contains some space that optimizes the problem under analysis better than the other partitions. Other convex partitions can be addressed with data attribute constraints (such as x + y < 5). we can use CP to find the optimal answer for the function within the intersection of the problem constraints and partition constraints. they can also be added to the CP problem as constraints. With our technique we endeavor to minimize the I/O operations and access structure search computations required to perform arbitrary model-based optimization queries. These partition functional values can be used to order partitions according to how promising they are with respect to optimizing the function within problem constraints. and problem constraints. and partition constraints.

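The per-partition bound described in this section (optimize the objective over the intersection of the query's feasible region and a partition's boundaries) can be sketched as follows, again with cvxpy standing in for the CP solver. Rectangular (MBR) partitions given by corner vectors, and the example query at the bottom, are assumptions for illustration only.

```python
import cvxpy as cp
import numpy as np

def partition_lower_bound(objective_fn, constraint_fn, mbr_lo, mbr_hi):
    """Best objective value achievable inside the intersection of the query's feasible
    region and an MBR partition; +inf when that intersection is empty (infeasible)."""
    a = cp.Variable(len(mbr_lo))
    constraints = constraint_fn(a) + [a >= np.asarray(mbr_lo), a <= np.asarray(mbr_hi)]
    problem = cp.Problem(cp.Minimize(objective_fn(a)), constraints)
    problem.solve()
    return problem.value if problem.status in ("optimal", "optimal_inaccurate") else float("inf")

# Hypothetical weighted NN query: w-weighted squared distance to q, constrained to a half-space.
w, q = np.array([1.0, 4.0]), np.array([1.0, 2.0])
objective = lambda a: cp.sum(cp.multiply(w, cp.square(a - q)))
query_constraints = lambda a: [a[0] + a[1] <= 8]
print(partition_lower_bound(objective, query_constraints, mbr_lo=[3, 3], mbr_hi=[6, 6]))
```

Partitions whose bound comes back as +inf lie outside the feasible region and are dropped; the finite bounds are what order the remaining partitions in the worklist.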
which can be applied to existing space or data partitioning-based indices without losing the capabilities of those indexing techniques to answer other types of queries. We focus on query processing techniques by presenting a technique for general CP objective functions. The basic idea of our technique is to divide the query processing into two phases: First.e. we found very little time difference in the functions we examined. CP problems can be so efficiently solved that Boyd and Vandenberghe stated “if you formulate a practical problem as a convex optimization problem. (i. Details for our technique for different index types are provided in the following section.” ([10] page 8). 18 . search through the index by pruning the impossible partitions and access the most promising pages. Current implementations of CP algorithms can solve problems with hundreds of variables and constraints in very short times on a personal computer [38]. Second.The CP literature discusses the efficiency of CP problem solution. solve the CP problem without considering the database. We are in effect replacing the bound computations from existing access structure distance optimization algorithms with a CP problem that covers the objective function and constraints. then you have solved the original problem. Although such computations can be more complex. It can be solved in polynomial time in the number of dimensions in the objective functions and the number of constraints ([43]). In [42]. it is stated that a CP problem can be very efficiently solved with polynomial-time algorithms. The readers can refer to [43] for a detailed review on interior-point polynomial algorithms for CP problems. in continuous space). This “relaxed” optimization problem can be solved efficiently in the main memory using many possible algorithms in the convex optimization literature.

there exists a lower bound for each partition in the database that can be used to prune the data space before retrieving pages from disk. By solving CP subproblems using index partition boundaries as additional constraints. partition constraints.2. Figure 2. and objective function. A lightweight query compiler validates the user input.2 is a functional block diagram of the optimization query processing system that shows interactions between components. The bounds can be defined for any type of convex partition (such as minimum bounding regions) used for indexing the data. we find the best possible objective function value achievable by a point in the partition. The query processor invokes a CP solver to find bounds for partitions that could potentially contain answers to the query. converts the function and constraints into appropriate format for the query processor. The generic query processing framework for a CP query is described below. we can ensure that we access pages in the order of how promising they are. Thus. The user gets back those data object(s) in the database that answer the query within the constraints. This framework can be applied to indexing structures in which the partitions on which they are built are convex because the partition boundaries themselves are used to form the new CP problems. Query processing proceeds first by the user providing an objective function and constraints to the system. and executes the appropriate query processing algorithm using the specified access structure. A popular taxonomy of access structures divides the structures into the categories of hierarchical and non-hierarchical structures.4 Query Processing Framework By solving CP problems associated with the problem constraints. Hierarchical structures are usually 19 .

20 .2. “worst” or upper bounds) in place of minimums. We terminate when we come to a partition that could not contain an optimal answer to the query.4.Figure 2. The overall idea is to traverse the access structure and solve the problem in sequential order of how good partitions and search paths could be. Solutions to maximization problems can be performed by using maximums (i.1 and non-hierarchical structures in 2.2: Optimization Query System Functional Block Diagram partitioned based on data objects while non-hierarchical structures are usually based on partitioning space.4.e. The implementations described address a minimization objective function and prune based on comparison to minimum values. Therefore we provide algorithms for hierarchical structures in Subsection 2. These two families have important differences that affect how queries are processed over them.

BSP-tree. If the user desires only the single best result.. i.1 Query Processing over Hierarchical Access Structures Algorithm 1 shows the query processing over hierarchical access structures. P-tree. if all objects that meet the range constraints are desired.4. 21 .5 we show that both of these algorithms are in fact optimal. A partial list of structures covered by this algorithm includes the R-tree. Only the data points within the constrained region will be returned. Now.e. Oct-tree. For range queries. k = 1. B-tree. X-tree. SS-tree. This algorithm is applicable to any hierarchical access structure where the partitions at each tree level are convex. LSD-tree. k-d-B-tree. R+-tree. Buddy-tree. Quad-tree. R*-tree. be an arbitrary value in order to return the k best data objects. GBD-tree. In Section 2. 9]. Extended k-DTree. SKD-tree. Therefore. and SR-tree [25.The number of results that are returned by the query processing is dependent on an input k. the number of data points. It can however. we describe the implementations of this framework based on hierarchical and non-hierarchical indexing techniques. B+-tree. and only the points that are within the lowest-level partitions that intersect the constrained region will be examined during query processing. 2. the value of k should be set to n. the query processing will be at least as I/O-efficient as standard range query processing over a given access structure. no unnecessary MBRs (or partitions) given an access structure are accessed during the query processing.

rather than distance. If a point is not within the original problem constraints. our algorithm also prunes out any partitions and search paths that do not fall within problem constraints. The algorithm takes as input the convex optimization function of interest OF . If this partition is a leaf partition. In line 5.The algorithm is analogous to existing optimal NN processing [31]. we initialize the structures that we maintain throughout a query. then it will not have a feasible solution and will be discarded from further consideration. If the partition under analysis is not a leaf. we remove the first promising partition. and a desired result set size k. In lines 1-3. we solve the CP problem corresponding to the objective function. Points with objective function values lower than the objective function value of the k th best point so far will cause the best point and best point objective function value vectors to be updated accordingly. We keep a vector of the k best points found so far as well as a vector of the objective function values for these points. original problem constraints. Additionally. original problem constraints. We keep a worklist of promising partitions along with the minimum objective value for the partition. We initialize the worklist to contain the tree root partition. the set of convex constraints C. we find its child partitions. and partition constraints (line 11). For each of these child partitions. and the constraints imposed by the partition boundaries. except that partitions are traversed in order of promise with respect to an arbitrary convex function. then we access the points in the partition and compute their objective function values. This yields 22 . We solve a new CP problem for each partition that we traverse using the objective function. We then process entries from the worklist until the worklist is empty.
the best possible objective function value achievable within the intersection of the original problem constraints and partition constraints. If the minimum objective function value is less than the kth best objective function value (i.e., it is possible for a point within the intersection of the problem constraints and partition constraints to beat the kth best point), then the partition is inserted into the worklist according to its minimum objective value.

CPQueryHierarchical (Objective Function OF, Constraints C, k)
notes:
  b[i] - the ith best point found so far
  F(b[i]) - the objective function value of the ith best point
  L - worklist of (partition, minObjVal) pairs
 1: b[1], ..., b[k] ← null
 2: F(b[1]), ..., F(b[k]) ← +∞
 3: L ← (root)
 4: while L ≠ ∅
 5:   remove first partition P from L
 6:   if P is a leaf
 7:     access and compute functions for points in P within constraints C
 8:     update vectors b and F if better points discovered
 9:   else
10:     for each child Pchild of P
11:       minObjVal = CPSolve(OF, Pchild, C)
12:       if minObjVal < F(b[k])
13:         insert Pchild into L according to minObjVal
14:     end for
15:   end if
16:   prune partitions from L with minObjVal > F(b[k])
17: end while
18: return vector b

Algorithm 1: I/O Optimal Query Processing for Hierarchical Index Structures.

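A compact executable rendering of Algorithm 1's control flow is sketched below. The node interface (is_leaf, children, points), the cp_solve bound routine, and the modelling of constraints as callables with c(p) ≤ 0 meaning feasibility are assumptions made for illustration; equality constraints and I/O bookkeeping are omitted.

```python
import heapq
from itertools import count

def cp_query_hierarchical(root, objective, constraints, cp_solve, k):
    """Best-first traversal of a hierarchical index, following Algorithm 1.
    cp_solve(objective, constraints, node) returns a lower bound on the objective inside
    the node intersected with the query's feasible region, or +inf if that region is empty."""
    tie = count()
    frontier = [(cp_solve(objective, constraints, root), next(tie), root)]
    best = []                               # max-heap (by negated value) holding the k best points
    kth = float("inf")                      # objective value of the current k-th best point

    while frontier:
        bound, _, node = heapq.heappop(frontier)
        if bound > kth:
            break                           # no remaining partition can improve the answer set
        if node.is_leaf:
            for p in node.points:
                if any(c(p) > 0 for c in constraints):
                    continue                # outside the query's feasible region
                value = objective(p)
                if len(best) < k:
                    heapq.heappush(best, (-value, next(tie), p))
                elif value < kth:
                    heapq.heapreplace(best, (-value, next(tie), p))
                if len(best) == k:
                    kth = -best[0][0]
        else:
            for child in node.children:
                child_bound = cp_solve(objective, constraints, child)
                if child_bound < kth:       # prune partitions that cannot beat the k-th best
                    heapq.heappush(frontier, (child_bound, next(tie), child))

    return sorted(((-neg, p) for neg, _, p in best), key=lambda item: item[0])
```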
Point B. If there is no feasible solution for the partition within problem constraints. We initialize our worklist with the root and start by processing it.a. We do not process A further. then the partition is inserted into the worklist according to its minimum objective value. Query Processing Example As an example of query processing. B. and B.2.2. We process B.3. constrained 1-NN query). consider Figure 2.2.3 with an optimization objective of minimizing the distance to the query point within the constrained region (i. After a partition is processed.2.2 and since it is a leaf.a is the closer of the two.2. B.1. Since point e is closest.1. Since B is not a leaf.beat the k th best point). we may have updated our k th best objective function value. and f respectively. Since it is not a leaf. we process it next. and B. we prune it from the 24 . B. and the best point objective function is updated.3.e. we examine its child partitions. We solve for the minimum objective function value for A and do not obtain a feasible solution because A and the constrained region do not intersect. the partition will be dropped from consideration. We prune any partitions in the worklist that can not beat this value. We insert partition B into our worklist and since it is the only partition in the worklist. This point is designated as the best point thus far. we know we can do no worse than the distance from the query point to B.3 can not do better than this. After processing B. point c.2. we compute objective function values for the points in B. We find the minimum objective function for partition B and get the distance from the query point to the point where the constrained region and the partition B intersect. Since partition B. We find minimum objective function values for each and obtain the distances from the query point to point d. we insert partitions into the worklist in the order B. e. we examine its child partitions A and B.

a. We update the best point to be B. We examine only points in partitions that could contain points as good as the best solution.1.worklist.1.3: Example of Hierarchical Query Processing 25 .1. Point B. Since the worklist is now empty.a. Point B.1. we have completed the query and return the best point.a gets a value which is better than the current best point.1. Figure 2. We examine points in B. The optimal point for this optimization query this query is B.b is not a feasible solution (it is not in the constrained region) so it is dropped.

Grid-File. Those partitions that do not intersect the constraints will be bypassed. EXCELL. VA+-File. and result set size k as inputs. we show that the general framework for CP queries can also be implemented on top of non-hierarchical structures. Algorithm 2 can be applied to non-hierarchical access structures where the partitions are convex. 23]. but we use the objective function rather than a distance and also consider original problem constraints to find lower bounds. OF . In stage 2. we process the unpruned partitions by accessing objects contained by the first partition and computing actual objective values.4. For those that do have a feasible solution. This technique is similar to the technique identified in [53] for finding NN for a non-hierarchical structure. VA-File. 9.2 Query Processing over Non-hierarchical Access Structures In this section. 53. and Pyramid Technique [25. and continue accessing points contained by subsequent partitions until a partition’s lower bound is greater than the k th best value.2. By accessing partitions 26 . we place unique unpruned partitions in increasing sorted order based on their lower bound. We maintain a sorted list of the top-k objects we have found so far. We use a two stage approach detailed in Algorithm 2 to determine which data objects optimize the query within the constraints. the lower bound of the CP problem is recorded. In stage 1. We take the objective function. A partial list of structures where the algorithm can be applied is Linear Hashing. we process each populated partition vin the structure by solving the CP problem that corresponds to the original objective function and constraints combined with the additional constraints imposed by the partition boundaries (line 5). At the end of stage 1. We initialize the structures to be used during query processing (lines 1-3). original problem constraints C.
in order of how good they could be, according to the CP problem, we access only those points in cell representations that could contain points as good as the actual solution.

CPQueryNonHierarchical (Objective Function OF, Constraints C, k)
notes:
  b[i] - the ith best point found so far
  F(b[i]) - the objective function value of the ith best point
  L - worklist of (partition, minObjVal) pairs
Initialize:
 1: b[1], ..., b[k] ← null
 2: F(b[1]), ..., F(b[k]) ← +∞
 3: L ← ∅
Stage 1:
 4: for each populated partition P
 5:   minObjVal = CPSolve(OF, P, C)
 6:   if minObjVal < +∞
 7:     insert (P, minObjVal) into L according to minObjVal
Stage 2:
 8: for i = 1 to |L|
 9:   if minObjVal of partition L[i] > F(b[k])
10:     break
11:   Access objects that partition L[i] contains
12:   for each object o
13:     if OF(o) < F(b[k])
14:       insert o into b
15:       update F(b)

Algorithm 2: I/O Optimal Query Processing over Non-hierarchical Structures.

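The two-stage processing of Algorithm 2 can be sketched in the same style. The partitions iterable, the fetch method on a cell, and cp_solve are assumed interfaces rather than the implementation used in the experiments.

```python
def cp_query_non_hierarchical(partitions, objective, constraints, cp_solve, k):
    """Two-stage processing over a non-hierarchical (VA-file style) structure, following Algorithm 2."""
    # Stage 1: bound every populated partition and sort by how promising it is.
    bounded = []
    for cell in partitions:
        bound = cp_solve(objective, constraints, cell)
        if bound != float("inf"):                 # cells outside the feasible region are skipped
            bounded.append((bound, cell))
    bounded.sort(key=lambda item: item[0])

    # Stage 2: scan cells in bound order until no cell can beat the current k-th best value.
    best = []                                     # list of (value, object), kept sorted and trimmed to k
    for bound, cell in bounded:
        if len(best) == k and bound > best[-1][0]:
            break
        for obj in cell.fetch():
            if any(c(obj) > 0 for c in constraints):
                continue
            value = objective(obj)
            if len(best) < k or value < best[-1][0]:
                best.append((value, obj))
                best.sort(key=lambda item: item[0])
                del best[k:]
    return best
```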
3 Non-Covered Access Structures A review of access structure surveys yielded few structures that are not covered by the algorithms. space filling curves which are used for approximate ordering). We denote the objective function contour which goes through the actual optimal data point in the database as the optimal contour. 9].5 I/O Optimality In this section.e. 2. we define the optimality as follows. 28 . They include structures that contain non-convex regions (BANG-File. Since attribute information is lost. we use the optimal contour. hB-tree. The k th optimal contour would pass through all points in continuous space that yield the same objective function value as the k th best point in the database. These include structures that either do not guarantee spatial locality (i. we will show that the proposed query processing technique achieves optimal I/O for each implementation in Section 2. BV-tree) [25.4.2. they can not be applied to general functions over the original attributes). Definition 3 (Optimality) An algorithm for optimization queries is optimal iff it retrieves only the pages that intersect the intersection of the constrained region R and the optimal contour. They include structures built over values computed for a specific function (such as distance in Mtrees. Note that M-trees would still work to answer queries for the functions they are built over and the structures built with non-convex leaves could still be optimally traversed down to the level of non-convexity. Following the traditional definition in [5]. This contour also goes through all other points in continuous space that yield the same objective function value as the optimal point. but could replace them with the k th optimal contour.4. In the following arguments.

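Definition 3 can be written symbolically as follows; the contour notation C* below is introduced here only for compactness and is not used elsewhere in the chapter.

```latex
C^{*} = \{\, x : F(\theta, x) = F(\theta, o^{*}) \,\}, \qquad o^{*}\ \text{an optimal feasible data point}
```

An algorithm for optimization queries is then I/O-optimal if and only if every page P it retrieves satisfies P ∩ R ∩ C* ≠ ∅.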
i. It is easy to see that any partition intersecting the intersection of optimal contour and feasible region R will be accessed since they are not pruned in any phase of the algorithm. Figure 2. a is an optimal data point in the data base.4 shows different cases of the various position relations between a partition. D: intersects the optimal contour and R. We will use minObjV al(P artition. R) to denote the minimum objective function value that is computed for the partition and constraints R. the proof is given in the following. Lemma 1 The proposed query processing algorithm is I/O optimal for CP queries under convex hierarchical structures.For hierarchical tree based query processing. 29 . We need to prove only these partitions are accessed.e. Proof. C: intersects the optimal contour but not R. A: intersects neither R nor the optimal contour. R and the optimal contour under 2 dimensions.. other partitions will be pruned without accessing the pages. B: intersects R but not the optimal contour. E: intersects the intersection of the optimal contour and R. but not the intersection of optimal contour and R.

4: Cases of MBRs with respect to R and optimal objective function contour A and C will be pruned since they are infeasible regions. Let CP0 be the data partition that contains an optimal data point. . ≥ minObjV al(CPk . CP1 . We shall show that a page in case D is pruned by the algorithm. and CPk be the root partition that contains CP0 . CPk−1 . R) ≥ minObjV al(CP1 . . B is pruned because minObjV al(B.Figure 2. .. we have F (a) ≥ minObjV al(CP0 . when the CP for these partitions are solved. they will be eliminated from consideration since they do not intersect the constrained region and minObjV al(A. CP1 be the partition that contains CP0 . . R) = minObjV al(C. R) > F (a) and E will be accessed earlier than B in the algorithm because minObjV al(E. R) < minObjV al(B. R) ≥ . Because the objective function is convex. R) 30 . . R) = +∞. . R).

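The core of the Lemma 1 argument is the chain of bounds below, which follows from the convexity of F for the nested partitions CP0 ⊆ CP1 ⊆ ... ⊆ CPk containing an optimal data point a (notation as in the proof above):

```latex
F(a) \;\ge\; \mathrm{minObjVal}(CP_0, R) \;\ge\; \mathrm{minObjVal}(CP_1, R) \;\ge\; \cdots \;\ge\; \mathrm{minObjVal}(CP_k, R)
```

Consequently, a partition D with minObjVal(D, R) > F(a) is ordered behind every ancestor of CP0 in the worklist, so the optimal point a is found before D reaches the head of the list and D is pruned without being accessed.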
R) is larger than constrained function value of any partition containing the optimal data point. R) > F (a) ≥ minObjV al(CP0 .4. Partitions A. whether it contains data or other partitions. B. D. This only can occur after CP0 has been accessed because minObjV al(D.and minObjV al(D. unless that MBR could contain an optimal answer. This contradicts with the assumption that D will be accessed. If CP0 is accessed earlier than D. Similarly. If D will be accessed at some point during the search. however. then D must be the first in the MBR list at some time. R) ≥ minObjV al(CP1 . Lemma 2 The proposed query processing algorithm is I/O optimal for CP queries over non-hierarchical convex structures. that is. and so on. the algorithm is also optimal in terms of the access structure traversal. we will never retrieve the objects within an MBR (leaf or non-leaf). 31 . R) During the search process of our algorithm. . R) ≥ . . Figure 2. However. we can prove the I/O optimality of proposed algorithm over non-hierarchical structures. ≥ minObjV al(CPk . and E are the same as defined for Figure 2. including D. The I/O-optimality relates to accessing data from leaf partitions. the algorithm prunes all partitions which have a constrained function value larger than F (a). and CPk−1 by CPk−2 . Proof. C. CPk is replaced by CPk−1 . until CP0 is accessed.5 shows different possible cases of the partition position. The proof applies to any partition.

and the best data point found in E will prune partition B. Then an optimal algorithm retrieves only partition E. y). for a query F (x.Figure 2. a.5: Cases of partitions with respect to R and optimal objective function contour Without loss of generality. and partition E contains an optimal data point. we only need to show that partition D will not be retrieved. R)). assume that only these five partitions contain some data points. Partition B will not be retrieved because partition E will be retrieved before partition B (because minObjV al(E. 32 . R) < minObjV al(B. Thus. Partitions C and A will not be retrieved because they are infeasible and our algorithm will never retrieve infeasible partitions.

i. the virtual solution corresponding to (minObjV al(D. F (a) ≥ minObjV al(D.Because partition D does not intersect the intersection of the optimal contour and R but E does. Suppose partition D is also retrieved later. R). Weighted Constrained kNN. Hence. This contradicts the assumption that partition D does not intersect the intersection of R and the optimal contour.000 33 . Therefore. a 64-dimensional color image histogram dataset of 100. we compare cpu times against a non-general optimization technique. R) > minObjV al(E. This means partition D cannot be pruned by data point a.e. and the data point a is found at that time. R)) is contained by the optimal contour (because of the convexity of function F ). 2. R). We index and perform que-ries over four real data sets.6 Performance Evaluation The goals of the experiments are to show i) that the proposed framework is general and can be used to process a variety of queries that optimize some objective ii) that the generality of the framework does not impose significant cpu burden over those queries that already have an optimal solution. One of these is Color Histogram data. and Constrained Weighted Linear Optimization. which also means the virtual solution corresponding to (minObjV al(D. partition E must be retrieved before partition D since the algorithm always retrieves the most promising partition first. Therefore. we can find at least one virtual point x∗ (without considering the database) in partition D such that x∗ is both in R and the optimal contour. We performed experiments for a variety of optimization query types including kNN. and iii) that non-relevant data subspace can be effectively pruned during query processing. minObjV al(D. Weighted kNN. R)) ∈ D ∩ R∩optimal contour. Where possible..

unweighted nearest neighbor queries. as the weights of a weighted nearest neighbor query become more skewed. our algorithm matches the I/O-optimal access offered by traditional R-tree nearest neighbor branch and bound searching. For unconstrained. 34 . the queries are weighted nearest neighbor queries. Weighted kNN Queries Figure 2. We execute 100 queries for each trial. We vary the value of k between 1 and 100. the optimal contour is a hyper-ellipse.data points. These datasets are widely used for performance evaluation of index structures and similarity search algorithms [39.000 60-dimensional vectors representing texture feature vectors of Landsat images [40]. the optimal contour is a hyper-sphere around the query point with a radius equal to the distance of the nearest point in the data set. We assume the index structure is in memory and we measure the number of additional page accesses required to access points that could be kNN according to the query processing. Intuitively. If instead. The figure shows that we achieve nearly the same I/O performance for weighted kNN queries using our query processing algorithm over the same index that is appropriate for traditional non-weighted kNN queries. The vectors represent color histograms computed from a commercial CD-ROM. and randomly assign weights between 0 and 1 for the k-WNN queries. 26]. The second is Satellite Image Texture (Landsat) data. which consists of 100.6 shows the performance of our query processing over R*-trees for the first 8-dimensions of the histogram data for both kNN and k weighted NN (k-WNN) queries. Because the weights are 1 for the kNN queries. We used a clinical trials patient data set and a stock price data set to perform real world optimization queries. The index is built over the first 8 dimensions of the histogram data and the queries are generated to find the kNN or k-WNN to a random point in the data space over the same 8 dimensions.

the hyper-volume enclosed by the optimal contour increases, and the opportunity to intersect with access structure partitions increases. Results for the weighted queries track very closely to the non-weighted queries. This indicates that these weighted functions do not cause the optimal contour to cross significantly more access structure partitions than their non-weighted counterparts.

Figure 2.6: Accesses, Random weighted kNN vs kNN, R*-tree, 8-D Histogram, 100k pts

For a weighted nearest neighbor query, we could generate an index based on attribute values transformed according to the weights and yield an optimal contour that was hyper-spherical. We could then traverse the access structure using traditional nearest neighbor techniques. However, this would require generating an index for every potential set of query weights, which is not practical for applications where weights are not typically known prior to the

query. For similar applications where query weights can vary, and these weights are not known in advance, we would be better served to process the query over a single reasonable access structure in an I/O-optimal way than by building specific access structures for a specific set of weights.

Figure 2.7 shows the average CPU times required per query to traverse the access structure using convex problem solutions for the weighted kNN queries compared to the standard nearest neighbor traversal used for unweighted kNN queries.

Figure 2.7: CPU Time, Random weighted kNN vs kNN, R*-tree, 8-D Histogram, 100k pts
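For the weighted Euclidean objectives used in these experiments, the convex problem solved per partition has a simple closed form: clamp the query point to the partition's bounding box. The sketch below illustrates how such a best-first traversal could be organized; the node layout (`is_leaf`, `children`, `lo`, `hi`, `points`) and the function names are illustrative assumptions, not the implementation evaluated here.

```python
import heapq
from itertools import count

def min_weighted_dist2(query, weights, lo, hi):
    """Lower bound of the weighted squared Euclidean distance from `query`
    to any point inside the axis-aligned box [lo, hi]: clamp the query to
    the box, dimension by dimension (the closed form of the per-partition
    convex problem for this objective)."""
    return sum(w * (q - min(max(q, l), h)) ** 2
               for q, w, l, h in zip(query, weights, lo, hi))

def weighted_knn(root, query, weights, k):
    """Best-first traversal: always expand the partition with the smallest
    lower bound, and stop once k answers beat every pending bound."""
    tie = count()                       # avoids comparing nodes on bound ties
    heap = [(0.0, next(tie), root)]
    results = []                        # sorted list of (objective, point)
    while heap:
        bound, _, node = heapq.heappop(heap)
        if len(results) == k and bound >= results[-1][0]:
            break                       # no pending partition can improve kNN
        if node.is_leaf:
            for p in node.points:
                val = sum(w * (a - b) ** 2
                          for w, a, b in zip(weights, query, p))
                results.append((val, p))
                results.sort(key=lambda r: r[0])
                results = results[:k]
        else:
            for child in node.children:
                heapq.heappush(heap, (min_weighted_dist2(query, weights,
                                                         child.lo, child.hi),
                                      next(tie), child))
    return results
```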

These measurements show that the generic CP-driven solution takes slightly longer to perform than a specialized solution for nearest neighbor queries alone. Part of this difference is due to the slightly more complicated objective function (weighted versus unweighted Euclidean distance), while the remaining part is due to incorporating data set constraints (unused in this example) into computing minimum partition function values. Similar to the R*-tree case, the results show that the performance of the algorithm is not significantly affected by re-weighting the objective function.

Hierarchical data partitioning access structures are not effective for high-dimensional data sets. As data set dimensionality increases, the partitions overlap with each other more. As partitions overlap more, the probability that a point's objective function value can prune other partitions decreases. Therefore, the same characteristics that make high-dimensional R-tree type structures ineffective for traditional queries also make them ineffective for optimization queries. For this reason, we perform similar experiments for high-dimensional data using VA-files.

Figure 2.8 shows page access results for kNN and k-WNN queries over a 60-dimensional, 5-bit per attribute VA-file built over the Landsat dataset. The experiments assume the VA-file is in memory and measure the number of objects that need to be accessed to answer the query. Sequential scan would result in examining 100,000 objects. If the dataspace were more uniformly populated, we would expect to see object accesses much closer to k. However, much of the data in this data set is clustered, and some vector representations are heavily

populated. If a best answer comes from one of these vector representations, we need to look at each object assigned the same vector representation. A consequence of VA-File access structures is that, without some other data reorganization, a computation needs to occur for every approximated object. As distance computations need to be performed for each vector in traditional VA-file nearest neighbor algorithms, so too must a CP problem be solved for each vector for general optimization queries, following the VA-file processing model. It should be noted that while the framework results in optimal access of a given structure, it can not overcome the inherent weaknesses of that structure. For example, at high dimensionality, hierarchical data structures are ineffective, and the framework can not overcome this.

Figure 2.8: Accesses, Random weighted kNN vs kNN, VA-File, 60-D Sat, 100k pts
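Following the same idea for the non-hierarchical case, a VA-file style scan can compute one lower bound per vector approximation (one small convex problem per cell, as noted above) and fetch exact vectors only while a bound can still beat the current k-th best. The sketch below is illustrative; the data layout and helper names are assumptions, not the actual implementation.

```python
def cell_lower_bound(query, weights, lo, hi):
    """Weighted squared distance from the query to the nearest point of the
    cell [lo, hi]; the per-cell convex problem has this closed form here."""
    return sum(w * (q - min(max(q, l), h)) ** 2
               for q, w, l, h in zip(query, weights, lo, hi))

def va_filter(approx, cell_bounds, query, weights, k, fetch_vector):
    """Two-phase VA-file style filtering for a weighted NN objective.
    `approx` maps object id -> cell id, `cell_bounds` maps cell id -> (lo, hi),
    and `fetch_vector` retrieves the exact vector (the expensive access)."""
    bounds = sorted((cell_lower_bound(query, weights, *cell_bounds[c]), oid)
                    for oid, c in approx.items())
    best = []                                   # (exact objective, object id)
    for bound, oid in bounds:
        if len(best) == k and bound >= best[-1][0]:
            break                               # no remaining cell can improve
        vec = fetch_vector(oid)
        val = sum(w * (a - b) ** 2 for w, a, b in zip(weights, query, vec))
        best.append((val, oid))
        best.sort()
        best = best[:k]
    return best
```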

In terms of CPU time, it takes longer to perform each query, but the ratio of the general optimization query times to the traditional kNN CPU times is similar to the R*-tree case, about 1.006 (i.e., very little computational burden is added to allow query generality).

Constrained Weighted Linear Optimization Queries

We also explore Constrained Weighted-Linear Optimization queries. Figure 2.9 shows the number of page accesses required to answer randomly weighted LP minimization queries for the histogram data set using the 8-D R*-Tree access structure. We generated 100 queries for each data point in the form of a summation of weighted attributes over the 8 dimensions. Weights are uniformly randomly generated and are between the values of -10 and 10. We vary k and we vary the constrained area, a region fixed at the origin as an 8-D hypercube with the indicated side length. When weights can be negative, the corner corresponding to the minimum constraint value for the positively weighted attributes and the maximum constrained value for the negatively weighted attributes will yield the optimal result within the constrained region (e.g., corner [0,1,0] would be an optimal minimization answer for a constrained unit cube and a linear function with a weight vector [+,-,+]).

The graph shows interesting results. The number of page accesses is not directly related to the constraint size. For this data set, many points are clustered around the origin, leading to denser packing of access structure partitions near the origin. Because of the sparser density of partitions around the point that optimizes the query, we access fewer partitions when this particular problem is less constrained. However, the rate of increase in page accesses to accommodate different values of k does vary with the constraint size. The results for this experiment reflect this fact: the radius of the k-th optimal contour increases more in these less dense regions than in

more clustered regions, leading to a faster increase in the number of page accesses required when k increases. For comparative purposes, the sequential scan of this data set is 834 pages. Note that the Onion technique is not feasible for this dataset dimensionality, and it, as well as the other competing techniques, is not appropriate for convex constrained areas.

Figure 2.9: Page Accesses, Random Weighted Constrained k-LP Queries, R*-Tree, 8-D Color Histogram, 100k pts

We should also note what happens when there are less than k optimal answers in the data set, i.e., when there are less than k objects in our constrained region. For our query processing, we only need to examine those points that are in partitions

that intersect the constrained region. A technique that did not evaluate constraints prior to or during query processing would need to examine every point.

Queries with Real Constraints

The previous example showed results with synthetically generated constraints. We also wanted to explore constrained optimization problems for real applications. We performed Weighted Constrained kNN experiments to emulate a patient matching data exploration task described in Section 2.3.3. To test this, we performed a number of queries over a set of 17000 real patient records. We established constraints on age and gender and queried to find the 10 weighted nearest neighbors in the data set according to 7 measured blood analytes. Using the VA-file structure with 5 bits assigned per attribute and 100 randomly selected, randomly weighted queries, we achieved an average of 11.5 vector accesses to find the k = 10 results. This means that on average the number of vectors within the intersection of the constrained space and the 10th-optimal contour is 11.5. It corresponds to a vector access ratio of 0.0007.

Arbitrary Functions over Different Access Structures

In order to show that the framework can be used to find optimal data points for arbitrary convex functions, we generated a set of 100 random functions. Unlike nearest neighbor and linear optimization type queries, the continuous optimal solution is not intuitively obvious by examining the function. Functions contain both quadratic and linear terms and were generated over 5 dimensions in the form $\sum_{i=1}^{5}(\mathrm{random}() \cdot x_i^2 - \mathrm{random}() \cdot x_i)$, where $x_i$ is the data attribute value for the $i$th dimension. A sample function, where coefficients are rounded to two decimal places, is to minimize $0.55x_1^2 - 0.09x_1 + 0.49x_2^2 - 0.76x_2 + 0.4x_3^2 - 0.28x_3 + 0.91x_4^2 - 0.91x_4 + 0.49x_5^2 - 0.51x_5$.
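A minimal sketch of how such a random test function could be drawn; the helper name is illustrative. Non-negative quadratic coefficients keep each function convex, which is what the framework requires.

```python
import random

def random_convex_function(dims=5):
    """One test function of the form sum_i (a_i * x_i^2 - b_i * x_i),
    with a_i, b_i drawn from [0, 1); a_i >= 0 keeps the function convex."""
    a = [random.random() for _ in range(dims)]
    b = [random.random() for _ in range(dims)]
    def f(x):
        return sum(ai * xi * xi - bi * xi for ai, bi, xi in zip(a, b, x))
    return f, a, b

# Example: build a 100-function workload like the one used in the experiments.
workload = [random_convex_function(5) for _ in range(100)]
```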

We explored the results of our technique over 5 index structures. These include 3 hierarchical structures, including the R-tree, the R*-tree, and the VAM-split bulk-loaded R*-tree. We explored 2 non-hierarchical structures, the grid file and the VA-file. For each of these index types, we modified code to include an optimization function that uses CP to traverse the index structure using the algorithm appropriate for the structure. We added a new set of 50000 uniform random data points within the 5-dimension hypercube and built each of the index structures over this data set. For each structure we try to keep the number of partitions relatively close to each other. For the hierarchical structures these are leaf partitions and for the non-hierarchical structures they are grid cells. The number of partitions for the hierarchical structures is between 12000 and 13000, the grid file 16000, and the VA-File 32000. We ran the set of random functions over each structure and measured the number of objects in the partitions that we retrieve.

Figure 2.10 shows results over the 100 functions. It shows the minimum, maximum, and average object access ratio to guarantee discovery of the optimal answer over the set of functions. Each index yielded the same correct optimal data point for each function, and these optimal answers were scattered throughout the data space. Each structure performed well for these optimization queries; the average ratio of the data points accessed is below 0.0025 for all the access structures. Even the worst performing structure over the worst case function still prunes over 99% of the search space. Results between hierarchical access structures are as expected. The R*-tree looks to avoid some of the partition overlaps produced by the R-tree insertion heuristics and it shows better performance. The VAM-split R*-tree bulk loads the data into the

index and can avoid more partition overlaps than an index built using dynamic insertion. As expected, this yields better performance than the dynamically constructed R*-tree. The VA-File has greater resolution than the grid-file in this particular test, and therefore achieves better access ratios.

Figure 2.10: Accesses vs. Index Type, Optimizing Random Function, 5-D Uniform Random, 50k Points

Incorporating Problem Constraints during Query Processing

The proposed framework processes queries in conjunction with any set of convex problem constraints. We compare the proposed framework to alternative methods for handling constraints in optimization query processing. For a constrained NN type query we could process the query according to traditional NN techniques and, when we find a result, check it against the constraints until we find a point that meets the

constraints, and we can terminate the search. Alternatively, we could perform a query on the constraints themselves, and then perform distance computations on the result set and select the best one.

Figure 2.11 shows the number of pages accessed using our technique compared to these two alternatives as the constrained area varies. Constrained kNN queries were performed over the 8-D R*-tree built for the Color Histogram data set.

Figure 2.11: Page Accesses vs. Constraints, NN Query, R*-Tree, 8-D Histogram, 100k Points

In the figure, the R*-tree line shows the number of page accesses required when we use traditional NN techniques and then check if the result meets the constraints. As the problem becomes less constrained, a discovered NN is more likely to fall within the constraints. The Selectivity line is a

lower bound on the number of pages that would be read to obtain the points within the constraints. Clearly, as the problem becomes more constrained, fewer pages need to be read to find potential answers. The CP line shows the results for our framework, which processes original problem constraints as part of the search process and does not further process any partitions that do not intersect constraints. Our search drills down to the leaf partition that either contains the optimal answer within the problem constraints or yields the best potential answer within constraints. When there are problem constraints, we prune search paths that can not lead to feasible solutions. We demonstrate better results with respect to page accesses except in cases where the constraints are very small (and we can prune much of the space by performing a query over the constraints first) or when the NN problem is not constrained (where our framework reduces to the traditional unconstrained NN problem).

2.7 Conclusions

In this chapter we present a general framework to model optimization queries. We present a query processing framework to answer optimization queries in which the optimization function is convex, query constraints are convex, and the underlying access structure is made up of convex partitions. We provide specific implementations of the processing framework for hierarchical data partitioning and non-hierarchical space partitioning access structures. We prove the I/O optimality of these implementations. We experimentally show that the framework can be used to answer a number of different types of optimization queries including nearest neighbor queries, linear function optimization, and non-linear function optimization queries.

The ability to handle general optimization queries within a single framework comes at the price of increasing the computational complexity when traversing the access structure. We show a slight increase in CPU times in order to perform convex programming based traversal of access structures in comparison with equivalent non-weighted and unconstrained versions of the same problem.

The proposed framework offers I/O-optimal access of whatever access structure is used for the query. This means that given an access structure we will only read points from partitions that have minimum objective function values lower than the k-th actual best answer. Essentially, we only access points in partitions that could potentially have better answers than the actual optimal database answers. Furthermore, partitions and search paths that do not intersect with problem constraints are pruned during the access structure traversal. Since constraints are handled during the processing of partitions in our framework, we do not need to examine points that do not intersect with the constraints. This results in a significant reduction in required accesses compared to alternative methods in the cases where the inclusion of constraints makes these alternatives less effective.

Because the framework is built to best utilize the access structure, it captures the benefits as well as the flaws of the underlying structure. If the optimal contour and problem constraints can be isolated to a few leaf partitions, the access structure will yield nice results. However, if the optimal contour crosses many partitions, the performance will not be as good. As such, the framework demonstrates how well a particular access structure isolates the optimal contour within the problem constraints to access structure partitions. It can therefore be used to measure page access performance associated with using different indexes and index types to

answer certain classes of optimization queries. Database researchers and administrators can use this technique as a benchmarking tool to evaluate the performance of a wide range of index structures available in the literature, in order to determine which structures can most effectively answer the optimization query type.

The technique provides optimization of arbitrary convex functions, and does not incur a significant penalty in order to provide this generality. The system does not need to compute a queried function for all data objects in order to find the optimal answers. To use this framework, one does not need to know the objective function, weights, or constraints in advance. This makes the framework appropriate for applications and domains where a number of different functions are being optimized or when optimization is being performed over different constrained regions and the exact query parameters are not known in advance.

CHAPTER 3

Online Index Recommendations for High-Dimensional Databases using Query Workloads

When solving problems, dig at the roots instead of just hacking at the leaves.
-- Anthony J. D'Angelo

In order to effectively prune search space during query resolution, access structures that are appropriate for the query must be available to the query processor at query run-time. Appropriate access structures should match well with query selection criteria and allow for a significant reduction in the data search space. This chapter describes a framework to determine a meaningful set of appropriate access structures for a set of queries considering query cost savings estimated from both query attributes and data characteristics. Since query patterns can change over time, rendering statically determined access structures ineffective, a method for monitoring performance and recommending access structure changes is provided.

3.1 Introduction

An increasing number of database applications, such as business data warehouses and scientific data repositories, deal with high dimensional data sets. As the number of dimensions/attributes and overall size of data sets increase, it becomes essential to efficiently retrieve specific queried data from the database in order to effectively utilize

the database. Indexing support is needed to effectively prune out significant portions of the data set that are not relevant for the queries. An ideal solution would allow us to read from disk only those pages that contain matching answers to the query. Multi-dimensional indexing, dimensionality reduction, and RDBMS index selection tools all could be applied to the problem. However, each of these potential solutions has inherent problems.

To illustrate these problems, consider a uniformly distributed data set of 1,000,000 data objects with several hundred attributes. Range queries are consistently executed over five of the attributes. The query selectivity over each attribute is 0.1, so the overall query selectivity is 1/10^5 (i.e., the answer set contains about 10 results). We could build a multi-dimensional index over the data set so that we can directly answer any query by only using the index. However, for high-dimensional data sets, performance of multi-dimensional index structures is subject to Bellman's curse of dimensionality [4] and degrades rapidly as the number of dimensions increases. For the given example, such an index would perform much worse than sequential scan. Another possibility would be to build an index over each single dimension. The effectiveness of this approach is limited to the amount of search space that can be pruned by a single dimension (in the example the search space would only be pruned to 100,000 objects). For data-partitioning indexes such as the R-tree family of indexes, data is placed in a partition that contains the data point and could overlap with other partitions. To answer a query, all potentially matching search paths must be explored. As the dimensionality of the index increases, the overlaps between partitions increase and

at high enough dimensions the entire data space needs to be explored. This phenomenon is thoroughly described in [54]. For space-partitioning structures, where partitions do not overlap and data points are associated with cells which contain them (e.g., grid files), the problem is the exponential explosion of the number of cells. A 100-dimensional index with only a single split per attribute results in 2^100 cells. The index structure can become intractably large just to identify the cells.

Another possible solution is to apply feature selection to keep the most important attributes of the data according to some criteria and index the reduced dimensionality space. However, traditional feature selection techniques are based on selecting attributes that yield the best classification capabilities. Therefore, the selected features may offer little or no data pruning capability given query attributes. Another possible solution would be to use some dimensionality reduction technique, index the reduced dimension data space, and transform the query in the same way that the data was transformed. However, the dimensionality reduction approaches are mostly based on data statistics and perform poorly especially when the data is not highly correlated. They also introduce a significant overhead in the processing of queries. As well, they select attributes based on data statistics to support classification accuracy rather than focusing on query performance and workload in a database domain.

Commercial RDBMS's have developed index recommendation systems to identify indexes that will work well for a given workload. These tools are optimized for the domains for which these systems are primarily employed and the indexes that the systems provide. They are targeted towards lower dimension transactional databases and do not produce results optimized for single high-dimensional tables.

Researchers have proposed workload based index recommendation techniques. However, their long term effectiveness is dependent on the stability of the query workload.

Our approach is based on the observation that in many high dimensional database applications, only a small subset of the overall data dimensions are popular for a majority of queries and recurring patterns of queried dimensions occur. For example, Large Hadron Collider (LHC) experiments are expected to generate data with up to 500 attributes at the rate of 20 to 40 per second [45]. The queries are predominantly range queries and involve mostly around 5 dimensions out of a total of 200. Another example is High Energy Physics (HEP) experiments [49], where sub-atomic particles are accelerated to nearly the speed of light, forcing their collision. Each such collision generates on the order of 1-10MBs of raw data, which corresponds to 300TBs of data per year consisting of 100-500 million objects. The search criterion is expected to consist of 10 to 30 parameters.

We address the high dimensional database indexing problem by selecting a set of lower dimensional indexes based on joint consideration of query patterns and data statistics. The new set of low dimensional indexes is designed to address a large portion of expected queries and allow effective pruning of the data space to answer those queries. This approach is also analogous to dimensionality reduction or feature selection, with the novelty that the reduction is specifically designed for reducing query response times, rather than maintaining data energy as is the case for traditional approaches. Our reduction considers both data and access patterns, and results in multiple and potentially overlapping sets of dimensions, rather than a single set.

Query pattern evolution over time presents another challenging problem. Query access patterns may change over time, becoming completely dissimilar from the

patterns on which the index set was originally determined. There are many common reasons why query patterns change. Pattern change could be the result of periodic time variation (e.g., different database uses at different times of the month or day), a change in the popularity of a search attribute (e.g., current events cause an increase in queries for certain search attributes), a change in the focus of user knowledge discovery (e.g., a researcher discovery spawns new query patterns), or simply random variation of query attributes. When the current query patterns are substantially different from the query patterns used to recommend the database indexes, the system performance will degrade drastically since incoming queries do not benefit from the existing indexes. For this reason, the index set should evolve with the query patterns. To make this approach practical in the presence of query pattern change, we introduce a dynamic mechanism to detect when the access patterns have changed enough that either the introduction of a new index, the replacement of an existing index, or the construction of an entirely new index set is beneficial.

Because of the need to proactively monitor query patterns and query performance quickly, the index selection technique we have developed uses an abstract representation of the query workload and the data set that can be adjusted to yield faster analysis. We generate this abstract representation of the query workload by mining patterns in the workload. The query workload representation consists of, for each query, a set of attribute sets that occur frequently over the entire query set and that have non-empty intersections with the attributes of the query. To estimate the query cost, the data set is represented by a multi-dimensional histogram where each unique value represents an approximation of data and contains a count of the number of

records that match that approximation. The size of the multi-dimensional histogram affects the accuracy of the cost estimates associated with using an index for a query. For each possible index for each query, the estimated cost of using that index for the query is computed. Initial index selection occurs by traversing the query workload representation and determining which frequently occurring attribute set results in the greatest benefit over the entire query set. This process is iterated until some indexing constraint is met or no further improvement is achieved by adding additional indexes.

In order to facilitate online index selection, we propose a control feedback system with two loops, a fine grain control loop and a coarse control loop. As new queries arrive, we monitor the ratio of potential performance to actual performance of the system in terms of cost and, based on the parameters set for the control feedback loops, we make major or minor changes to the recommended index set.

The contributions of this work can be summarized as follows:

1. Introduction of a flexible index selection technique designed for high-dimensional data sets that uses an abstract representation of the data set and query workload. Analysis speed and granularity is affected by tuning the resolution of the abstract representations. The number of potential indexes considered is affected by adjusting the data mining support level. The resolution of the abstract representation can be tuned to achieve either a high ratio of index-covered queries for static index selection or fast index selection to facilitate online index selection.

2. Introduction of a technique using control feedback to monitor when online query access patterns change and to recommend index set changes for high-dimensional data sets.

3. Presentation of a novel data quantization technique optimized for query workloads.

4. Experimental analysis showing the effects of varying abstract representation parameters on static and online index selection performance and showing the effects of varying control feedback parameters on change detection response.

Our work differs from the related index selection work in that we provide an index selection framework that can be tuned for speed or accuracy. It takes into consideration both data and query characteristics and can be applied to perform real-time index recommendations for evolving query patterns. Our technique is optimized to take advantage of multi-dimensional pruning offered by multi-dimensional index structures.

The rest of the chapter is organized as follows. Section 3.2 presents the related work in this area. Section 3.3 explains our proposed index selection and control feedback framework. Section 3.4 presents the empirical analysis. We conclude in Section 3.5.

3.2 Related Work

High-dimensional indexing, feature selection, and DBMS index selection tools are possible alternatives for addressing the problem of answering queries over a subspace of high-dimensional data sets. As described below, each of these methods provides less than ideal solutions for the problem of fast high-dimensional data access.

3.2.1 High-Dimensional Indexing

A number of techniques have been introduced to address the high-dimensional indexing problem, such as the X-tree [6] and the GC-tree [28]. While these index

structures have been shown to increase the range of effective dimensionality, they still suffer performance degradation at higher index dimensionality.

3.2.2 Feature Selection

Feature selection techniques [7, 29] are a subset of dimensionality reduction targeted at finding a set of untransformed attributes that best represent the overall data set. As a result, selected features may have no overlap with queried attributes. These techniques are also focused on maximizing data energy or classification accuracy rather than query response.

3.2.3 Index Selection

The index selection problem has been identified as a variation of the Knapsack Problem, and several papers proposed designs for index recommendations [34, 17, 24, 55, 13] based on optimization rules. These earlier designs could not take advantage of modern database systems' query optimizer. Currently, almost every commercial Relational Database Management System (RDBMS) provides the users with an index recommendation tool based on a query workload and using the query optimizer to obtain cost estimates. A query workload is a set of SQL data manipulation statements. The query workload should be a good representative of the types of queries an application supports. Microsoft SQL Server's AutoAdmin tool [15, 16, 36, 1] selects a set of indexes for use with a specific data set given a query workload. In the AutoAdmin algorithm, an iterative process is utilized to find an optimal configuration. First, single dimension candidate indexes are chosen. Then a candidate index selection step evaluates the queries in a given query workload and eliminates from consideration those candidate

indexes which would provide no useful benefit. Remaining candidate indexes are evaluated in terms of estimated performance improvement and index cost. The cost function computed by the query optimizer is used to calculate the benefit of using the index. The process is iterated for increasingly wider multi-column indexes until a maximum index width threshold is reached or an iteration yields no improvement in performance over the last iteration. For DB2, IBM has developed the DB2Adviser [52], which recommends indexes with a method similar to AutoAdmin, with the difference that only one call to the query optimizer is needed since the enumeration algorithm is inside the optimizer itself. MAESTRO (METU Automated indEx Selection Tool) [19] was developed on top of Oracle's DBMS to assist the database administrator in designing a complete set of primary and secondary indexes by considering the index maintenance costs based on the valid SQL statements and their usage statistics automatically derived using SQL Trace Facility during a regular database session. The SQL statements are classified by their execution plan and their weights are accumulated.

Costs are estimated using the query optimizer, which is limited to considering those physical designs offered by the DBMS. In the case of SQL Server, single and multi-level B+-trees are evaluated. These index structures can not achieve the same level of result pruning that can be offered by an index technique that indexes multiple dimensions simultaneously (such as an R-tree or grid file). As a result, the indexes suggested by the tool often do not capture the query performance that could be achieved for multidimensional queries.

These commercial index selection tools are coupled to physical design options provided by their respective query optimizers and therefore do not reflect the pruning that could be achieved by indexing multiple dimensions together. Microsoft Research has proposed a physical design alerter [11] to identify when a modification to the physical design could result in improved performance.

3.2.4 Automatic Index Selection

The idea of having a database that can tune itself by automatically creating new indexes as the queries arrive has been proposed [35, 18]. In [35] a cost model is used to identify beneficial indexes and decide when to create or drop an index at runtime. [18] proposes an agent-based database architecture to deal with automatic index creation.

3.3 Approach

3.3.1 Problem Statement

In this section we define the problem of index selection for a multi-dimensional space using a query workload. A query workload W consists of a set of queries that select objects within a specified subspace in the data domain. More formally, we define a workload as follows:

Definition 4 A workload W is a tuple W = (D, DS, Q), where D is the domain, DS ⊆ D is a finite subset (the data set), and Q (the query set) is a set of subsets of DS. In our instance, the domain is R^d, where d is the dimensionality of the dataset, DS is a set of n tuples t = {(a_1, a_2, ..., a_d) | a_i ∈ R}, and the query set Q is a set of range queries.

Finding the answers to a query, when no index is present, reduces to scanning all the points in the dataset and testing whether the query conditions are met. In this scenario we can define the cost of answering the query as the time it takes to scan the dataset, i.e., the time to retrieve the data pages from disk. In the case that an index is present, the cost of answering the query can be lower. The index can identify a smaller set of potential matching objects, and only those data pages containing these objects need to be retrieved from disk. The assumption is that the time spent doing I/O dominates the time required to perform the simple bound comparisons. The degree to which an index prunes the potential answer set for a query determines its effectiveness for the query. In the context of this problem an index is considered to be the set of attributes that can be used to prune subspace simultaneously with respect to each attribute. Therefore, attribute order has no impact on the amount of pruning possible.

The overall goal of this work is to develop a flexible index selection framework that can be tuned to achieve effective static and online index selection for high-dimensional data under different analysis constraints. Our problem can be defined as finding a set of indexes I, given a multi-dimensional dataset DS, a query workload W, an optional indexing constraint C (either as the maximum number of indexes or a size constraint), and an optional analysis time constraint ta, that provides the best estimated cost over W. For static index selection, when no constraints are specified, we are aiming to recommend the set of indexes that yields the lowest estimated cost for every query in a workload, for any query that can benefit from an index. In the case when a constraint is specified, we want to recommend a set of indexes within the constraint from which the queries

can benefit the most. For online index selection, we are striving to develop a system that can recommend an evolving set of indexes for incoming queries over time, such that the benefit of index set changes outweighs the cost of making those changes. Therefore, we want an online index selection system that differentiates between low-cost index set changes and higher cost index set changes and also can make decisions about index set changes based on different cost-benefit thresholds. When there is a time constraint, we need to automatically adjust the analysis parameters to increase the speed of analysis.

3.3.2 Approach Overview

In order to measure the benefit of using a potential index over a set of queries, we need to be able to estimate the cost of executing the queries, with and without the index. Typically, a cost model is embedded into the query optimizer to decide on the query plan, i.e., whether the query should be answered using a sequential scan or using an existing index. Instead of using the query optimizer to estimate query cost, we conservatively estimate the number of matches associated with using a given index by using a multi-dimensional histogram abstract representation of the dataset. The histogram captures data correlations between only those attributes that could be represented in a selected index. Increasing the size of the multi-dimensional histogram enhances the accuracy of the estimate at the cost of abstract representation size. The cost associated with an index is calculated based on the number of estimated matches derived from the histogram and the dimensionality of the index. While maintaining the original query information for later use to determine estimated query cost, we apply one abstraction to the query workload to convert each

query into the set of attributes referenced in the query. We perform frequent itemset mining over this abstraction and only consider those sets of attributes that meet a certain support to be potential indexes. By varying the support, we affect the speed of index selection and the ratio of queries that are covered by potential indexes. We further prune the analysis space using association rule mining by eliminating those subsets above a certain confidence threshold. Lowering the confidence threshold improves analysis time by eliminating some lower dimensional indexes from consideration, but can result in recommending indexes that cover a strict superset of the queried attributes.

All of the commercial index wizards work at design time. The Database Administrator (DBA) has to decide when to run this wizard and over which workload. The assumption is that the workload is going to remain static over time, and in case it does change, the DBA would collect the new workload and run the wizard again. Our technique differs from existing tools in the method we use to determine the potential set of indexes to evaluate and in the quantization-based technique we use to estimate query costs. The flexibility afforded by the abstract representation we use allows it to be used for infrequent index selection considering a broader analysis space or for frequent, online index selection. In the following two sections we present our proposed solution for index selection, which is used for static index selection and as a building block for the online index selection.
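The following sketch illustrates the mining step just described, which is formalized in the next subsection: compute the support of attribute sets that co-occur in queries, keep those above the support threshold, and prune a subset when a rule from it to the rest of a frequent superset reaches the confidence threshold. It is a simplified apriori-style illustration under assumed input formats, not the actual implementation.

```python
from itertools import combinations
from collections import Counter

def mine_potential_indexes(query_attr_sets, support, confidence, max_width=4):
    """Return candidate index attribute sets from a query workload.
    `max_width` is an illustrative cap on candidate index dimensionality."""
    n = len(query_attr_sets)
    counts = Counter()
    for attrs in query_attr_sets:
        attrs = tuple(sorted(attrs))
        for width in range(1, min(len(attrs), max_width) + 1):
            for subset in combinations(attrs, width):
                counts[subset] += 1

    # Keep attribute sets that meet the support threshold.
    frequent = {s: c for s, c in counts.items() if c / n >= support}

    # Prune a subset when some superset occurs nearly as often (high confidence).
    pruned = set()
    for itemset in frequent:
        for width in range(1, len(itemset)):
            for antecedent in combinations(itemset, width):
                if antecedent in frequent:
                    conf = frequent[itemset] / frequent[antecedent]
                    if conf >= confidence:
                        pruned.add(antecedent)
    return [set(s) for s in frequent if s not in pruned]

# Example: attributes 1 and 2 are almost always queried together,
# while attribute 2 is also queried alone quite often.
queries = [{1, 2}] * 40 + [{2}] * 45 + [{3, 4}] * 15
print(mine_potential_indexes(queries, support=0.10, confidence=0.6))
# {1} and {3}, {4} are pruned; {2}, {1, 2}, and {3, 4} remain.
```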

Figure 3.1: Index Selection Flowchart

3.3.3 Proposed Solution for Index Selection

The goal of the index selection is to minimize the cost of the queries in the workload given certain constraints. Given a query workload, a dataset, the indexing constraints, and several analysis parameters, our framework produces a set of suggested indexes as an output. We identify three major components in the index selection framework: the initialization of the abstract representations, the query cost computation, and the index selection loop. Figure 3.1 shows a flow diagram of the index selection framework. Table 3.1 provides a list of the notations used in the descriptions. In the following subsections we describe these components and the dataflow between them.

Initialize Abstract Representations

The initialization step uses a query workload and the dataset to produce a set of Potential Indexes P, a Query Set Q, and a Multi-dimensional Histogram H, according to the support, confidence, and histogram size specified by the user. The description of the outputs and how they are generated is given below.

Symbol          Description
P               Potential Set of Indexes: the set of attribute sets under consideration as suggested indexes
S               Suggested Indexes: the set of attribute sets currently selected as recommended indexes
Q               Query Set: a representation of the query workload; for each query it records the attribute sets in P that intersect with the query attributes, the query ranges, and estimated query costs
H               Multi-Dimensional Histogram: used as a workload-optimized abstract representation of the data set
i               attribute set currently under analysis
support         the minimum ratio of occurrence of an attribute set in a query workload to be included in P
confidence      the maximum ratio of occurrence of an attribute subset to the occurrence of a set before the subset is pruned from P
histogram size  the number of bits used to represent a quantized data object

Table 3.1: Index Selection Notation List

• Potential Index Set P

The potential index set P is a collection of attribute sets that could be beneficial as an index for the queries in the input query workload. This set is computed using traditional data mining techniques. Considering the attributes involved in each query from the input query workload to be a single transaction, P consists of the sets of attributes that occur together in a query at a ratio greater than the input support. Formally, the support of a set of attributes A is defined as:

$$S_A = \frac{\sum_{i=1}^{n} \begin{cases} 1 & \text{if } A \subseteq Q_i \\ 0 & \text{otherwise} \end{cases}}{n}$$

where Q_i is the set of attributes in the i-th query and n is the number of queries. For instance, if the input support is 10%, and attributes 1 and 2 are queried together in greater than 10 percent of the queries, then a representation of the set

of attributes {1,2} will be included as a potential index. Note that any subset of an attribute set that meets the support requirement necessarily meets the support as well, so such subsets are also included as potential indexes (in the example above, both {1} and {2} would be included). As the input support is decreased, the number of potential indexes increases.

In order to enhance analysis speed with limited effect on accuracy, the input confidence is used to prune the analysis space. If a set of attributes occurs nearly as often as one of its subsets, an index built over the subset alone is unlikely to provide much additional benefit over the query workload, so the subset can be pruned from consideration. Formally, the confidence of an association rule {set of attributes A} → {set of attributes B}, where A and B are disjoint, is defined as:

$$C_{A \rightarrow B} = \frac{\sum_{i=1}^{n} \begin{cases} 1 & \text{if } (A \cup B) \subseteq Q_i \\ 0 & \text{otherwise} \end{cases}}{\sum_{i=1}^{n} \begin{cases} 1 & \text{if } A \subseteq Q_i \\ 0 & \text{otherwise} \end{cases}}$$

where Q_i is the set of attributes in the i-th query and n is the number of queries. In our example, if every time attribute 1 appears, attribute 2 also appears, then the confidence of {1}→{2} = 1.0. If attribute 2 appears without attribute 1 as many times as it appears with attribute 1, then the confidence of {2}→{1} = 0.5. If we have set the confidence input to 0.6, then we will prune the attribute set {1} from P, but we will keep attribute set {2}.

The confidence input can contain more than one value depending on set cardinality; that is, we can set the confidence level based on attribute set cardinality. Since the cost of including extra attributes that are not useful for pruning increases with increased indexed dimensionality, we want to be more conservative with respect to pruning attribute subsets.

While the apriori algorithm was appropriate for the relatively low-attribute query sets in our domain, a more efficient algorithm such as the FP-Tree [30] could be applied if the attribute sets associated with queries are too large for the apriori technique to be efficient. While it is desirable to avoid examining a high-dimensional attribute set as a potential index, another possible solution in the case where a large number of attributes are frequent together would be to partition a large closed frequent itemset into disjoint subsets for further examination. Techniques such as CLOSET [44] could be used to arrive at the initial closed frequent itemsets.

• Query Set Q

The query set Q, the abstract representation of the query workload, is initialized by associating the potential indexes that could be beneficial for each query with that query. These are the indexes in the potential index set P that share at least one common attribute with the query. At the end of this step, each query has an identified set of possible indexes for that query.

• Multi-Dimensional Histogram H

An abstract representation of the data set is created in order to estimate the query cost associated with using each query's possible indexes to answer that query. This representation is in the form of a multi-dimensional histogram H. The input histogram size dictates the number of bits used to represent each unique bucket in the histogram. These bits are designated to represent only the single attributes that met the input support in the input query workload. If a single attribute does not meet the support, then it can not be part of an attribute set appearing in P, and there is no reason to sacrifice data representation resolution for attributes that will not be evaluated. The number of bits that each of the represented attributes gets is proportional to the log of that attribute's support. This gives more resolution to those attributes that occur more frequently in the query workload. Data for an attribute that has been assigned b bits is divided into 2^b buckets. In order to handle data sets with uneven data distribution, we define the ranges of each bucket so that each bucket contains roughly the same number of points. A single bucket represents a unique bit representation across all the attributes represented in the histogram. The histogram is built by converting each record in the data set to its representation in bucket numbers. As we process data rows, we only aggregate the count of rows with each unique bucket representation, because we are just interested in estimating

query cost. Note that we do not maintain any entries in the histogram for bit representations that have no occurrences, so we can not have more histogram entries than records and will not suffer from exponentially increasing the number of potential multi-dimensional histogram buckets for high-dimensional histograms. Note also that the multi-dimensional histogram is based on a scalar quantizer designed on data and access patterns, as opposed to just data in the traditional case. A higher accuracy in representation is achieved by using more bits to quantize the attributes that are more frequently queried.

Table 3.2 shows a simple multi-dimensional histogram example. This histogram covers 3 attributes and uses 1 bit to quantize attributes 2 and 3, and 2 bits to quantize attribute 1, assuming it is queried more frequently than the other attributes. In this example, for attribute 1, values 1 and 2 quantize to 00, 3 and 4 quantize to 01, 5-7 quantize to 10, and 8 and 9 quantize to 11. For attributes 2 and 3, values from 1-5 quantize to 0, and values from 6-10 quantize to 1. The '.'s in the 'Value' column denote attribute boundaries (i.e., attribute 1 has 2 bits assigned to it).

Sample Dataset
A1  A2  A3  Encoding
2   5   5   0000
4   8   3   0110
1   4   3   0000
6   7   1   1010
3   2   2   0100
2   2   6   0001
5   6   5   1010
8   1   4   1100
3   8   7   0111
9   3   8   1101

Histogram
Value    Count
00.0.0   2
01.1.0   1
10.1.0   2
01.0.0   1
00.0.1   1
11.0.0   1
01.1.1   1
11.0.1   1

Table 3.2: Histogram Example
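A minimal sketch of how such a workload-optimized histogram could be built and then used for conservative match estimation follows; the row/dictionary layouts and helper names are illustrative assumptions, not the actual implementation.

```python
from collections import Counter

def equi_depth_edges(values, bits):
    """Upper edges of 2**bits buckets holding roughly equal numbers of points."""
    values, buckets = sorted(values), 2 ** bits
    n = len(values)
    return [values[min(n - 1, (i + 1) * n // buckets)] for i in range(buckets - 1)]

def bucket_of(value, edges):
    for b, edge in enumerate(edges):
        if value <= edge:
            return b
    return len(edges)

def build_histogram(rows, attrs, bits):
    """Only the supported attributes `attrs` are represented, each with its
    own bit budget `bits[a]`; only non-empty encodings are stored with counts."""
    edges = {a: equi_depth_edges([r[a] for r in rows], bits[a]) for a in attrs}
    hist = Counter(tuple(bucket_of(r[a], edges[a]) for a in attrs) for r in rows)
    return hist, edges

def estimate_matches(hist, attrs, edges, query):
    """Conservative match count for a query given as {attr: (lo, hi)}:
    attributes outside the query (or outside the index) act as don't-cares,
    and a bucket is counted whenever its value range can overlap the query."""
    total = 0
    for code, count in hist.items():
        keep = True
        for pos, a in enumerate(attrs):
            if a not in query:
                continue
            lo_q, hi_q = query[a]
            b = code[pos]
            lo_b = edges[a][b - 1] if b > 0 else float("-inf")
            hi_b = edges[a][b] if b < len(edges[a]) else float("inf")
            if hi_q < lo_b or lo_q > hi_b:
                keep = False
                break
        if keep:
            total += count
    return total
```

Because a whole bucket is counted whenever its range can overlap the query, the estimate is never smaller than the true number of matches, which is the conservative behavior described below for the cost calculation.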

Query Cost Calculation

Once generated, the abstract representations of the query set Q and the multi-dimensional histogram H are used to estimate the cost of answering each query using all possible indexes for the query. For each query in the query set representation, we also keep a current cost field, which we initialize to the cost of performing the query using sequential scan. At this point, we also initialize an empty set of suggested indexes S. For a given query-index pair, we aggregate the number of matches we find in the multi-dimensional histogram, looking only at the attributes in the query that also occur in the index (bits associated with other attributes are considered to be don't-cares in the query matching logic). To estimate the query cost, we then apply a cost function based on the number of matches we obtain using the index and the dimensionality of the index. At the end of this step, our abstract query set representation has estimated costs for each index that could improve the query cost.

• Cost Function

A cost function is used to estimate the cost associated with using a certain index for a query. The cost function can be varied to accurately reflect a cost model for the database system. For example, one could apply a cost function that amortized the cost of loading an index over a certain number of queries, or use a function tailored to the type of index that is used. Many cost functions have been proposed over the years. For an R-Tree, which is the index type used for this work, the expected number of data page accesses is estimated in [22] by:

$$A_{nn,mm,FBF} = \left( \sqrt[d]{\frac{1}{C_{eff}}} + 1 \right)^{d}$$

where d is the dimensionality of the dataset and C_eff is the number of data objects per disk page. However, this formula assumes the number of points N approaches infinity and does not consider the effects of high dimensionality or correlations. A more recently proposed cost model is given in [8], where the expected number of page accesses is determined as:

$$A_{r,em,ui}(r) = \left( 2r \cdot \sqrt[d]{\frac{N}{C_{eff}}} + 1 - \sqrt[d]{\frac{1}{C_{eff}}} \right)^{d}$$

where r is the radius of the range query, d is the dataset dimensionality, N is the number of data objects, and C_eff is the capacity of a data page.

While these published cost estimates can be effective for estimating the number of page accesses associated with using a multi-dimensional index structure under certain conditions, they have certain characteristics that make them less than ideal for the given situation. Each of the cost estimate formulas requires a range radius. Therefore the formulas break down when assessing the cost of a query that is an exact match query in one or more of the query dimensions. These cost estimates also assume that data distribution is independent between attributes, and that the data is uniformly distributed throughout the data space.

In order to overcome these limitations, we apply a cost estimate that is based on the actual matches that occur over the multi-dimensional histogram over the attributes that form a potential index. The cost model for R-trees we use in this work is given by $d^{d/2} \cdot m$, where d is the dimensionality of the index and m is the number of matches returned for query matching attributes in the multi-dimensional histogram. Using actual matches

eliminates the need for a range radius and incorporates both data correlation between attributes and data distribution. It also ties the cost estimate to the actual data characteristics (i.e., while the published models will produce results that are dependent only on the range radius for a given index structure). The cost estimate provided is conservative in that it will provide a result that is at least as great as the actual number of matches in the database, and we assumed that we would retrieve data from disk for all query matches. By evaluating the number of matches over the set of attributes that match the query, the multi-dimensional subspace pruning that can be achieved using different index possibilities is taken into account. A penalty is imposed on a potential index by the dimensionality term. There is additional cost associated with higher dimensionality indexes due to the greater number of overlaps of the hyperspaces within the index structure, and additional cost of traversing the higher dimension structure. Given equal ability to prune the space, a lower dimensional index will translate into a lower cost.

The cost function could be more complicated in order to more accurately model query costs, for example by crediting complete attribute coverage for coverage queries. It could also reflect the appropriate index structures used in the database system, such as B+-trees. We used this particular cost model because the index type was appropriate for our data and query sets.

Index Selection Loop

After initializing the index selection data structures and updating estimated query costs for each potentially useful index for a query, we use a greedy algorithm that takes into account the indexes already selected to iteratively select indexes that would be

appropriate for the given query workload and data set. For each index in the potential index set P, we traverse the queries in query set Q that could be improved by that index and accumulate the improvement associated with using that index for each query. The improvement for a given query-index pair is the difference between the cost of using the index and the query's current cost. If the index does not provide any positive benefit for the query, no improvement is accumulated. The potential index i that yields the highest improvement over the query set Q is considered to be the best index. Index i is removed from potential index set P and is added to suggested index set S. For the queries that benefit from i, the current query cost is replaced by the improved cost.

At the end of a loop iteration, a check is made to determine if the index selection loop should continue. The input indexing constraints provide one of the loop stop criteria. The indexing constraint could be any constraint such as the number of indexes, total index size, or total number of dimensions indexed. If no potential index yields further improvement, or the indexing constraints have been met, then the loop exits. The set of suggested indexes S contains the results of the index selection algorithm.

After each i is selected, when possible, we prune the complexity of the abstract representations in order to make the analysis more efficient. This includes actions such as eliminating potential indexes that do not provide better cost estimates than the current cost for any query, and pruning from consideration those queries whose best index is already a member of the set of suggested indexes. The overall speed of this algorithm is coupled with the number of potential indexes analyzed, so the analysis time can be reduced by increasing the support or decreasing the confidence.
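The greedy loop described above can be summarized by the following sketch; `estimate_cost` stands in for the histogram-based cost model (e.g., d^(d/2) · m), and the argument formats are assumptions for illustration rather than the actual implementation.

```python
def select_indexes(queries, potential_indexes, estimate_cost, seq_scan_cost,
                   max_indexes=None):
    """Greedy index selection: repeatedly pick the candidate index with the
    largest accumulated improvement over the current per-query costs."""
    current_cost = [seq_scan_cost] * len(queries)
    suggested = []
    remaining = list(potential_indexes)

    while remaining and (max_indexes is None or len(suggested) < max_indexes):
        best_pos, best_benefit = None, 0.0
        for pos, idx in enumerate(remaining):
            benefit = sum(max(0.0, current_cost[i] - estimate_cost(q, idx))
                          for i, q in enumerate(queries))
            if benefit > best_benefit:
                best_pos, best_benefit = pos, benefit

        if best_pos is None:                       # no further improvement
            break

        chosen = remaining.pop(best_pos)
        suggested.append(chosen)
        for i, q in enumerate(queries):            # lock in improved costs
            current_cost[i] = min(current_cost[i], estimate_cost(q, chosen))

    return suggested
```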

we use control feedback to monitor the performance of the current set of indexes for incoming queries and determine when adjustments should be made to the index set.Different strategies can be used in selecting a best index. the output of a system is monitored and based on some function involving the input and output. By monitoring the query workload and detecting when there is a change on the query pattern that generated the existing set of indexes. this may result in recommending a lower-dimensioned index and later in the algorithm a higher-dimensioned index that always performs better. 3. 2. If the indexing constraint is based on total index size. we are able to maintain good performance as query patterns evolve. then benefit per index size unit may be a more appropriate measure. The recommendation set can be pruned in order to avoid recommending an index that is non-useful in the context of the complete solution.3. The strategy provided assumes a indexing constraint based on the number of indexes and therefore uses the total benefit derived from the index as the measure of index ’goodness’. the input to the system is readjusted through a control feedback loop. However. In our approach. Our system input is a set of indexes and a set of incoming queries rather than a simple input. Our situation is analogous but more complex than the typical electrical circuit control feedback system in several ways: 1. In a typical control feedback system. such as an electrical signal. The system output must be some parameter that we can measure and use to make decisions about changing the input. Query performance is the obvious 71 .4 Proposed Solution for Online Index Selection The online index selection is motivated by the fact that query patterns can change over time.

parameter to monitor. However, because lower query performance could be related to other aspects rather than the index set, our system simulates and estimates costs for the execution of incoming queries. System output is the ratio of potential system performance to actual system performance in terms of database page accesses to answer the most recent queries. 3. We do not have a predictable function to relate system input and output because of the non-determinism associated with new incoming queries. For example, we may have a set of attributes that appears in queries frequently enough that our system indicates that it is beneficial to create an index over those attributes, but there is no guarantee that those attributes will ever be queried again. Therefore, our decision making control function must necessarily be more complex than a basic control system.

Control feedback systems can fail to be effective with respect to response time. The control system can be too slow to respond to changes, or it can respond too quickly. If the system is too slow, then it fails to cause the output to change based on input changes in a timely manner. If it responds too quickly, then the output overshoots the target and oscillates around the desired output before reaching it. Both situations are undesirable and should be designed out of the system. Figure 3.2 represents our implementation of dynamic index selection. We implement two control feedback loops. One is for fine grain control and is used to recommend minor, inexpensive changes to the index set. The other loop is for coarse control and is used to avoid very poor system performance by recommending major index set changes. Each control feedback loop has decision logic associated with it.

Figure 3.2: Dynamic Index Analysis Framework

System Input
The system input is made up of new incoming queries and the current set of indexes I, which is initialized to be the suggested indexes S from the output of the initial index selection algorithm. For clarity, a notation list for the online index selection is included as Table 3.3.

System
The system simulates query execution over a number of incoming queries. The abstract representation of the last w queries is stored as W, where w is an adjustable window size parameter. This representation is similar to the one kept for query set Q in the static index selection. W is used to estimate performance of a hypothetical set of indexes Inew against the current index set I. When a new query q arrives, we determine which of the current indexes in I most efficiently answers

this query and replace the oldest query in W with the abstract representation of q. We also incrementally compute the attribute sets that meet the input support and confidence over the last w queries. Consider the possible new indexes Pnew to be the set of attribute sets that currently meet the input support and confidence over the last w queries. The system also keeps track of the current potential indexes P and the current multi-dimensional histogram H.

  Symbol   Description
  I        Current set of attribute sets used as indexes
  Inew     Hypothetical set of attribute sets used as indexes
  w        Window size
  W        Abstract representation of the last w queries
  q        The current query under analysis
  P        Current Potential Indexes: the set of attribute sets in consideration to be indexes before q arrives
  Pnew     New Potential Indexes: the set of attribute sets in consideration to be indexes after q arrives
  iq       The attribute set estimated to be the best index for query q

Table 3.3: Notation List for Online Index Selection

System Output
In order to monitor the performance of the system, we compare the query performance using the current set of indexes I to the performance using a hypothetical set of indexes Inew. This information is used in the control feedback loop decision logic. The query performance using I is the summation of the costs of queries using the best index from I for the given query. The hypothetical cost is calculated differently based on the comparison of P and Pnew and the identified best index iq from P or Pnew for the new incoming query:

1. P = Pnew and iq is in I. In this case we bypass the control loops, since we could do no better for the system by changing possible indexes.

2. P = Pnew and iq is not in I. The hypothetical cost is the cost over the last w queries using Inew.

3. P differs from Pnew and iq is in I. In this case we bypass the control loops, since we could do no better for the system by changing possible indexes.

4. P differs from Pnew and iq is not in I. We recompute a new set of suggested indexes Inew over the last w queries. We traverse the last w queries and determine those queries that could benefit from using a new index from Pnew. We compute the hypothetical cost of these queries to be the real number of matches from the database. Hypothetical cost for other queries is the same as the real cost.

The ratio of the hypothetical cost, which indicates potential performance, to the actual performance is used in the control loop decision logic.

Fine Grain Control Loop
The fine grain control loop is used to recommend low cost, minor changes to the index set. This loop is entered in case 2, as described above, when the ratio of hypothetical performance to actual performance is below some input minor change threshold. Then the indexes are changed to Inew, and appropriate changes are made to update the system data structures. Increasing the input minor change threshold causes the frequency of minor changes to also increase.
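The case analysis and threshold tests above can be captured in a few lines of control logic. The sketch below is a simplified illustration under the assumption that the hypothetical and actual costs over the window have already been computed; the helper names are invented for the example, and the real system's bookkeeping (updating W, P, and the histogram) is richer than shown.

    def choose_control_action(P, P_new, i_q, I, hyp_cost, actual_cost,
                              minor_threshold, major_threshold):
        """Decide which control loop (if any) should fire for the latest query window."""
        if i_q in I:
            # Cases 1 and 3: the best index is already materialized; changing the
            # candidate pool cannot help, so both control loops are bypassed.
            return "none"
        ratio = hyp_cost / actual_cost      # potential performance vs. actual performance
        if P == P_new:
            # Case 2: the candidate pool is unchanged, but a better index exists in it.
            return "fine_grain" if ratio < minor_threshold else "none"
        # Case 4: the frequent attribute sets themselves have changed.
        return "coarse" if ratio < major_threshold else "none"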

Coarse Control Loop
The coarse control loop is used to recommend more costly changes, but changes with greater impact on the future performance of the index set. This loop is entered in case 4, as described above, when the ratio of hypothetical performance to actual performance is below some input major change threshold. Then the static index selection is performed over the last w queries, abstract representations are recomputed, and a new set of suggested indexes Inew is generated. Appropriate changes are made to update the system data structures to the new situation. Increasing the input major change threshold increases the frequency of major changes.

3.3.5 System Enhancements
In the following subsections, we present two system enhancements that provide further robustness and scalability to our framework.

Self Stabilization
Control feedback systems can be either too slow or too responsive in reacting to input changes. In our application, a system that is slow to respond results in recommending useful indexes long after they first could have had a positive effect. It could also fail to recommend potentially useful indexes if thresholds are set so that the system is insensitive to change. A system that is too responsive can result in system instability, where the system continuously adds and drops indexes. The system performance and rate of index change can be monitored and used in order to tune the control feedback system itself. If the actual number of query results is much lower than the estimated cost using the recommended index set over a window of queries, this indicates a non-responsive system. When this condition is

detected, the system can be made more responsive by cutting the window size, and we can reperform the analysis applying a reduced support. This gives us a more complete P with respect to answering a greater portion of queries, and it increases the probability that Pnew will be different from P. If the frequency of index change is too high with little or no improvement in query performance, an oversensitive system or an unstable query pattern is indicated. In this case, we can reduce the sensitivity by increasing the window size or increasing the support level during recomputation of new recommended indexes.

Partitioning the Algorithm
The system can also be applied to environments where the database is partitioned horizontally or vertically at different locations. A solution at one extreme is to maintain a centralized index recommendation system. This would maintain the abstract representation and collect global query patterns over the entire system. A single set of indexes would be recommended for the entire system, and recommended indexes would be determined based on global trends. This approach would allow for the creation of indexes that maximize pruning over the global set of dimensions. However, it would not optimize query performance at the site level. At the other extreme would be to perform the index recommendation at each site. In this case, a different abstract representation would be built based on the data at each specific site, given requests to the data at that site. Indexes would be recommended based on the query traffic to the data at that site. This allows tight control of the indexes to reflect the data at each site. However, index-based pruning based on attributes and data at multiple locations would not be possible.

This approach also requires traffic to each site that contains potentially matching data. A hybrid approach can provide the benefits of index recommendation based on global data and query patterns while optimizing the index set at different requesting locations. A global set of indexes can be centrally recommended based on a global abstract representation of the data set with low support thresholds (the centralized location will store a larger set of globally good indexes). The query patterns emanating from each site can be mined in order to find which of those global indexes would be appropriate to maintain at the local site, eliminating some traffic.

3.4 Empirical Analysis

3.4.1 Experimental Setup

Data Sets
Several data sets were used during the performance of experiments. The variation in data sets is intended to show the applicability of our algorithm to a wide range of data sets and to measure the effect that data correlation has on results. Data sets used include:

• random - a set of 100,000 records consisting of 100 dimensions of uniformly distributed integers between 0 and 999. The data is not correlated.

• stocks - a set of 6500 records consisting of 360 dimensions of daily stock market prices. This data is extremely correlated.

• mlb - a set of 33,619 records of major league pitching statistics from between the years of 1900 and 2004, consisting of 29 dimensions of data. Some dimensions are correlated with each other, while others are not at all correlated.

Query Workloads
It is desirable to explore the general behavior of database interactions without inadvertently capturing the coupling associated with using a specific query history on the same database. Therefore, query workload files were generated by merging synthetic query histories and query histories from real-world applications with different data sets. Histories and data were merged by taking a random record from a data set and the numerical identifiers of the attributes involved in the synthetic or historical query in order to generate a point query. So, if a historical query involved the 3rd and 5th attribute, and the nth record was randomly selected from the data set, a SELECT type query is generated from the data set where the 3rd attribute is equal to the value of the 3rd attribute of the nth record and the 5th attribute is equal to the value of the 5th attribute of the nth record. This gives a query workload that reflects the attribute correlations within queries, and it has a variable query selectivity. Unless otherwise stated, each query in the experiments is a point query with respect to the attributes covered by the query; attributes that are not covered by the query can be any value and still match the query.
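The merging of a historical query with a randomly drawn record can be illustrated with a tiny helper. This is only a hedged sketch of the generation step described above; the function name and data layout are assumptions made for the example.

    import random

    def make_point_query(dataset, history_query_attrs):
        """Merge a historical query's attribute ids with a randomly drawn record
        to produce a point query. `dataset` is a list of records (lists of
        values); `history_query_attrs` are the attribute indexes queried."""
        record = random.choice(dataset)
        # The generated query constrains exactly the attributes the historical
        # query touched, using the values of the randomly selected record.
        return {attr: record[attr] for attr in history_query_attrs}

    # Example: a historical query over the 3rd and 5th attributes (0-based 2 and 4)
    # q = make_point_query(dataset, [2, 4])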

These query histories form the basis for generating the query workloads used in our experiments:

1. clinical - 12,659 queries executed from a clinical application. The query distribution file has 64 distinct attributes. Due to the size of this query set, some initial portion of the queries is used for some experiments.

2. hr - 2,860 queries executed from a human resources application. The query distribution file has 54 attributes.

3. synthetic - 500 randomly generated queries. The distribution of the queries over the first 200 queries is: 20% involve attributes {1,2,3,4} together, 20% {5,6,7}, 20% {8,9}, and the remaining 40% involve between 1 and 5 attributes that could be any attribute. Over the last 300 queries, the distribution shifts to 20% covering attributes {11,12,13,14}, 20% {15,16,17}, 20% {18,19}, and the remaining queries involve between 1 and 5 attributes that could be any attribute.

Analysis Parameters
The effect of varying several analysis input parameters, including support, multidimensional histogram size, and online indexing control feedback decision thresholds, was analyzed. Unless otherwise specified, the confidence parameter for the experiments is 1.0.

3.4.2 Experimental Results

Index Selection with Relaxed Constraints
Figure 3.3 shows how accurately the proposed static index selection technique can perform with relaxed analysis constraints. For this scenario, a low support value, 0.01, is used, and 256 bits are used for the multi-dimension histogram. There are no constraints placed on the number of selected indexes. Figure 3.3 shows the cost of performing a sequential scan over 100 queries using the indicated data sets, the estimated cost of using recommended indexes, and the true number of answers for

the set of queries. For comparative purposes, costs for indexes proposed by AutoAdmin and a naive algorithm are also provided. The naive algorithm uses indexes for the first n most frequent itemsets in the query patterns, where n is the number of indexes suggested by the proposed algorithm. Note that the cost for sequential scan is discounted to account for the fact that sequential access is significantly cheaper than random access. A factor of 10 is applied as the ratio of random access cost to sequential access cost. While the actual ratio of a given environment is variable, regardless of any reasonable factor used, the graph will show the cost of using our indexes much closer to the ideal number of page accesses than the cost of sequential scan.

Figure 3.3: Costs in Data Object Accesses for Ideal, Sequential Scan, Naive, AutoAdmin, and the Proposed Index Selection Technique using Relaxed Constraints

Table 3.4 compares the proposed index selection algorithm with relaxed constraints against SQL Server's index selection tool, AutoAdmin [15], using the data sets and query workloads indicated.

  Data Set/Workload   Analysis Tool   Time(s)   % Queries Improved   Number of Indexes
  stock/clinical      AutoAdmin        450         100                  23
  stock/clinical      Proposed         110         100                  18
  stock/hr            AutoAdmin        338         100                  20
  stock/hr            Proposed         160         100                  16
  mlb/clinical        AutoAdmin         15           0                   0
  mlb/clinical        Proposed         522          87                  16

Table 3.4: Comparison of proposed index selection algorithm with AutoAdmin in terms of analysis time and % queries improved

For the stock dataset using both the clinical and hr workloads, both algorithms suggest indexes which will improve all of the queries. These are all the queries that have selectivities low enough that an index can be beneficial. SQL Server's index recommendations are all single-dimension indexes for all attributes that appear in the workload. The proposed algorithm executes in less time and generates indexes which are more representative of the query patterns in the workload and allow for greater subspace pruning. The most common query in the clinical workload is a query over both attributes 11 and 44 together, and our first index recommendation is a 2-dimension index built over attributes 11 and 44. Since the selectivity of these queries is low (the queries return a low number of matches using any index that contains a queried attribute), the amount of the query improvement will be very similar using either recommended index set. For the mlb dataset, SQL Server quickly recommended no indexes. Our index selection takes longer in this instance, but it finds indexes that improve 87% of the queries. It should be noted that the indexes selected by AutoAdmin are appropriate based on the cost models for the underlying index structure used in

SQL Server. Since the underlying index type for SQL Server is B+-trees, which do not index multiple dimensions concurrently, a single dimensional index is not able to prune the subspace enough to make the application of that index worthwhile compared to sequential scan. For this example, the single dimension indexes that are recommended are the least costly possible indexes to use for the query set.

Effect of Support and Confidence
Table 3.5 presents results on analysis complexity and expected query improvement as support and confidence are varied. The results are shown for the stock data set over the clinical query workload. Results show the total number of query-index pairs analyzed over the query set in the index selection loop and the total estimated improvement in query performance, in terms of data objects accessed over the query set, as the index selection parameters vary. Using a confidence level of 0% is equivalent to using the maximal itemsets of attributes that meet the support criteria as recommended indexes. As confidence decreases, we maintain fewer potential indexes in P and need to analyze fewer attribute sets per index. This decreases analysis time but shows very little effect on overall performance. For this particular experiment, the strategy yields nearly identical estimated cost although only 34-44% of the query/index pairs need to be evaluated.

Baseline Online Indexing Results
A number of experiments were performed to demonstrate the effectiveness and characteristics of the adaptive system. In each of the experiments that show the performance of the adaptive system over a sequence of queries, an initial set of indexes is recommended based on the first 100 queries given the stated analysis parameters.

  Support                        2       4       6       8      10
  Query/Index Pairs
    Confidence 100             229     198     185     178     170
    Confidence 50              229     198     185     178     170
    Confidence 0                90      85      73      66      58
  Total Improvement
    Confidence 100           57710   55018   46924   42533   37527
    Confidence 50            57710   55018   46924   42533   37527
    Confidence 0             57603   55001   46907   42516   37510

Table 3.5: Comparison of Analysis Complexity and Query Performance as support and confidence vary, stock dataset, clinical workload

New queries are evaluated and, depending on the parameters of the adaptive system, changes are made to the index set. These index set changes are considered to take place before the next query. The estimated costs of the last 100 queries, given the indexes available at the time of the query, are accumulated and presented. This demonstrates the evolving suitability of the current index set to incoming queries. A number of data set and query history combinations are examined.

Figures 3.4 through 3.6 show a baseline comparison between using the query pattern change detection and modifying the indexes versus making no change to the initial index set. The baseline parameters used are 5% support, an indexing constraint of 10 indexes, 128 bits for each multi-dimension histogram entry, a window size of 100, a major change threshold of 0.9, and a minor change threshold of 0.95. Figure 3.4 shows the comparison for the random data set using the synthetic query workload. At query 200, this workload drastically changes, and this is evident in the query cost for the static system. The performance gets much worse and does not get better for the static system. For the online system, the performance degrades a little

when the query patterns are changing and then improves again once the indexes have changed to match the new query patterns.

Figure 3.4: Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, synthetic Query Workload

The synthetic query workload was generated specifically to show the effect of a changing query pattern. Figure 3.5 shows a comparison of performance for the random data set using the clinical query workload. This real query workload also changes substantially before reverting back to query patterns similar to those on which the static index selection was performed. Performance for the static system degrades substantially when the patterns change, until the point that they change back. The adaptive system is better able to absorb the change.

Figure 3.5: Baseline Comparative Cost of Adaptive versus Static Indexing, random Dataset, clinical Query Workload


Figure 3.6: Baseline Comparative Cost of Adaptive versus Static Indexing, stock Dataset, hr Query Workload

Figure 3.6 shows the comparison for the stock data set using the first 2000 queries of the hr query workload. The adaptive system shows consistently lower cost than the static system and in places significantly lower cost for this real query workload. The effect of range queries was also explored. The random data set was examined using the synthetic workload using ranges instead of points. As the ranges for a given attribute increased from a point value to 30% selectivity, the cost saved for top 3 proposed indexes (which were the index sets [1,2,3,4], [5,6,7], and [8,9] for all ranges), the overall cost saved decreased. This is due to the increase in the size of the answer 87

set. When the range for each attribute reached 30%, no index was recommended because the result set became large enough that sequential scan was a more effective solution. Online Index Selection versus Parametric Changes Changing online index selection parameters changes the adaptive index selection in the following ways: Support - decreasing the support level has the effect of increasing the potential number of times the index set will be recalculated. It also makes the recalculation of the recommended indexes themselves more costly and less appropriate for an online setting. Figure 3.7 shows the effect of changing support on the online indexing results for the random data set and synthetic query workload, while Figure 3.8 shows the effect on the stock data set using the first 600 queries of hr query workload. These graphs were generated using the baseline parameters and varying support levels. As expected, lower support levels translate to lower initial costs because more of the initial query set are covered by some beneficial index. For the synthetic query workload, when the patterns changed, but do not change again (e.g. queries 100-200 in Figure 3.7), lower support levels translated to better performance. However, for the hr query workload, the 10% support threshold yields better performance than the 6% and 8% thresholds for some query windows. The frequent itemsets change more frequently for lower support levels, and these changes dictate when decisions occur. These runs made decisions at points that turned out to be poor decisions, such as eliminating an index that ended up being valuable. Little improvement in cost performance is achieved between 4% support and 2% support. Over-sensitive control feedback can


random Dataset. Online Indexing Control Feedback Decision Thresholds .7: Comparative Cost of Online Indexing as Support Changes.Figure 3.9 shows the effect of varying the major change threshold in the coarse control loop for the random 89 . independent of the extra overhead that oversensitive control causes.increasing the thresholds decreases the response time of affecting change when query pattern change does occur. In a real system. synthetic Query Workload degrade actual performance. Figure 3. It also has the effect of increasing the number of times the costly reanalysis occurs. one would need to balance the cost of continued poor performance against the cost of making an index set change.

stock Dataset.Figure 3.8: Comparative Cost of Online Indexing as Support Changes. hr Query Workload 90 .

This indicates that this value should be carefully tuned based on the expected frequency of query pattern changes. Multi − dimensional Histogram Size . Experiments demonstrated that if the representation quality of the histogram was sufficient. a value of 0 translates to never making a change.data set and synthetic query workload.11 shows an example of the effect of varying the multi-dimensional histogram size.increasing the number of bits used to represent a bucket in the multi-dimensional histogram improves analysis accuracy at the cost of histogram generation time and space. One reason for this is that artificially higher costs are calculated because of the lower histogram resolution. As well. very little benefit was achieved through greater resolution. Figure 3. 91 . The graphs show no real benefit once the major change threshold increases (and therefore frequency of major changes) beyond 0. Figure 3. These graphs were generated using the baseline parameters (except that the random graph uses a support of 10%).9 are best in terms of response time when a change occurs.6. Figure 3.9 shows that a value of 1 or 0. The major change threshold is varied between 0 and 1. Here. and a value of 1 means making a change whenever improvement is possible.3 shows the best performance once the query patterns have stabilized. some beneficial index changes are not recommended because of these overly conservative cost estimates.10 shows the effect of changing the major change threshold for the stock data set and hr query workload. and only varying the major change threshold in the coarse control loop. Using a 32 bit size for each unique histogram bucket yields higher costs than by using greater histogram resolution. but a value of 0.

Figure 3.9: Comparative Cost of Online Indexing as Major Change Threshold Changes. random Dataset. synthetic Query Workload 92 .

Figure 3. stock Dataset. hr Query Workload 93 .10: Comparative Cost of Online Indexing as Major Change Threshold Changes.

Figure 3. random Dataset.11: Comparative Cost of Online Indexing as Multi-Dimensional Histogram Size Changes. synthetic Query Workload 94 .

5 Conclusions A flexible technique for index selection is introduced that can be tuned to achieve different levels of constraints and analysis complexity. The technique uses a generated multi-dimension histogram to estimate cost. However. The foundation provided here will be used to explore this tradeoff and to develop an improved utility for real-world applications. which may not be able to take advantage of knowledge about correlations between attributes. the system can be tuned to favor analysis time or pattern change recognition. 95 . The proposed technique affords the opportunity to adjust indexes to new query patterns. Indexes are recommended in order to take advantage of multi-dimensional subspace pruning when it is beneficial to do so. and as a result is not coupled to the idiosyncrasies of a query optimizer. more complex analysis can lead to more accurate index selection over stable query patterns. A more constrained. A low constraint.3. A limitation of the proposed approach is that if index set changes are not responsive enough to query pattern changes then the control feedback may not affect positive system changes. this can be addressed by adjusting control sensitivity or by changing control sensitivity over time as more knowledge is gathered about the query patterns. less complex analysis is more appropiate to adapt index selection to account for evolving query patterns. These experiments have shown great opportunity for improved performance using adaptive indexing over real query patterns. By changing the threshold parameters in the control feedback loop. A control feedback technique is introduced for measuring performance and indicating when the database system could benefit from an index change.

the frequency of index set change has been affected through the use of the online indexing control feedback thresholds. It is not feasible to perform real-time analysis of incoming queries and generate new indexes when the patterns change. The kinds of low cost changes that this threshold would trigger are not the kinds of changes that make enough impact to be that much better than the existing index set. and one potential index is now more valuable than a suggested index.From initial experimental results. This would change if the indexing constraints were very low. This could mean moving an index from local storage to main memory. the fine grain control threshold was not triggered unless we contrived a situation such that it would be triggered. The proposed online change detection system utilized two control feedback loops in order to differentiate between inexpensive and more time consuming system changes. or from remote storage to local storage depending 96 . moved to active status. it seems that the best application for this approach is to apply the more time consuming no-constraint analysis in order to determine an initial index set and then apply a lightweight and low control sensitivity analysis for the online query pattern change detection in order to avoid or make the user aware of situations where the index set is not at all effective for the new incoming queries. In practice. Alternatively. Potential indexes could be generated prior to receiving new queries. the frequency of performance monitoring could be adjusted to achieve similar results and to appropriately tune the sensitivity of the system. Index creation is quite time-consuming. and when indicated by online analysis. In this thesis. This monitoring frequency could be in terms of either a number of incoming queries or elapsed time.


CHAPTER 4
Multi-Attribute Bitmap Indexes

The significant problems we have cannot be solved at the same level of thinking with which we created them. - Albert Einstein (attributed)

This chapter presents multi-attribute bitmap indexes. It describes the changes required when modifying a traditionally single attribute access structure to handle multiple attributes, in terms of enhancing the query processing model and adding cost-based index selection. An access structure built over multiple attributes affords the opportunity to eliminate more of a potential result set within a single traversal of the structure than a single attribute structure. However, this added pruning power comes with a cost. Query processing becomes more complex as the number of potential ways in which a query can be executed increases. The multi-attribute access structure may also not be effective for general queries, as a multi-attribute index can only prune results for those attributes that match selection criteria of the query.

4.1 Introduction
Bitmap indexes are widely used in data warehouses and scientific applications to efficiently query large-scale data repositories. They are successfully implemented in commercial Database Management Systems, e.g., Oracle [2], Informix [33], IBM [46],

and Sybase [20]. Bitmap indexes can provide very efficient performance for point and range queries thanks to fast bit-wise operations supported by hardware, and they have found many other applications such as visualization of scientific data [56]. In their simplest and most popular form, equality encoded bitmaps, also known as projection indexes, are constructed by generating a set of bit vectors where a '1' in the ith position of a bit vector indicates that the ith tuple maps to the attribute-value associated with that particular bit vector. Each attribute is encoded independently; the mapping can be used to encode a specific value, category, or range of values for a given attribute. Queries are then executed by performing the appropriate bit operations over the bit vectors needed to answer the query. A range query over a value-encoded bitmap can be answered by ORing together the bit vectors that correspond to the range for each attribute, and then ANDing together these intermediate results between attributes. The resultant bit vector has set bits for the tuples that fall within the range queried. One disadvantage of bitmap indexes is the amount of space they require. Compression is used to reduce the size of the overall bitmap index. General compression techniques are not desirable as they require bitmaps to be decompressed before executing the query. Run length encoders, on the contrary, allow query execution over the compressed bitmaps. The CPU time required to answer queries is related to the total length of the compressed bitmaps used in the query computation. The two most popular are the Byte-Aligned Bitmap Code (BBC) [2] and the Word Aligned Hybrid (WAH) [57].
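To make the OR-then-AND evaluation concrete, the following minimal sketch evaluates a conjunction of per-attribute ranges over uncompressed equality-encoded bit vectors. Python integers are used as bit vectors purely for illustration; the actual system operates over WAH-compressed vectors.

    def answer_range_query(bitmaps, query):
        """bitmaps[attr][value] is the equality-encoded bit vector for attr = value.
        query maps each queried attribute to the list of values in its range."""
        result = None
        for attr, values in query.items():
            # OR together the bit vectors covering the queried range of this attribute.
            attr_vector = 0
            for v in values:
                attr_vector |= bitmaps[attr][v]
            # AND the per-attribute intermediate results together across attributes.
            result = attr_vector if result is None else (result & attr_vector)
        return result   # set bits mark tuples satisfying every range predicate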

These traditional bitmap indexes are analogous to single dimension indexes in traditional indexing domains. As an example, consider a 2-dimensional query over a large dataset. In order to avoid a sequential scan over the data, we build an index access structure that can direct the search by pruning the space. If single dimensional indexes, such as B+-Trees or projection indexes, are available over the two attributes queried, the query would find candidates for each attribute and then combine the candidate sets to obtain the final answer. In the case of B+-Trees, the set intersection operation is used to combine the sets. In the case of projection indexes, the AND operation is used to combine the bitmaps. However, if a 2-dimensional index such as an R-tree or Grid File exists over the two queried attributes, then the data space can be pruned simultaneously over both dimensions. This results in a smaller result set if the two dimensions indexed are not highly correlated. Similarly, just as traditional multi-attribute indexes can be more effective in pruning the data search space and reducing query time, multi-attribute bitmap indexes could be applied to more efficiently answer queries. This cost savings comes from two sources. First, the number of bitwise operations is potentially reduced, as more than one attribute is evaluated simultaneously. Second, the bitvectors built over two or more attributes are sparser, which allows for greater compression. The resulting smaller bitmaps make the processing of the bitmap faster, which translates into faster query execution time. A multi-dimensional bitmap over a combination of attributes and values identifies those records that match the query combination and avoids a further bit operation in order to answer the query criteria. Also, a set intersection operation is not required.

4. Bitmap tables need to be compressed to be effective on large databases. determine the bitmap query execution steps that results in the best query performance with minimal overhead. Although the techniques incorporate heuristics in order to control analysis search space and query execution overhead. The two most popular compression techniques for bitmaps are the Byte-aligned Bitmap Code (BBC) [2] and the Word-Aligned Hybrid (WAH) code 101 . experiments show significant opportunity for query execution performance improvement. The goals of this research are to: • Evaluate of query performance of multi-attribute bitmap enhanced data management system compared to the current bitmap index query resolution model. • Support ad hoc queries where information about expected queries is not available and support the experimental analysis provided herein. We propose an expected cost-improvement based technique to evaluate potential attribute set combinations. The proposed approaches to attribute combination and query processing are based on measuring the expected query cost savings of options.2 Background Bitmap Compression. • Given a selection query over a set of attributes and a set of single and multiple attribute bitmaps available for those attributes.In this thesis. the use of Multi-Attribute Bitmaps is proposed to improve the performance of queries in comparison with traditional single attribute indexes.

The fourth group is less than 31 bits and thus is stored as a literal. Unlike traditional run length encoding.78*0. The most significant bit indicates the type of word we are dealing with. and each fill word represents a multiple of 31 bits.20*0. WAH imposes the word-alignment requirement on the fills.20*0. and the third line shows the hexadecimal representation of the groups.1 shows how the bitmap is divided into 31-bit groups. Figure 4. and the remaining (w − 2) bits store the fill length. these schemes mix run length encoding and direct storage. WAH is simpler because it only has two types of words: literal words and fill words. Let w denote the number of bits in a word. the lower (w − 1) bits of a literal word contain the bit values from the bitmap. we assume 32 bit words. [59]. then the second most significant bit is the fill bit.4*1. Under this assumption.4*1.6*0 400003C0 400003C0 62*0 00000000 00000000 80000002 10*0. followed by the 0 fill 102 . In this example. If the word is a fill.1: A WAH bit vector.128 bits 31-bit groups groups in hex WAH (hex) 1. The second line in Figure 4. they are represented as literal words (a 0 followed by the actual bit representation of the group).1 shows a WAH bit vector representing 128 bits. The second group contains a multiple of 31 0’s and therefore is represented as a fill word (a 1. BBC stores the compressed data in Bytes while WAH stores it in Words. Since the first and third groups do not contain greater than a multiple of 31 of a single bit value. each literal word stores 31 bits from the bitmap. This requirement is key to ensure that logical operations only access words. The last line shows the values of WAH words.21*1 001FFFFF 001FFFFF 9*1 000001FF 000001FF Figure 4.30*1 1.

Relationship Between Bitmap Index Size and Bitmap Query Execution Time. Using WAH compression.2: Query Execution Example over WAH compressed bit vectors. Therefore. In this paper we use WAH to compress the bitmaps due to the impact on query execution times. not shown. value. The fill word from B1 is a zero-fill word compressing 2 words.2. and it is determined that they are both literal words. First. Bit operations over the compressed WAH bitmap file have been shown to be faster than BBC (2-20 times) [57] while BBC gives better compression ratio. Logical operations are performed over the compressed bitmaps resulting in another compressed bitmap. two literal words and one fill word. we have two fill words. is needed to stores the number of useful bits in the active word. The fill word 80000002 indicates a 0-fill of two-word long (containing 62 consecutive zero bits). As an example. queries are executed by performing logical operations over the compressed bitmaps which result in another compressed bitmap.B1 B2 B1 AND B2 400003C0 00000100 00000100 80000002 C0000003 80000002 001FFFFF 00001F08 001FFFFF 000001FF 00000108 Figure 4. The fourth word is the active word. the first word of each bitvector is decoded. and the fill word from B2 is a one-fill 103 . Another word with the value nine. The first three words are regular words. When the next word is read from both bitmaps. consider the two WAH compressed bitvectors in Figure 4. followed by the number of fills in the last 30 bits). the logical AND operation is performed over the two words and the resultant word is stored as the first word in the result bitmap. it stores the last few bits that could not be stored in a regular word.

word compressing 3 words. In this case we operate over 2 words simultaneously by ANDing the fills and appending a fill word into the result bitmap that compresses two words (the minimum between the word counts). Since we still have one word left from B2, we only read a word from B1 this time, and AND it with the all 1 word from B2. Finally, the last two words are read from the two bitmaps and are ANDed together as they are literal words. A full description of the query execution process using WAH can be found in [59]. Query execution time is proportional to the size of the bitmap involved in answering the query [56]. To illustrate the effect of bit density and consequently bitmap size on query times over compressed bitmaps we perform the following experiment. We generated 6 uniformly random bitmaps representing 2 million entries for a number of different bit densities and ANDed the bitmaps together. Figure 4.3 shows average query time against the total bitmap length of the bitmaps used as operands in the series of operations. The expected size of a random bitmap with N bits compressed using WAH is given by the formula [59]:

    mR(d) ≈ (N / (w − 1)) * [1 − (1 − d)^(2w−2) − d^(2w−2)]

where w is the word size and d is the bit density. The lower the bit density, the more quickly and more effectively intermediate results can be compressed as the number of AND operations increases. Figure 4.3 shows a clear relationship between query times and bitmap length. The different slopes of the curve are due to changes in the frequency of the type of bit operations performed over the word-size chunks of the bitmap operands. At low bit densities, there is a greater portion of fill word to fill word operations. As the bit density increases, more fill word to literal word operations occur. At high bit densities, literal word to literal

Figure 4.3: Query Execution Times vs Total Bitmap Length, ANDing 6 bitmaps, 2M entries

word operations become more frequent. As these inter-word operations have different costs, the overall query time to bitmap length relationship is not linear. However, it is clear that query execution time is related to overall bitmap length.

Relationship Between Number of Bitmap Operations and Bitmap Query Execution Time. Figure 4.4 shows AND query times over WAH-compressed bitmaps as the number of bitmaps ANDed varies. Again the bitmaps are randomly generated with the reported bit densities for 2 million entries, and the number of bitmaps involved in an AND query is varied. There is a clear relationship between query times and the number of bitmap operations.
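The expected-size formula quoted above can be evaluated directly to see how strongly bit density drives compressed size. The small sketch below assumes 32-bit words and simply plugs densities into the formula; the example densities and numbers are illustrative and are not taken from the original experiments.

    def expected_wah_size(n_bits, density, w=32):
        """Expected number of words in a WAH-compressed random bitmap with
        n_bits bits and the given bit density (formula from [59])."""
        return (n_bits / (w - 1)) * (1 - (1 - density) ** (2 * w - 2)
                                       - density ** (2 * w - 2))

    # Example: 2 million entries at 1% versus 10% bit density
    # expected_wah_size(2_000_000, 0.01) -> roughly 3.0e4 words
    # expected_wah_size(2_000_000, 0.10) -> nearly incompressible, roughly 6.4e4 words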


Figure 4.4: Query Execution Times vs Number of Attributes queried, Point Queries, 2M entries

Since query performance is related to the total bitmap length and number of bitmap operations, a logical question is ”can we modify the representation of the data in such a way that we can decrease these factors?” One could combine attributes and generate bitmaps over multiple attributes in order to accomplish this. This is because a combination over multiple attributes has a lower bit density and is potentially more compressible, and already has an operation that may be necessary for query execution built into its construction. However, several research questions arise in this scenario such as which attributes should be combined, how can the additional indexes be utilized, and what is the additional burden added by the richer index sets.


consider a bitmap index built over two attributes. potentially decreasing the size of a bitcolumn. Such a representation can have the effect of reducing the size of the bitmaps that are used to answer a query and can also reduce the number of operations. Using WAH compression. Additionally. bitcolumns become sparser. As an illustrative example of enhancing compression from combining attributes. One could argue that multi-attribute bitmaps are reducing the dimensionality of the dataset but increasing the cardinality of the attributes involved in the new dataset. very 107 . and the opportunity for compressing that column increases. the advances in compression techniques and query processing over compressed bitmaps have made bitmaps an effective solution for attributes with cardinalities on the order of several hundreds [58]. Although bitmap indexes have historically been discouraged for high-cardinality attributes. In the table. Figure 4. into one 2-attribute bitmap with cardinality 4.b2 means that the value of attribute 1 maps to b1 and the value for attribute 2 maps to b2.5 shows a simple example combining two single attributes with cardinality 2.Technical Motivation for Multi-Attribute Bitmaps. where the data is uniformly randomly distributed independently for each attribute for 5000 tuples. One could think of multi-attribute bitmaps as the AND bitmap of two or more bitmaps from different attributes. we would expect 1 set bit for a value). if a bit operation is already incorporated in the multi-attribute bitmap construction. the combined column labeled b1. there are very few instances where there are long enough runs of 0’s to achieve any compression (since out of 10 random tuples. In such a test which matched this scenario. each with cardinality 10. then it does not need to be evaluated as part of the query execution. As attributes are combined.

As an alternative.ATT2=3). The time required is dependent on the total length of the bitmaps involved. The operations using the traditional bitmap processing to answer the query is ((ATT1=1) OR (ATT1=2)) AND (ATT2=3). We can expect much greater compression for this situation.b2 0 0 0 0 1 0 Figure 4. we can reduce 108 . we can generate a bitmap index for the combination of the two attributes.ATT2=3) OR (ATT1=2.5: Simple multiple attribute bitmap example showing the combination two single attribute bitmaps. for a total of 528 bytes. In this scenario. and attribute 2 must be equal to 3.b1 0 0 1 0 0 0 Combined b1. for a total of 2592 bytes.b2 b2. little compression was achieved and bitmaps representing each of the 20 attributevalue pairs were between 640 and 648 bytes long. the generated size of the 100 2-D bitmaps was between 220 and 400 bytes. For the multi-attribute case.b1 1 0 0 1 0 0 1 0 0 0 0 1 b2.Tuple t1 t2 t3 t4 t5 t6 Attribute 1 b1 b2 1 0 0 1 1 0 1 0 0 1 0 1 Attribute 2 b1 b2 0 1 1 0 1 0 0 1 0 1 1 0 b1. the query can be answered by (ATT1=1. Each of these bitmaps as well as the resultant intermediate bitmap from the OR operation is 648 bytes. In the given scenario. Now consider a range query such that attribute 1 must be between values 1 and 2. Now the cardinality of the combined attributes is 100. These bitmaps are 268 and 260 bytes respectively. and the expected number of set bits in a run of 100 tuples for a single value is 1.

This example demonstrates the opportunity to improve query execution by representing multiple attribute values from separate attributes in a single bitmap. Supplementing single attribute bitmaps with additional bitmaps that improve query execution time has also been done in [50], where lower resolution bitmaps were used in addition to the single attribute bitmaps, or high resolution bitmaps. One could think of that approach as storing the OR of several bitmaps from the same attribute together. In general, point queries would always benefit from the multi-attribute bitmaps as long as all the combined attributes are queried. However, if only some of the attributes were queried, or a large range over one or more attributes were queried, then the single attribute bitmaps would perform better than the multi-attribute bitmaps. This is the reason why we propose to supplement the single attribute bitmaps with the multi-attribute bitmaps and not to replace the single attribute bitmaps. Multi-attribute bitmaps would only be used in the queries where they result in an improved query execution time.

4.3 Proposed Approach
In this section we present our proposed solution for the problems presented in the previous section. The primary overall tasks can be summarized as multi-attribute bitmap index selection and query execution. In index selection, we determine which attributes are appropriate, or the most appropriate, to combine together. This can be performed with varying degrees of knowledge about a potential query set. Once candidate attribute combinations are determined, the combined bitmaps are generated.
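Generating a combined bitmap is conceptually just ANDing the constituent single attribute bit vectors for every value combination. The sketch below illustrates this under the simplifying assumption of uncompressed Python-integer bit vectors; the real system assigns each combined vector its identifier and stores it WAH-compressed.

    def build_multi_attribute_bitmaps(bitmaps, attr_a, attr_b):
        """Construct the 2-attribute bitmap index for (attr_a, attr_b) by ANDing
        every pair of single attribute bit vectors. Returns a dict keyed by
        (value_a, value_b)."""
        combined = {}
        for va, vec_a in bitmaps[attr_a].items():
            for vb, vec_b in bitmaps[attr_b].items():
                vec = vec_a & vec_b           # tuples matching both attribute values
                if vec:                       # skip empty value combinations
                    combined[(va, vb)] = vec
        return combined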

3.In query execution. that combination of attributes is a candidate for multi-attribute indexing. 4. The goal behind bitmap index selection is to find sets of attributes that. but can effectively prune the data space when combined together. If the form has categorical entries for color and model. Whenever multiple attributes are frequently queried together over a single bitmap value. we use 110 . the candidate pool of bitmaps are explored to determine a plan for query execution. Since query execution time is dependent both on the total size of the operands involved in a bitmap computation and also the number of bitmap operations. An effective index selection technique for bitmap indexes should be cognizant of the relationships between attributes. can yield a large benefit in terms of query execution time. The analogy within the traditional multi-attribute index selection domain would be finding attributes that are not individually effective in pruning data space. the selectivity estimated for a given attribute combination should accurately reflect the true data characteristics. As an example. we provide a cost improvement based candidate evaluation system.1 Multi-attribute Bitmap Index Selection for Ad Hoc Queries In many cases the attributes that should be combined together as multi-attribute bitmaps are obvious given the context of the data and queries. In order to support situations where no contextual information is available about queries and in order to allow for comparative performance evaluation. if combined. then a multi-attribute bitmap built over these two attributes can directly filter those data objects that meet both search criteria without further processing. As well. consider a database system behind a car search web form.

is a list of Attribute Sets and corresponding goodness measures selected .attributes 18: include as in selected Algorithm 3: Selecting Attribute Combination Candidates.card*aj . cardLimit) notes: attSets . j = (maxAtt(asi )+1) to |A| 7: if (asi .these measures to determine the potential benefit for a given attribute combination. 0] |ai ∈ A} 2: currentDim = 1 3: while (currentDim < dLimit AND newSetsAdded) 4: unset newSetsAdded 5: for each asi ∈ attSets. dLimit. Determining Index Selection Candidates SelectCombinationCandidates (Ordered set of attributes A. Algorithm 3 shows how candidate multi-attribute indexes are evaluated and selected in terms of potential benefit.are a set of recommended attribute sets to build indexes over 1: attSets = {[{ai }. |asi | = currentDim 6: for each attribute aj . set newSetsAdded 11: currentDim + + 12: sort attSets by goodness 13: ca = ∅ 14: while ca = A 15: attribute set as = next set from attSets 16: if as ∩ ca = ∅ 17: ca = ca ∪ as.t.card¡cardLimit) 8: asj = [asi ∪ {aj }. s. aj )] 9: if goodness(asj ) > goodness(asi ) 10: add asj to attSets. detGoodness(asi . 111 .

the sets in attSets are sorted by their goodness measures (line 12). These include an array. The algorithm evaluates increasingly larger attribute sets (loop starting at line 3). A. The card value associated with an attribute set is the number of value combinations possible for the attribute set. and reflects the number of bitmaps that would have to be generated to index based on the set. Then sets of attributes that are disjoint from the covered attributes. each potentially beneficial size 2 set is evaluated with relevant size 1 attributes (note that by evaluating only those singleton sets of attributes numbered larger than the largest in the current set. Once the loop terminates. Each unique combination is evaluated (lines 5-6). then the combination set is added to the list of attribute sets for further evaluation during the next loop iteration. If the number of bitmaps generated by this combination is excessive (line 7). The term newSetsAdded in line 3 is a boolean condition indicating if any attribute set was added during the last loop iteration. In the next iteration of the outer loop. Initially all size 1 attribute sets are combined with each other size 1 attribute set such that each unique size 2 attribute set is evaluated. of the attributes under analysis. The maximum cardinality allowed for a given prospective combination can be controlled by the cardLimit parameter. If the goodness of the combination exceeds the goodness of its ancestor. A goodness measure is evaluated for the proposed combination (line 8). 112 . The algorithm takes several parameters as input. The parameter dLimit provides a maximum attribute set combination size. we can evaluate all unique set instances).The algorithm is somewhat analogous to the Apriori method for frequent itemset mining in that a given candidate set is only evaluated if it’s subset is considered beneficial. the set will not be evaluated.

bcj ) savings = bck . The variables. 113 .length 7: if savings > 0 8: goodness+ = savings ∗ cb. are greedily selected from the sorted sets and added to the recommended bitmap indexes.creationLength − cb. The probSR parameter is the probability that the query range for an attribute will be small enough such that a multi-attribute index over the query criteria will still be beneficial. The probQT parameter represents the probability that an attribute is queried. a 2-attribute index is not as useful for a single attribute query.length + bcj . probQT and probSR describe global query characteristics and are used to estimate diminishing benefits as index dimensionality increases. 1: 2: 3: 4: 5: 6: The goodness of a particular attribute set is computed by Algorithm 4.ca. attribute aj ) set probQT and probSR goodness=0 for each bitcolumn bck in asi for each bitcolumn bcj in aj bitcolumn cb = AND(bck . As with traditional multi-attribute index. Estimating Performance Gain for a Potential Index detGoodness (attribute set asi .matches 9: goodness∗ = (probQT ∗ probSR)|asi |+1 10: return goodness Algorithm 4: Determine goodness of including an attribute in a multi-attribute bitmap.length+ bck .
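A compact way to read Algorithm 4 is as the following sketch, which mirrors its structure in ordinary code. The bitcolumn operations (AND, compressed length, creation length, and match count) are assumed to be provided by the bitmap library, and the probability parameters are the probQT and probSR inputs of the algorithm; this is an illustration rather than the actual implementation.

    def det_goodness(as_i_columns, a_j_columns, prob_qt, prob_sr, set_size):
        """Estimate the benefit of extending attribute set as_i with attribute a_j.
        Each *_columns argument is a list of bitcolumn objects exposing AND(),
        length, creation_length, and matches, as assumed in Algorithm 4."""
        goodness = 0.0
        for bc_k in as_i_columns:
            for bc_j in a_j_columns:
                cb = bc_k.AND(bc_j)
                # Savings: operand lengths (plus the cost of creating bc_k)
                # minus the length of the combined bitcolumn.
                savings = (bc_k.length + bc_j.length + bc_k.creation_length
                           - cb.length)
                if savings > 0:
                    goodness += savings * cb.matches   # weight by combination frequency
        # Discount by the probability that the whole combination is actually queried
        # with ranges narrow enough to benefit.
        return goodness * (prob_qt * prob_sr) ** (set_size + 1)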

minus the compressed length of the combined bitmap. The index selection solution space is controlled by limiting parameters and through 114 . Our approach takes into account the characteristics of the data by weighting and measuring the effectiveness of specific inter-attribute combinations. For size 1 attribute sets. Both of these factors can provide a performance gain depending on the query criteria. This strategy assumes that the query space is similar in distribution to the data space. Each combination is weighted by its frequency of occurrence. The savings of a particular combination is measured as the sum of the compressed lengths of the bitmap operands and the compressed length of the bitmaps needed to create the first operand.Each unique value combination of the entire attribute set (combinations generated in lines 4-5) is evaluated in terms of bitmap length savings over a corresponding single attribute bitmap solution. the bitcolumn creation length is 0. The total savings reflects the difference in total bitmap operand length to answer the combination under analysis. This technique finds attributes for which their combinations will lead to a reduction in the number of operations as well as a reduction in the constituent bitmap lengths. This is because this option will not be selected during the query execution for this combination and will not incur an additional penalty. for a multiple attribute bitmap this value is the length of all the constituent bitmaps used to create it. If the savings for a particular combination is negative. However. The final goodness measure for an attribute set is also modified based on the probability that benefit will actually be realized for the query (line 9). then its negative savings is not accumulated (line 7). The goodness of an attribute set is measured as the weighted average of the bitlength savings of each value combination (line 8).

greedy selection of potential solutions. we need a mechanism to decide when single attribute and/or multi-attribute bitmaps would be used.3. the potential execution plans must be evaluated and the option that is deemed better should be performed. Once multi-attribute bitmaps are available in the system. Later experiments show that the reduction in the number of total operations has a more substantial effect on query execution time than reduction of bitmap length. If the query is equally likely to occur in the data space. In order to take advantage of the most useful representation.2 Query Execution with Multi-Attribute Bitmaps One issue associated with using multiple representations for data indexes. Once bitmap combinations are recommended. To be effective. the complexity added by the additional representations can not be more costly than the benefits attained by incorporating them. Although the indexes recommended are not guaranteed to be optimal. (i.e. We take a number of measures in order to limit the 115 . is that the query execution becomes more complex. if index space is critical. each combined bitmap can be generated by assigning an appropriate identifier to the bitmap and ANDing together the applicable single attribute bitmaps. then the weighting in line 8 of the determine Goodness algorithm (Algorithm 4) can be removed. we can order the pairs in line 12 of the Select Combination Candidates algorithm (Algorithm 3) by goodness over combined attribute size. seemingly less than ideal attribute combinations can still provide performance gains). The behavior of the algorithm can be affected with slight alteration. they can be determined efficiently. 4. For example. we need to decide on a query execution plan for each query. In other words.

We explore potential query execution options using inexpensive operations.

Global Index Identification

In order to identify the potential index search space, we maintain a data structure of those attributes and combined attributes for which we have bitmap indexes constructed. We keep track of those attribute sets that have bitmap representations (either single attribute or multi-attribute). If we have available a single attribute bitmap for the first attribute, we will include a set identified as [1] in this structure. If we have available a multi-attribute bitmap built over attributes 1, 2, and 3, then we include the set [1.2.3] in the structure. The attributes are always stored in numerical order so that there is only one possible mapping for various set orders. Upon receiving a query, we compare the set representation of the attributes in the query to the sets in the index identification structure. Those sets that have non-empty intersections with the query set are candidates for use in query execution.

Bitmap Naming

We need to be able to easily translate a query to the set of bitmaps that match the criteria of a query. In order to do this we devise a one-to-one mapping that directly translates query criteria to bitmap coverage, and we use this unique mapping for bitmap identification. The key is made up of the attributes covered by the bitmap and the values for those attributes, with the values given in the same order as the attributes. For example, a multi-attribute bitmap over attributes 9 and 10, where the value for attribute 9 is 5 and the value for attribute 10 is 3, would get "[9.10]5.3" as its key. Finding relevant bitmaps is then a matter of translating the query into its key to access the bitmaps.
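The sketch below illustrates this naming convention and the creation of a combined bitcolumn by ANDing single-attribute bitcolumns. The uncompressed java.util.BitSet representation and the buildKey/combine helpers are simplifying assumptions for illustration; the actual system operates on compressed bitmap columns.

import java.util.BitSet;

// Illustrative sketch: unique keys of the form "[a1.a2...]v1.v2..." and
// creation of a multi-attribute bitcolumn by ANDing single-attribute bitcolumns.
public class BitmapNamingSketch {

    // Builds a key such as "[9.10]5.3" for attributes {9,10} with values {5,3}.
    // Attributes are assumed to be given in ascending numerical order.
    public static String buildKey(int[] attributes, int[] values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < attributes.length; i++) {
            if (i > 0) sb.append('.');
            sb.append(attributes[i]);
        }
        sb.append(']');
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append('.');
            sb.append(values[i]);
        }
        return sb.toString();
    }

    // ANDs the applicable single-attribute bitcolumns to produce the combined bitcolumn.
    public static BitSet combine(BitSet... singleAttributeColumns) {
        BitSet result = (BitSet) singleAttributeColumns[0].clone();
        for (int i = 1; i < singleAttributeColumns.length; i++) {
            result.and(singleAttributeColumns[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(buildKey(new int[]{9, 10}, new int[]{5, 3})); // prints [9.10]5.3
    }
}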

Query Execution

During query execution we need a lightweight method to check which potential query execution will provide the fastest answer to the query. The logic applied to determine the query execution plan needs to be lightweight so as to not add too much overhead to handle the increased complexity of bitmap representation. So rather than incorporating logic that guarantees an optimal minimal total bitmap length, we approximate this behavior using heuristics.

The first step is to compare the set of attributes in the query to the sets of attributes for which we have constructed bitmap indexes. Those indexes which cover an attribute involved in the query are candidates to be used as part of the solution for the query. Once candidate indexes are determined, we compute the total length of the bitmaps required to answer the query. We just need to sum the lengths of bitmaps that match the query criteria, which we can compute using a recursive function that handles arbitrary sized attribute sets. For the recursive case, the unique key is built up and passed to the recursive call, and the callee length results are tabulated; the complete unique hash key is generated in the base case of the recursive method, where the length of the matching bitmaps is obtained and returned to the caller. For example, if the total length of matching bitmaps for the multi-attribute bitmap over attributes [1.2] is less than the total length for both attributes [1] and [2], then the multi-attribute bitmap should be used to answer the query. However, if the multi-attribute bitmaps have greater total length, then the single attribute bitmaps should be used.
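A minimal sketch of such a recursive length computation is given below, assuming a map from unique keys to compressed bitmap lengths; the class and method names are illustrative, and the per-attribute query values are passed in as lists.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: total compressed length of all bitmaps of an attribute
// set that match a query, for arbitrary attribute set sizes. The key format
// mirrors the naming scheme, e.g. "[1.2]1.3".
public class MatchingLengthSketch {

    private final Map<String, Long> lengthByKey = new HashMap<>();

    public void register(String key, long compressedLength) {
        lengthByKey.put(key, compressedLength);
    }

    // queryValues.get(i) holds the values requested for attributes.get(i).
    public long totalMatchingLength(List<Integer> attributes, List<List<Integer>> queryValues) {
        StringBuilder prefix = new StringBuilder("[");
        for (int i = 0; i < attributes.size(); i++) {
            if (i > 0) prefix.append('.');
            prefix.append(attributes.get(i));
        }
        prefix.append(']');
        return recurse(prefix.toString(), queryValues, 0);
    }

    // Base case: the full key is assembled and its length is looked up.
    // Recursive case: extend the key with each requested value of the next attribute.
    private long recurse(String partialKey, List<List<Integer>> queryValues, int depth) {
        if (depth == queryValues.size()) {
            return lengthByKey.getOrDefault(partialKey, 0L);
        }
        long total = 0;
        for (int value : queryValues.get(depth)) {
            String next = (depth == 0) ? partialKey + value : partialKey + "." + value;
            total += recurse(next, queryValues, depth + 1);
        }
        return total;
    }
}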

We order the potential attribute sets in descending order based on the amount of length saved by using the index. As a baseline, all potential single attribute indexes get a value of 0 for the amount of length saved. Multi-attribute indexes get a value which is the difference between the length of the single attribute indexes that would be required to answer the query and the length of the multi-attribute index to answer the query. Those multi-attribute indexes that have positive benefit compared to single attribute indexes will have positive value, and will be processed before any single attribute index. Those that have negative values would actually slow down the query, and will not be processed before their constituent single attribute indexes. Then we process the query by greedily selecting non-covered attribute sets from this order until the query is answered.

Algorithm 5 provides the algorithm for executing selection queries when the bitmap index has both single attribute and multi-attribute bitmap indexes available to process the query. The algorithm assumes that the available bitmap indexes are identified and that we can access any bitcolumn through a unique hashkey. It involves two major steps. In the first step, the potential indexes are prioritized based on how promising they are with respect to saving total bitmap length during query execution. The second step performs the bit operations required to answer the query. In lines 1-5, we compute the total size of the bitmaps that match the query, considering those attributes that intersect between the query and the index attribute set currently under analysis. After this step, we will have a total query matching bitmap length for each available index. As an example, consider a query over values 1 and 2 for attribute 1 and over value 3 for attribute 2, where we have bitmaps available for the attribute sets [1], [2], and [1.2] combined. Table 4.1, at the end of this section, shows the query matching keys for each of the potential indexes.

SelectionQueryExecution (Available Indexes AI, BitColumn Hashtable BI, query q)
notes:
AI - contains identification of sets of indexed attributes, with length and saved fields and a query matching keys list
BI - provides access to a bitcolumn using its unique key
B1 - a bitcolumn of all 1's

// Prioritize Query Plan
1: for each attribute set as in AI
2:   if as.attributes ∩ q.attributes ≠ ∅
3:     generate query matching keys for as
4:     attach query matching keys to as
5:     sum query matching bitmap lengths
6: for each attribute set as in AI
7:   if as.numAtts == 1
8:     as.saved = 0
9:   else
10:    for each size 1 attribute subset a in as
11:      as.saved += a.length
12:    as.saved -= as.length
13: sort AI by saved field
// Execute Query
14: attribute set ca = null
15: queryResult = B1
16: while ca ≠ q.attributes
17:   as = next attribute set from AI
18:   if as.attributes ∩ ca = ∅
19:     ca = ca ∪ as.attributes
20:     subresult = BI.get(as.keys[0])
21:     for i = 1 to |as.keys| - 1
22:       subresult = OR(subresult, BI.get(as.keys[i]))
23:     queryResult = AND(queryResult, subresult)
24: return queryResult

Algorithm 5: Query Execution for Selection Queries when Combined Attribute Bitmaps are Available.

Available Index   Matching Keys
[1]               [1]1, [1]2
[2]               [2]3
[1.2]             [1.2]1.3, [1.2]2.3

Table 4.1: Sample Query Matching Keys for Query Attributes 1 and 2.

In lines 6-12, we use this initial information in order to compute the total bitmap length saved by using a multi-attribute bitmap in relation to using the relevant single attribute alternatives. All single attribute indexes are assigned a saved value of 0. For a multiple attribute index, we sum up the query matching bitmap length associated with each single attribute subset of the multi-attribute attribute set. The difference between this length and the total length of the query matching bitmaps for the multi-attribute set is the amount of bitmap length saved by using the multi-attribute set. Those multi-attribute indexes that reflect a cost savings will get a positive value for saved, while those that are more costly will get a negative value.

Note that the computation of bitmap length in line 5 is not actually a simple sum. The bit density of each matching bitmap is estimated based on its length. Since the matching bitmaps will not overlap, these bit densities are summed to arrive at an effective bit density of the resultant OR bitmap, and the length of the resultant bitmap is then derived from this effective bit density. If the bit density is within the range of about 0.1 to 0.9, its value can not be reliably determined from the length; in these cases, we conservatively estimate it as 0.5. This does not adversely affect query operation since the multi-attribute option would not be selected anyway.
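The sketch below illustrates the flavor of this estimate. The density-to-length mapping shown is a deliberately simple placeholder and not the exact compressed-size model used by the system; only the summing of disjoint densities and the conservative 0.5 fallback follow the description above.

// Illustrative sketch of the effective-bit-density estimate used when summing
// the lengths of the OR'ed matching bitmaps (line 5 of Algorithm 5).
public class DensityEstimateSketch {

    // Densities of disjoint (non-overlapping) bitmaps simply add.
    public static double effectiveDensity(double[] matchingBitmapDensities) {
        double sum = 0.0;
        for (double d : matchingBitmapDensities) {
            sum += d;
        }
        return Math.min(sum, 1.0);
    }

    // If an estimated density falls in the roughly incompressible range,
    // treat it as undetermined and conservatively use 0.5.
    public static double conservativeDensity(double estimated) {
        if (estimated > 0.1 && estimated < 0.9) {
            return 0.5;
        }
        return estimated;
    }

    // Placeholder mapping from density to expected compressed length (in words):
    // very sparse or very dense bitmaps compress well, mid-range bitmaps do not.
    public static long estimatedLength(double density, long numRows, int wordBits) {
        long maxWords = numRows / wordBits + 1;
        double incompressibility = 2.0 * Math.min(density, 1.0 - density);
        return Math.round(Math.min(1.0, incompressibility) * maxWords);
    }
}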

In line 13, we sort the potential indexes in terms of this saved value. In lines 14 and 15, we initialize a set that indicates the attributes covered by the query so far and initialize the query result to all 1's. Then we start processing the query in order of how promising the prospective indexes were, and continue until all query attributes are covered (line 16). We get the next best index and check if its attributes have already been covered (lines 17-18). If not, we process a subquery using the index: we update the covered attribute set and OR together all bitcolumns that match the query criteria covered by the index. We AND this result into the overall query result, building up the answer from the subqueries. When we have covered all query attributes, the query is complete and we return the result.

4.3.3 Index Updates

Bitmap indexes are typically used in write-once read-mostly databases such as data warehouses. Appends of new data are often done in batches and the whole index is rebuilt from scratch once the update is completed. Bitmap updates are expensive as all columns in the bitmap index need to be accessed. In [12], we propose the use of a padword to avoid accessing all bitmaps to insert a new tuple and instead update only the bitmaps corresponding to the inserted values. In other words, the number of bitmaps that need to be updated when a new tuple is appended in a table with A attributes is A, as opposed to Σ_{i=1..A} c_i, where c_i is the number of bitmap values (or cardinality) for attribute A_i. This approach, which clearly outperforms the alternative, is suitable when the update rate is not very high and we need to maintain the index up-to-date in an online manner.
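As a small worked example (using the synthetic data set described in Section 4.4, and assuming one bitcolumn per attribute value): with A = 12 attributes of cardinalities 2, 2, 2, 5, 5, 5, 10, 10, 10, 25, 25, and 25, a padded append touches only A = 12 bitmaps, whereas updating every bitcolumn would touch Σ c_i = 3·2 + 3·5 + 3·10 + 3·25 = 126 bitmaps. Pairing the attributes into multi-attribute bitmaps, as described next, halves the padded cost again to 6 bitmap accesses.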

Updating multi-attribute bitmaps is even more efficient than updating traditional bitmaps. If multi-attribute bitmaps are also padded, we only need to access the bitmaps corresponding to the combined values (attributes) that are being inserted. For example, consider the previous table with A attributes where multi-attribute bitmaps are built over pairs of attributes. As only A/2 bitmaps need to be accessed, the cost of updating the multi-attribute bitmaps is half of the cost of updating the traditional bitmaps.

4.4 Experiments

4.4.1 Experimental Setup

Experiments were executed using a 1.87GHz HP Pavilion PC with 2GB RAM and the Windows Vista operating system. The bitmap index selection, bitmap creation, and query execution algorithms were implemented in Java.

Data

Several data sets are used during the experiments. They are characterized as follows:

• uniform - a uniformly distributed random data set, 1M records, 12 attributes, each with cardinality 10.

• synthetic - 1M records, 12 attributes. Attributes 1-3 have cardinality 2, 4-6 have cardinality 5, 7-9 have cardinality 10, and 10-12 have cardinality 25. Attributes 1, 4, 7, and 10 follow a uniform random distribution; attributes 2, 5, 8, and 11 follow a Zipf distribution with the s parameter set to 1; and attributes 3, 6, 9, and 12 follow a Zipf distribution with the s parameter set to 2.

• census - a discretized version of the USCensus1990raw data set. The raw dataset is from the (U.S. Department of Commerce) Census Bureau website, and the discretized dataset comes from the UCI Machine Learning Repository; the data set can be obtained from [47]. We used the first 1 million instances and 44 attributes. This data set is not uniformly distributed.

• Landsat - a bin representation of satellite imagery data. It contains over 275000 records. The data set has 60 attributes, and the cardinality of these attributes varies from 2 to 16.

• HEP - a bin representation of High-Energy Physics data. This data set is made up of more than 2 million records. It has 12 attributes, and the cardinality per dimension varies between 2 and 12. There are a total of 122 bitcolumns for this data set.

Queries

Query sets were generated by selecting random points from the data and constructing point queries from those points. These points remain in the data set, so that it is guaranteed that there will be at least one match for each query. For those cases where range queries are used, the range is expanded around a selected point so that the range is guaranteed to contain at least one match.

Metrics

Our goal is to compare the performance of a system reflecting the current state of the art, where only single attribute bitmaps are available, against our proposed approach in which both single and multi-attribute bitmaps are available. We present comparisons in terms of query execution time, the number of bitmaps involved in answering a query, and the total bitmap size accessed to answer a query.
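For the query workloads just described, a minimal sketch of how such a workload could be generated is shown below; the data layout, the Random seed, and the range-expansion rule are assumptions made for illustration and are not the exact generator used for the reported experiments.

import java.util.Random;

// Illustrative sketch: build queries from rows that already exist in the data,
// so every generated query is guaranteed at least one match. Each attribute is
// randomly kept as a point or expanded to a range around the selected value.
public class WorkloadSketch {

    private final int[][] data;        // data[row][attribute] holds discretized values
    private final int[] cardinality;   // cardinality[attribute]
    private final Random rng = new Random(42);

    public WorkloadSketch(int[][] data, int[] cardinality) {
        this.data = data;
        this.cardinality = cardinality;
    }

    // A point query is simply the attribute values of a randomly selected row.
    public int[] pointQuery() {
        return data[rng.nextInt(data.length)].clone();
    }

    // rangeProbability is the chance that an attribute becomes a range of half
    // the attribute cardinality; ranges are expanded around the selected point
    // so they are guaranteed to contain at least one matching row.
    public int[][] mixedQuery(double rangeProbability) {
        int[] point = pointQuery();
        int[][] ranges = new int[point.length][2];
        for (int a = 0; a < point.length; a++) {
            if (rng.nextDouble() < rangeProbability) {
                int half = cardinality[a] / 2;
                int lo = Math.max(0, point[a] - half / 2);
                int hi = Math.min(cardinality[a] - 1, lo + half - 1);
                ranges[a][0] = lo;
                ranges[a][1] = hi;
            } else {
                ranges[a][0] = point[a];
                ranges[a][1] = point[a];
            }
        }
        return ranges;
    }
}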

Query execution assumes the bitmap indexes are in main memory. Further performance enhancement would be possible during query execution if the required bitmaps needed to be read from disk.

4.4.2 Experimental Results

Index Selection

We executed the index selection algorithm over the synthetic data set. We set the cardLimit parameter to 100 and set the query probability parameters to 1 (i.e. no degradation at higher numbers of attributes). The attribute sets that were selected to combine together were [1.4.7], [2.5.8], and [3.6.9]. Attributes 10, 11, and 12 do not combine with any others in this scenario because there are no free attributes that would be within the cardinality constraints. This solution makes sense in the context of the performance metric of weighted benefit. We would expect more benefit from combinations for which the number of value combinations approaches the cardinality limit, because a sparser bitmap can be more effectively compressed. We would also expect more benefit from combining attributes that are independent. Therefore set [1.4.7] is selected first, because its value combination cardinality is 100 and the values are independent; the other combinations have greater inter-attribute dependency because they follow Zipf distributions.

The average query execution time over 100 point queries using only single attribute bitmaps was 41.47ms, while it was 17.39ms using the recommended indexes. The time for the recommended index set includes an average overhead of 0.25ms to explore the richer set of options and select the best query execution plan.

After executing index selection with the parameters probQT and probSR set to 0.5, we instead get 2-attribute recommendations.

Each of the proposed 2-attribute index sets has a value combination cardinality greater than or equal to 50, and the selection order depends on cardinality and then on inter-attribute independence.

Query Execution

Controlled Query Types. We ran 6 query types over bitmaps built over 2 attributes with uniform random distribution and cardinality 10. The queries are detailed in Table 4.2 and varied in the length of the range. A range of 1, for example, means that the query for that attribute is a point query. Q1 is a point query on both attributes, and Q3 selects one value from the first attribute and five values from the second attribute. Note that in this case a range of 5 is the worst case scenario; any range bigger than five could take the complement of the non-queried range to answer the query.

         Range Length
Query    A1    A2
Q1       1     1
Q2       1     2
Q3       1     5
Q4       2     5
Q5       2     5
Q6       5     5

Table 4.2: Query Definitions for 2 Attributes with cardinality 10 and uniform random distribution.

Figures 4.6, 4.7, and 4.8 show the query execution time, the number of bitmaps involved in answering each query type, and the total size of the bitmaps involved in answering the query, respectively. In these figures, the 'Only 1-Dim' label refers to the case where only Single Attribute Bitmaps are available, the 'Only M-Dim' label refers to the

case where the queries are forced to use Multi-Attribute Bitmaps, and the ‘Proposed’ label refers to the case where the query execution plan is selected considering both indexes available. For queries Q1-Q4, the multi-attribute bitmaps are always selected, and they are significantly faster than the single-attribute options. For queries Q5-Q6, if single and multi-attribute bitmaps are available, single attribute is always selected. However, the best is always slightly faster than the proposed technique due to the overhead of evaluating and selecting the best option. For these queries, the overhead was on the order of 0.2ms. The proposed technique usually selects the option that involves the shortest original bitmaps required to answer the query. The exception here is Q5. The single dimension index is used because it involves fewer bitmap operations and involves less total bitmap operand length over the set of operations that need to take place.

Figure 4.6: Query Execution Time for each query type.


Figure 4.7: Number of bitmaps involved in answering each query type.

Query Execution for Point Queries. Figure 4.9 demonstrates the potential query speedup associated with our proposed query processing technique when multi-attribute bitmaps are available to answer queries. The single-D bar shows the average query execution time required to answer 100 point queries using traditional bitmap index query execution, where points are randomly selected from the data set (so there is at least one query match). For the multi-attribute test, indexes were selected and generated with a maximum set size of 2, so a suite of bitcolumns representing 2-attribute combinations is available for query execution as well as the single attribute bitmaps. Each single attribute is represented in exactly one attribute pair. The results show significant speedup in query execution for point queries. Point queries over the uniform, landsat, and hep datasets experience speedups by factors of 5.01, 3.23, and 3.17, respectively.


Figure 4.8: Total size of the bitmaps involved in answering each query type.

Similar results were achieved for the census data set. Indexes were selected without a dimensionality limit and with a cardinality limit of 100. The query execution time using only single attribute bitmaps was 83.01 ms, while the query execution time was only 24.92 ms when both single attribute and multi-attribute bitmaps were available.

Query Execution Over Range Queries. Figure 4.10 shows the impact of using the single-attribute bitmaps instead of the proposed model as the queries trend from being pure point queries to gradually more range-like queries. Average query execution times over 100 queries are shown. The queries are constructed by making the matching range of an attribute either a single value or half of the attribute cardinality; whether an attribute is a point or a range is randomly determined. The x-axis shows the probability that a given pair of attributes in the query represents a range (e.g. 0.1 means that there is a 90% probability that 2 query attributes are both over single values). The graph shows significant speedup for point queries.

Figure 4.9: Query Execution Times, Point Queries, Traditional versus Multi-Attribute Query Processing

Although the speedup ratio decreases as the queries become more range-like, the multi-attribute alternative maintains a performance advantage until nearly all the query attributes are over ranges. This is because there are still some cases where using the multi-attribute bitmap is useful.

Index Size

Figure 4.11 shows bitmap index sizes for each data set. The entire bar represents the space required for both the single and 2-attribute bitmaps. The lower portion of the bar shows the total size of the bitmaps for all of the single attribute indexes. The upper portion of the bar shows the size of the suite of 2-attribute bitmaps (each attribute represented exactly once in the suite).

Figure 4.10: Query Execution Times as Query Ranges Vary, Landsat Data

Even though the multi-attribute bitmaps represent many more columns than the single attribute bitmaps (e.g. the single-attribute Landsat index is covered by 600 bitcolumns, while its 2-attribute counterpart is represented by 3000 bitmaps: 30 attribute pairs, 100 bitmaps per pair), the compression associated with the combined bitmaps means that the space overhead for the multi-attribute representations is not as severe as may be expected.

Figure 4.11: Single and Multi-Attribute Bitmap Index Size Comparison

Bitmap Operations versus Bitmap Lengths

Since query execution enhancement is derived from two sources, reduction in bitmap operations and reduction in bitmap length, it is valuable to know the relative impacts of each one. To this end, we tested query execution after performing index selection using different strategies. Table 4.3 shows the query speedup for point queries using different strategies to determine the attributes that should be combined together. The speedup is measured over 100 random point queries taken from the HEP data. For each listed strategy, a suite of size 2 attribute sets is determined differently. Randomly selected attributes simply combines random pairs of attributes. The other 2 strategies use Algorithm 3 to generate goodness factors for each combination based on bitmap length saved and weighted by result density. The goodness using best greedily selects non-intersecting

pairs starting from the top of the list, while the other technique greedily selects non-overlapping pairs from the bottom. The results show that significant speedup can be achieved simply by combining reasonable attributes together, and that query execution can be further enhanced by careful selection of the indexes.

   Attribute Combination Strategy                           Query Speedup
1  Algorithm 1, Selecting Best Combinations                 3.15
2  Randomly Selected Attributes                             2.79
3  Algorithm 1, Selecting Worst Beneficial Combinations     2.67

Table 4.3: Query Speedup, HEP Data, Using Different Attribute Combination Strategies

4.5 Conclusion

In this paper, we propose the use of multi-attribute bitmap indexes to speed up the query execution time of point and range queries over traditional bitmap indexes. We introduce an index selection technique in order to determine which attributes are appropriate to combine as multi-attribute bitmaps. The technique is driven by a performance metric that is tied to actual query execution time. It takes into account data distribution and attribute correlations, and it can also be tuned to avoid undesirable conditions such as an excessive number of bitmaps or excessive bitmap dimensionality. We introduce a query execution model for when both traditional and multi-attribute bitmaps are available to answer a query. The query execution takes advantage of the multi-attribute bitmaps when it is beneficial to do so while introducing very little overhead for cases when such bitmaps are not useful. Utilizing our index selection

and query execution techniques, we were able to consistently achieve up to 5 times query execution speedup for point queries while introducing less than 1% overhead for those queries where multi-attribute indexing was not useful.

CHAPTER 5

Conclusions

Data retrieval for a given application is characterized by the query types, query patterns over time, and the distribution and characteristics of the data itself. In order to maximize query performance, the query processing and access structures should be tailored to match the characteristics of the application. For such applications, the possible query execution plans evaluated by a data management system are limited by the available indexes.

The optimization query framework presented herein describes an I/O-optimal query processing technique to process any query within the broad class of queries that can be stated as a convex optimization query. While resolving queries, the query processing follows an access structure traversal path that matches the function being optimized while respecting additional problem constraints.

The online high-dimensional index recommendation system provides a cost-based technique to identify access structures that will be beneficial for query processing over a query workload. It considers both the query patterns and the data distribution to determine the benefit of different alternatives. Since workloads can shift over time, a method to determine when the current index set is no longer effective is also provided.

The multi-attribute bitmap index work presents both query processing and index selection for the traditionally single attribute bitmap index. Query processing selects a query execution plan to reduce the overall number of operations, and the index selection task finds attribute combinations that lead to the greatest expected query execution improvement.

The primary end goal of the research presented in this thesis is to improve query execution time. Both the run-time query processing and design-time access methods are targeted as potential sources of query time improvement. By ensuring that the run-time traversal and processing of available access structures matches the characteristics of a given query, and that the access structures available match the overall query workload and data patterns, improved query execution time is achieved.

Kriegel.BIBLIOGRAPHY [1] S. pages 78–86. 1995. Marathe. 2004. and H. Oracle Corp. A. Sprugnoli. In Data Compression Conference. NY. Langley. 2001. and R. New York. and a M. [5] S. A. Koll´r. 2000. In VLDB. Convex Optimization. Keim. Keim. Nashua. Selection of relevant features and examples in machine learning. Vandenberghe. Adaptive Control Processes: A Guided Tour.-P. Cambridge University Press. AI. pages 28–39. [10] S. Bruno and S.. [2] G. B¨hm. Berchtold. Princeton University Press. [6] S. D. 1990. S. [7] A. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. Database Syst. and D. Narasayya. NH. Byte-aligned bitmap compression. pages 499–510. pages 1110–1121. Blum and P. Database tuning advisor for microsoft sql server 2005. The x-tree: An index structure for highdimensional data. 1996. PODS. In Proceedings of the International Conference on Very Large Data Bases. ACM Trans. 1961. 33(3):322–373. A cost model for query processing in high dimensional data spaces. [4] R. Keim. 1997. [9] C. R. Kriegel. and H. C. Bohm. Chaudhuri. Optimal selection of secondary indexes. P. Pinzani. Syamala. Boyd and L. Barucci. ACM Comput. USA. [11] N. [8] C. 1997. Surv. Chaudhuri. Bohm. 136 . 25(2):129–178. Bellman. Agrawal. R. L. V. Berchtold. A. [3] E. In VLDB. A cost model for nearest o neighbor search in high-dimensional data space. D. To tune or not to tune? a lightweight physical design alerter. Berchtold. IEEE Transactions on Software Engineering. Antoshenkov. 2004.. S. 2006.

2000. Ferhatosmanoglu. pages 367–378. Edelstein. D. NY. Adaptive and automated index selection in rdbms. Castelli. pages 146–155. Bergman. C. Agrawal. Data and Knowledge Engineering. Dogac. [20] H. Faster data warehouses. and A. USA. Narasayya. Autoadmin ’what-if’ index analysis utility. Roussopoulos. Chang. Navathe. An efficient cost-driven index selection tool for microsoft SQL server. and R.[12] G. Faloutsos. and A. In SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD international conference on Management of data. Montevideo. Sellis. An automated index selection tool for oracle7: Maestro 7. Y. and S. R. [22] C. Analysis of object oriented spatial access methods.-L. Capara. [17] S. TUBITAK Software Research and Development Center. H. 1993. Technical Report LBNL/PUB-3161. [14] Y. J. Lotem. ACM SIGMOD. [18] R. Costa and S. 1994. In SSDBM. L. 1997. Vienna. Fagin. C. Austria. Optimal aggregation algorithms for middleware. E. Omiecinski. and T. Ikinci.-S. Smith. In XXVIII Latin-American Conference on Informatics (CLIE). Ferhatosmanoglu. In The VLDB Journal. D. Erisik. Lo. A. and M. New York. Blanken. Chang. Frank.-C. Lifschitz. and D. L. 1995. Vector approximation based indexing for non-uniform high dimensional data sets. 1998. Index self-tuning with agent-based databases. 1992. Li. pages 426–439. Information Week. [16] S. [13] A. Update conscious bitmap indices. Exact and approximate algorithms for the index selection problem in physical database design. IEEE Transactions on Knowledge and Data Engineering. Naor. Journal of Computer and System Sciences. 1987. 2002. [24] M. M. [19] A. [21] R. Maio. [23] H. Uruguay. 2003. T. 137 . Tuncel. Narasayya. A. E. The onion technique: indexing for linear optimization queries. Gibas. M. Chaudhuri and V. E. and N. Abbadi. V. ACM Press. Canahuate. In CIKM ’00. M. Fischetti. Choenni. Chaudhuri and V. 2007. [15] S. In Proceedings ACM SIG-MOD Conference. In International Conference on Extending Database Technologhy (EDBT). December 1995. and H. On the selection of secondary indexes in relational databases.

In Symposium on Large Spatial Databases. [30] J. [31] G. L. An introduction to variable and feature selection. In Proceedings of the Int. Dordrecht. pages 518–529. Muthukrishnan. In SIGMOD. P. Bernstein.[25] V. C. 4(2):235–247. Edinburgh. PhD thesis. On the selection of an optimal set of indexes. Lobo. Influence sets based on reverse nearest neighbor queries. [27] M. [29] I.-W. Geist. 2001. Boyd. 2000. Ye. Hjaltason and H. E. [36] R. John. IEEE Transactions on MultiMedia. S. Kohavi and G. [37] F. J.com/informix/corpinfo/.zines/whiteidx. UK. Conf. [28] C. The gc-tree: a high-dimensional index structure for similarity search in image databases. 1998. Sootland. In W. Han. pages 1–12. N. [34] M. Guang-Ho Cha. 05 2000. S. Inc. R. March 2000.. Saxton. Gionis. Stanford University. Pei. and I. Elissef. 2005. Ranking in spatial databases. Portugal. editors.informix. Koudas. ACM Press. on Very Large Data Bases. 30(2):170–231. and Y. Disciplined convex programming. Surv. http://www. The Department of Electrical Engineering. Yin. Chen. Mining frequent patterns without candidate generation.htm. June 2002. and V. and Y. Gaede and O. Kluwer (Nonconvex Optimization and its Applications series). 2003. [26] A. In International Database Engineering & Applications Symposium. [38] M. 1997. Korn and S. 2004. Motwani. AI. 2000 ACM SIGMOD Intl. Autonomous query-driven index tuning. pages 83–95. 1995. In SIGMOD Conference. Journal of Machine Learning Research. Ip. 138 . Indyk. and Y. Coimbra. [32] V. Conference on Management of Data. and R. Schallehn. and P. Gunther. [35] S. Raghavan. Naughton. Grant. PREFER: A system for the efficient execution of multi-parametric ranked queries. Multidimensional access methods. J. Robust and Convex Optimization with Applications in Finance. Similarity searching in high dimensions via hashing. Samet. September 1999. 1983. [33] I. Kai-Uwe. Papakonstantinou. In press. ACM Comput. Hristidis. Wrappers for feature subset selection. A. Informix decision support indexing for the enterprise data warehouse. IEEE Transactions on Software Engineering. Guyon and A.

page 70. Ma. SIAM. May 2000. Texture features for browsing and retrieval of image data. 32(3):16. P.edu/databases/census1990/USCensus1990-desc. Schenkel. & Math. S. and R. volume 2.. S.[39] B. Mao. IEEE Transactions on Pattern Analysis and Machine Intelligence.. Nesterov and A.html. 1995.uci. Multidimensional indexing and query coordination for tertiary storage management. Manjunath and W. L. [49] A. Weikum. VLDB. USA. Database Syst. The Office of Science Data-Management Challenge. Sci. and F. D. [45] S.htm. Inst. Rotem. Interior-Point Polynomial Algorithms in Convex Programming. Nemirovsky. In SIGMOD. DOE Office of Advanced Scientific Computing Research. Nemirovsky. 139 . A general approach to polynomial-time algorithms design for convex programming. Bernardo. Hersch. Sim. [41] R. Nordberg. Repository. Top-k query evaluation with probabilistic guarantees. S. 1994. 1999. pages 21–30. USSR. [42] Y. In Indexing Goes a New Direction. 1999. Econ. J. The us census1990 data http://kdd. 2000. Y. 2002. March–May 2004.ece.ics. G. and R. Roussopoulos. Mount. [51] M. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. R. [48] N. 18(8):837–842. Han. [40] B.. P. USSR Acad. Manjunath. Theobald. Vincent. Airphoto dataset. 2004. [43] Y. Technical report. M.. PA 19101. H. on Mass Storage Systems and Technologies. Moscow. Kelly. ACM Trans. Nearest neighbor queries. Nesterov and A. and R. CLOSET: An efficient algorithm for mining frequent closed itemsets.ucsb. In Proceedings of the 19th IEEE Symposium on Mass Storage Systems and Tenth Goddard Conf. M. [50] R. Centr. Ponce. Multi-resolution bitmap indexes for scientific data. 2007. Ramakrishna. Winslett. and A. Sinha and M. set. Indexing and selection of data items in huge data sets by constructing and accessing tag collections. 1988. [46] M. [44] J. http://vivaldi. August 1996. Shoshani. Pei. Vila. In Proceedings of the 11th International Conference on Scientific and Statistical Data(SSDBM). Philadelphia. [47] U. L.edu/Manjunath/research.

Blott. K. and S. 1998. and A. Shoshani. Shoshani. Wang. D. 2003. Index selection in relational databases. and A. In Proceedings of SSDBM. Valentin. C. In International Conference on Foundations on Data Organization (FODO). E. G. Otoo. W. In SIGMOD. Shoshani. Compressing bitmap indexes for faster search operations. Zuliani. [58] K.. and A.-J. and A. Lohman.-C. Chang. On the performance of bitmap indices for high cardinality attributes. E. Chang. Optimizing bitmap indexes with efficient compression. Schek. H. Weber. 1985. [60] Z. Blott. S. In Proceedings International Conference Data Engineering. Db2 advisor: An optimizer smart enough to recommend its own indexes. Kyoto. [57] K.[52] G. In SSDBM. Wu. pages 194–205. 2006. Whang. March 2004. [59] K. Shoshani. E. H. 2002. ACM Transactions on Database Systems. [56] K. Wu. Schek. Boolean + ranking: Querying a database by k-constrained optimization. Skelley. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. and S. Using bitmap index for interactive exploration of large datasets. Wu. C.-J. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. M. Lawrence Berkeley National Laboratory. Weber. J. [55] K. In VLDB. 1998. Chen. Paper LBNL54673. Hwang. 2006. In Proceedings of the 24th International Conference on Very Large Databases. J. Zhang. Wu. Koegler. Japan. 2000. Otoo. and Y. [53] R. Otoo. Lang. Zilio. [54] R. M. and A. 140 .
