Lecture Notes in Computer Science 525
Edited by G. Goos and J. Hartmanis
Advisory Board: W. Brauer, D. Gries, J. Stoer

O. Günther, H.-J. Schek (Eds.)

Advances in Spatial Databases
2nd Symposium, SSD '91
Zurich, Switzerland, August 28-30, 1991
Proceedings

Springer-Verlag Berlin Heidelberg New York London Paris Tokyo Hong Kong Barcelona Budapest

Series Editors: Gerhard Goos, GMD Forschungsstelle, Universität Karlsruhe, Vincenz-Priessnitz-Straße 1, W-7500 Karlsruhe, FRG; Juris Hartmanis, Department of Computer Science, Cornell University, Upson Hall, Ithaca, NY 14853, USA

Volume Editors: Oliver Günther, Forschungsinstitut für anwendungsorientierte Wissensverarbeitung (FAW), Universität Ulm, Postfach 2060, W-7900 Ulm, FRG; Hans-Jörg Schek, Institut für Informationssysteme, ETH Zürich, ETH-Zentrum, CH-8092 Zürich, Switzerland

CR Subject Classification (1991): A.0, E.1-2, E.5, F.2.2, H.2.8, I.2.1, I.3.5

ISBN 3-540-54414-3 Springer-Verlag Berlin Heidelberg New York
ISBN 0-387-54414-3 Springer-Verlag New York Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data banks. Duplication of this publication or parts thereof is only permitted under the provisions of the German Copyright Law of September 9, 1965, in its current version, and a copyright fee must always be paid. Violations fall under the prosecution act of the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1991
Printed in Germany
Typesetting: Camera ready by author
Printing and binding: Druckhaus Beltz, Hemsbach/Bergstr.
Printed on acid-free paper

Preface

Two years after the First Symposium on the Design and Implementation of Large Spatial Databases (SSD '89), which was held in Santa Barbara, California, the range of spatial database applications has broadened considerably. Whereas traditionally the focus has been on computer-aided design and solid modeling, one can now observe a rapid increase in geographic applications. The market for geographic information systems (GIS) is growing steadily, and so are the demands for more efficient spatial data management capabilities.

The reasons for these new developments are manifold. First, the efficient management of geographic maps in computers has become reality only very recently, due to significant progress in hardware technology. The availability of up to 100 Mips and 100 megabytes of main memory in a desktop computer, enhanced with optical disks for mass storage, seemed quite unrealistic just a few years ago. Without these powerful workstations, the interactive display and manipulation of geographic information, which is essential for most applications in the field, would not be possible today.

Second, the management of spatial data has been facilitated considerably by important research results in the areas of database management, computational geometry, and computer graphics. The technology of extensible databases, which allows the integration of abstract data types and user-defined operations into a database system, and the availability of efficient algorithms for geometric operations such as polygon overlay, spatial search, or hidden line elimination are building blocks that are essential for efficient spatial data management software.

Third, there is an increasing political motivation for having powerful geographic information systems, due to a growing public awareness of environmental protection.
Since the late 1960s and the 1970s, when the preservation of the environment first became a major issue, many local governments have made significant progress regarding the collection of environmental data. Measuring networks have been installed to obtain more data about the air, the water, and the ground. Satellite images and aerial photographs will soon deliver terabytes of image data per day. The problem today is therefore not a lack of data but a lack of efficient systems to manage these huge amounts of data and to transform them into information that really helps the decision maker. Spatial databases, in connection with sophisticated information retrieval tools and fast display devices, are one step towards such an efficient environmental data management.

Once the need for further spatial database research has been established, the question arises: what's next? Whereas in recent years a lot of emphasis has been put on physical aspects, there now seems to be an increasing interest in logical matters. On the one hand, the integration of object-oriented techniques into spatial data management has led to some interesting new data models and even to some products that have entered the market recently. Especially the modeling of complex geo-objects using structural object-orientation and the investigation of related query optimization issues seem to be promising topics for further research. On the other hand, it may also prove important to integrate deductive capabilities into spatial databases in order to allow for more powerful query languages and optimization techniques and for more sophisticated spatial reasoning capabilities. Many related research results from the artificial intelligence and deductive database communities have come to the attention of spatial database researchers, and vice versa, which may also lead to some interesting new work.
All this new emphasis on logical aspects of spatial data management does not mean, however, that the problems concerning physical data management are solved. At this point, the huge amounts of data that are anticipated for the late 1990s exceed the capabilities of the most efficient data managers available to date. The development of new techniques for the integration of archival storage devices and for incremental backups and checkpoints is a necessity for the years to come.

It is our hope that the Second Symposium on Large Spatial Databases (SSD '91), which will be held August 28-30, 1991, in Zürich, will help to create new ideas and to found new cooperations between interested researchers. We would like to thank Springer-Verlag for their interest in publishing this proceedings volume; as always, their support and cooperation were excellent. Thanks to the associations GI, SI, ACM, and IEEE for their support, to the program committee members for returning their reviews under great time pressure, and to the authors of all submitted papers for their interest in SSD '91. Thanks also to Hans Hinterberger, Claudia Schmid, and Antoinette Forster for taking care of our local arrangements, and to Carmen Stebisch and Christine Ziegler for administrative support.
Ulm and Zürich, June 1991
The Editors

General Chair: Hans-Jörg Schek, ETH Zürich, Switzerland
Co-Chairmen: Kurt Brassel, University of Zürich, Switzerland; Max Egenhofer, University of Maine, Orono, ME, USA; Ron Sacks-Davis, RMIT, Melbourne, Victoria, Australia; Heinz Schweppe, Free University of Berlin, Germany; Helmut Thoma, Ciba-Geigy, Basel, Switzerland
Program Chairman: Oliver Günther, FAW Ulm, Germany

Program Committee:
Dave Abel, CSIRO, Canberra, Australia
Kurt Brassel, University of Zürich, Switzerland
Alex Buchmann, University of Darmstadt, Germany
Peter Dadam, University of Ulm, Germany
Umesh Dayal, DEC, Cambridge, MA, USA
Max Egenhofer, University of Maine, Orono, ME, USA
Hans-Dieter Ehrich, University of Braunschweig, Germany
Christos Faloutsos, University of Maryland, College Park, MD, USA
Andrew Frank, University of Maine, Orono, ME, USA
Mike Freeston, ECRC, München, Germany
Michael Goodchild, University of California, Santa Barbara, USA
Ralf H. Güting, Fernuniversität Hagen, Germany
Klaus Hinrichs, University of Siegen, Germany
Alfons Kemper, University of Aachen, Germany
Hans-Peter Kriegel, University of München, Germany
Witold Litwin, University of Paris 9, France
Peter Lockemann, University of Karlsruhe, Germany
Raymond Lorie, IBM Almaden, San Jose, CA, USA
Frank Manola, GTE Labs, Waltham, MA, USA
Andreas Meier, SBV, Basel, Switzerland
Dave McKeown, Carnegie-Mellon University, Pittsburgh, PA, USA
Scott Morehouse, ESRI, Redlands, CA, USA
Jürg Nievergelt, ETH Zürich, Switzerland
Hartmut Noltemeier, University of Würzburg, Germany
Jack Orenstein, Object Design, USA
Thomas Ottmann, University of Freiburg, Germany
Andreas Reuter, University of Stuttgart, Germany
Doron Rotem, Lawrence Berkeley Laboratory, Berkeley, CA, USA
Ron Sacks-Davis, RMIT, Melbourne, Victoria, Australia
Hanan Samet, University of Maryland, College Park, MD, USA
Matthäus Schilcher, Siemens Center of GIS, München, Germany
Timos Sellis, University of Maryland, College Park, MD, USA
Terence R. Smith, University of California, Santa Barbara, USA
Franz Steidler, STI Strässle, Glattbrugg, Switzerland
Yuan-F. Wang, University of California, Santa Barbara, USA
Marvin White, ETAK, Menlo Park, CA, USA
Peter Widmayer, University of Freiburg, Germany
Paul Wilms, IBM Almaden, San Jose, CA, USA

Contents

Geometric Algorithms

Geometric Algorithms and Their Complexity
Th. Ottmann (Invited Speaker), University of Freiburg, Germany

The Combination of Spatial Access Methods and Computational Geometry in Geographic Database Systems
H.-P. Kriegel, Th. Brinkhoff, R. Schneider, University of München, Germany ... 5

FI-Quadtree: A New Data Structure for Content-Oriented Retrieval and Fuzzy Search
J. P. Cheiney, A. Touir, Ecole Nat. Supér. des Télécommunications, Paris, France

Meta-Knowledge and Data Models

The Importance of Metaknowledge for Environmental Information Systems
F. J. Radermacher (Invited Speaker), FAW Ulm, Germany ... 35

An Extended Object-Oriented Data Model for Large Image Bases
A. Gupta, T. E. Weymouth, R. Jain, University of Michigan, Ann Arbor, MI, USA

On the Integration of GIS and Remotely Sensed Data: Towards an Integrated System to Handle the Large Volume of Spatial Data
Q. Zhou, B. J. Garner, University of New South Wales, Kensington, Australia

Query Languages and User Interfaces

Towards a Toolbox for Geographic User Interfaces
A. Voisard, INRIA, Le Chesnay, France ... 75

The Management of the Ambiguities in a Graphical Query Language for Geographical Information Systems
D. Calcinelli, M. Mainguenaud, France Télécom, Evry, France ... 99

Geo-SAL: A Query Language for Spatial Data Analysis
P. Svensson, H. Zhexue, National Defence Research Est., Sundbyberg, Sweden

Topology and Reasoning

Reasoning About Binary Topological Relations
M. J. Egenhofer, University of Maine, Orono, ME, USA ... 143

Topological Constraints: A Representational Framework for Approximative Spatial and Temporal Reasoning
S. Dutta, INSEAD, Fontainebleau, France ... 161

Access Methods

Grow and Post Index Trees: Role, Techniques and Future Potential
D. B. Lomet (Invited Speaker), DEC CRL, Cambridge, MA, USA ... 183

The Spatial Locality and a Spatial Indexing Method by Dynamic Clustering in Hypermap Systems
K. J. Li, R. Laurini, INSA, Villeurbanne, France ... 207

Properties of Geographic Data: Requirements for Spatial Access Methods
A. U. Frank, University of Maine, Orono, ME, USA

Query Processing

Efficient Processing of Spatial Queries in Line Segment Databases
E. G. Hoel, Bureau of the Census, Washington, DC, USA; H. Samet, University of Maryland, College Park, MD, USA ... 237

The Performance of Object Decomposition Techniques for Spatial Query Processing
H.-P. Kriegel, H. Horn, M. Schiwietz, University of München, Germany ... 257

Performance Comparison of Segment Access Methods Implemented on Top of the Buddy-Tree
B. Seeger, University of Waterloo, Waterloo, Ontario, Canada ... 277

Spatial Operations and Query Languages

Extending a DBMS with Spatial Operations
W. G. Aref, H. Samet, University of Maryland, College Park, MD, USA ... 299

The Use of a Complex Object Language in Geographic Data Management
R. A. Lorie, IBM ARC, San Jose, CA, USA

Motion in a Geographical Database System
S. Shekhar, T. A. Yang, University of Minnesota, Minneapolis, MN, USA

Index and Storage Management

A Spatial Index for Convex Simplicial Complexes in d Dimensions
G. Vanecek, Jr., Purdue University, West Lafayette, IN, USA; V. Ferrucci, Università "La Sapienza", Roma, Italy

An Algorithm for Computing the Overlay of k-Dimensional Spaces
J. Orenstein, Object Design, Burlington, MA, USA

Encoding and Manipulating Pictorial Data with S+-Trees
W. de Jong, A. Schijf, Vrije Universiteit, Amsterdam, The Netherlands; P. Scheuermann, Northwestern University, Evanston, IL, USA

GIS and Database Systems

Exploiting Extensible DBMS in Integrated Geographic Information Systems
L. M. Haas (Invited Speaker), W. F. Cody, IBM ARC, San Jose, CA, USA

Storage Management in Geographic Information Systems
H. Lu, B. C. Ooi, A. D'Souza, C. C. Low, National University of Singapore

Panel: Why Does GIS Industry Ignore University Research?
H. Samet (Chair), University of Maryland, MD, USA ... 361

Geometric Algorithms

Geometric Algorithms and their Complexity

Thomas Ottmann*

For more than 10 years, algorithmic and combinatorial aspects of geometric problems have attracted ever-increasing interest. The driving force behind this development were new applications in spatial databases, computer graphics, VLSI design, robotics, and many more. A new area of mainly theoretical work has emerged and matured in the meantime: computational geometry. It is focused on the design and analysis of efficient algorithms to solve particular geometric problems, using intricate data structures and algorithm design techniques tuned to geometry. Literally thousands of new algorithms and data structures have been developed. Thus, it is impossible to give an exhaustive or at least representative survey of this area.
Instead, we choose a particular class of problems, the hidden line elimination and the hidden surface removal problem (HLE- and HSR-problem) occurring in computer graphics, in order to explain the interplay between choosing data structures for storing spatial objects in two- and three-dimensional space, and algorithms which access the data to solve the given problems. Algorithms to solve the HLE- and HSR-problem may use the image-space or the object-space approach. Using the first approach, one computes for each pixel in the viewing plane which object is visible at that pixel. In this talk we restrict ourselves to the object-space approach, which is independent of the screen resolution used for display. Early object-space algorithms solved the problems by projecting all the edges of the given objects onto the viewing plane, computing all intersections, and deciding which of the intersecting edges were nearest to the observer. The obtained algorithms are intersection-sensitive. Those algorithms do not detect situations like the one where a large wall hides a complicated scene. Therefore, one is interested in output-sensitive solutions to the HLE- and HSR-problem. If only very little of a three-dimensional object is visible, the display algorithm should be fast; if the visible silhouette of the displayed object consists of many parts, the algorithm may take more time to display the scene. So far no general output-sensitive solutions to the HLE- and HSR-problem are known. Therefore, a number of special cases have been studied which are obtained by restricting the scenes or making special assumptions which facilitate the computation. A special restriction is this: view from infinity along the z-axis onto a collection of iso-oriented rectangles in x-y-z-space, that is, rectangles whose sides are parallel to the x- and y-axes.
The restriction can then be weakened to scenes having faces and edges in only c orientations, for some constant c, as seen from a given viewpoint. If there is a non-cyclic depth-order of the given objects available, the algorithm may use a dynamic contour-maintenance approach. That is, the faces are processed in depth-order one after the other, starting with the one nearest to the observer. At each stage, the currently visible contour is maintained. Whenever a new face is encountered, one determines whether it is completely contained in the previous contour, whether the previous contour is contained in it, or whether the previous contour and the silhouette of the encountered face intersect. In each case, the currently visible contour is updated appropriately. Thus, it should be clear that HLE can be viewed as the task of dynamically maintaining a structure storing the (currently visible) contour. HSR can be treated analogously. Different structures storing the contour (in the case of the HLE-problem) and storing the visibility map (in the case of the HSR-problem) lead to different algorithms. We review standard structures like segment trees, range trees, segment-segment trees, etc., which store contours and visibility maps, and also newly proposed structures which support visibility queries, ray shooting queries, and other operations useful in this context. This paradigmatic view should give insight into geometric data structures, algorithms, and their complexity.

* University of Freiburg, Institut für Informatik, Rheinstr. 10-12, W-7800 Freiburg, Germany

The Combination of Spatial Access Methods and Computational Geometry in Geographic Database Systems*+

Hans-Peter Kriegel, Thomas Brinkhoff, Ralf Schneider
Institut für Informatik, Universität München, Leopoldstr.
11, D-8000 München 40, Germany

Abstract

Geographic database systems, known as geographic information systems (GISs) particularly among non-computer scientists, are one of the most important applications of the very active research area named spatial database systems. Consequently, following the database approach, a GIS has to be seamless, i.e. store the complete area of interest (e.g. the whole world) in one database map. For exhibiting acceptable performance, a seamless GIS has to use spatial access methods. Due to the complexity of query and analysis operations on geographic objects, state-of-the-art computational geometry concepts have to be used in implementing these operations. In this paper, we present GIS operations based on the computational geometry technique plane sweep. Specifically, we show how the two ingredients spatial access methods and computational geometry concepts can be combined for improving the performance of GIS operations. The fruitfulness of this combination is based on the fact that spatial access methods efficiently provide the data at the time when computational geometry algorithms need it for processing. Additionally, this combination avoids page faults and facilitates the parallelization of the algorithms.

1 Introduction

Geographic database systems, also known as geographic information systems (GISs), are one of the most important applications of spatial database systems. Basically, they consist of two parts: first, components to query and manipulate geographical data and, second, components to manage and store the data. However, the main purpose of a GIS is to analyze geographical data. GIS algorithms presented in the past assume that the maps are kept in main memory or in sequential files on secondary storage. The following two important requirements of future GISs demand new approaches: First, the database system of a GIS must be able to manage very large volumes of data.
The large amount of data (in the order of giga- and terabytes) is additionally increased by pursuing the goal to manage scaleless and seamless databases [Oos 90]. Second, the database system has to support spatial access to parts of the database, such as maps, and to the objects of a map. Such access is a necessary condition for efficient query and manipulation processing. Pursuing these goals, we want to take advantage of spatial access methods (SAMs). In the past few years many access methods were developed which allow to organize large sets of spatial objects on secondary storage. There are three basic techniques which extend multidimensional point access methods (PAMs) to multidimensional spatial access methods [SK 88]: clipping, overlapping regions, and transformation. Point access methods such as the grid file [NHS 84], PLOP-hashing [KS 88], the BANG file [Fre 87], and the buddy tree [SK 90] can be extended by these techniques. Additionally, there are access methods which are designed for managing simple spatial objects directly. They use one of the above techniques inherently; e.g. the R-tree [Gut 84] and the R*-tree [BKSS 90] use overlapping regions, and the cell tree [Gün 89] uses clipping. An excellent survey of such access methods is given in [Sam 89]. The use of SAMs as an ingredient in GISs is absolutely necessary to guarantee good retrieval and manipulation performance, in particular for large maps. The use of SAMs enables us to perform operations only on relevant parts of seamless databases. GIS operations on maps modelled by a vector-based representation are often very time-intensive. Therefore the use of state-of-the-art computational geometry algorithms as a second step of performance improvement is straightforward [Nie 89].

+ This work was supported by grant no. Kr 670/4-3 from the Deutsche Forschungsgemeinschaft (German Research Society) and by the Ministry of Environmental and Urban Planning of Bremen.
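The transformation technique mentioned above can be sketched in a few lines. In this Python fragment (our illustration, not from the paper; the rectangle data and function names are invented for the example), each 2-d rectangle is stored as the 4-d point (xlo, ylo, xhi, yhi), so that a window-intersection query becomes a range query over the 4-d points. A plain list stands in for the point access method (e.g. grid file or buddy tree) a real system would use.

```python
# Sketch of the "transformation" technique: a 2-d rectangle becomes a
# 4-d point, and window intersection becomes a 4-d range condition.

def to_point(rect):
    """Store the rectangle (xlo, ylo, xhi, yhi) as a 4-d point."""
    return tuple(rect)

def hits_window(point, window):
    """The rectangle intersects the window iff its 4-d point satisfies
    xlo <= wxhi, xhi >= wxlo, ylo <= wyhi, yhi >= wylo."""
    xlo, ylo, xhi, yhi = point
    wxlo, wylo, wxhi, wyhi = window
    return xlo <= wxhi and xhi >= wxlo and ylo <= wyhi and yhi >= wylo

points = [to_point(r) for r in [(0, 0, 2, 2), (5, 5, 7, 8), (8, 1, 9, 2)]]
print([p for p in points if hits_window(p, (1, 1, 6, 6))])
# -> [(0, 0, 2, 2), (5, 5, 7, 8)]
```

A point access method then answers the window query by scanning only the buckets whose 4-d regions overlap the query range, rather than the whole list.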
In [KBS 91] we have shown in detail that the performance of the operation map overlay - an important and often used analysis operation in a GIS - can be considerably improved by applying the computational geometry technique 'plane sweep'. The basic approach of this paper is to partition the seamless databases using SAMs according to the requirements of the GIS operations. Then state-of-the-art computational geometry algorithms are performed on these partitions and the results are combined in order to increase the overall performance of the GIS operations. Thus we combine spatial access methods and computational geometry in order to improve the efficiency of GISs. The combination of these two areas is based on the fact that both use spatial order relations.

The next section describes seamless, vector-based databases in GISs. How the efficiency of a GIS is increased by using computational geometry algorithms is shown in section 3. The coupling of spatial access methods and the plane-sweep technique is presented in section 4. An approach to parallelize plane-sweep algorithms follows in section 5. The paper concludes with a summary and an outlook to future work.

2 Seamless vector-based databases in GISs

One important requirement of future GISs is the efficient management of so-called seamless spatial databases [Oos 90]. A database is seamless if it does not store sets of map sheets describing only particular small parts of the database; instead, the whole area managed by the GIS (e.g. the whole world) is stored in one database map. For analysis the user can select any area of interest by a window query. An example is shown in figure 1. This window contains the map which is of further interest to the user. Queries to and manipulations of objects of this map need access to the whole database, which is in the order of giga- and terabytes. Therefore, the database system of the GIS must be able to support efficient access to any parts of the data on secondary storage.
A GIS is based on two types of data [Bur 86]: spatial and thematic data. Thematic data is alphanumeric data related to geographic objects, e.g. the degree of soil pollution. Spatial data has two different properties: (1) geometric properties such as spatial location, size, and shape of spatial objects, and (2) topological properties such as connectivity, adjacency, and inclusion. Topological data can be stored explicitly or can be derived from geometric data.

Figure 1: Window query selecting a part of the spatial database

There exist two models for spatial data: vector and raster representations. We consider in this paper only maps modelled by a vector representation because there are two main disadvantages of raster representations [Oos 90]: (1) Raster data depends on a specific projection. Therefore, there are problems when combining raster maps from different sources. A scaleless database cannot be realized using a raster representation. (2) Objects in raster maps generally are not handled individually. Thus, support by access methods is more difficult. Additionally, raster data are more voluminous.

In this paper the term map is used for thematic maps. Those emphasize one or more selected topics, e.g. land utilization, population density, etc. Thematic maps are generally represented by choropleth maps which separate areas of different properties by boundaries [Bur 86], e.g. forests, lakes, roads, or agriculturally used areas (see figure 2).

Figure 2: Example of a thematic map (legend: forest, house, road, grain, corn, barley)

We assume that the connected areas with the same property are described by simple polygons with holes, and that the used data structures are able to handle such polygons explicitly [KHHSS 91a]. A polygon is simple if there is no pair of nonconsecutive edges sharing a point. A simple polygon with holes is a simple polygon where simple polygonal holes may be cut out (see figure 3). There may be other areas in such a hole.
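The simplicity condition just stated can be tested directly, if inefficiently, by checking every pair of nonconsecutive edges for a common point. The following Python sketch (ours, not the paper's; the helper names are invented) does exactly that in O(n^2) time, using the standard orientation predicate:

```python
# Brute-force simplicity test: a polygon is simple iff no pair of
# nonconsecutive edges shares a point.

def segs_share_point(p1, p2, p3, p4):
    """True if the closed segments p1p2 and p3p4 have a common point."""
    def orient(a, b, c):     # sign of the cross product (b-a) x (c-a)
        v = (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])
        return (v > 0) - (v < 0)
    def on_seg(a, b, c):     # c collinear with a, b: does c lie on ab?
        return min(a[0], b[0]) <= c[0] <= max(a[0], b[0]) and \
               min(a[1], b[1]) <= c[1] <= max(a[1], b[1])
    o1, o2 = orient(p1, p2, p3), orient(p1, p2, p4)
    o3, o4 = orient(p3, p4, p1), orient(p3, p4, p2)
    if o1 != o2 and o3 != o4:
        return True          # proper crossing
    return (o1 == 0 and on_seg(p1, p2, p3)) or (o2 == 0 and on_seg(p1, p2, p4)) or \
           (o3 == 0 and on_seg(p3, p4, p1)) or (o4 == 0 and on_seg(p3, p4, p2))

def is_simple(poly):
    """poly: list of vertices of a closed polygon, in order."""
    n = len(poly)
    edges = [(poly[i], poly[(i+1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if j == i + 1 or (i == 0 and j == n - 1):
                continue     # consecutive edges legitimately share a vertex
            if segs_share_point(*edges[i], *edges[j]):
                return False
    return True
```

For example, `is_simple([(0, 0), (4, 0), (4, 4), (0, 4)])` holds for the square, while the self-crossing "bowtie" `[(0, 0), (4, 4), (4, 0), (0, 4)]` fails the test.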
The areas of a map are disjoint, but they do not need to cover the map completely. Each area refers to exactly one thematic attribute. In figure 2 these characteristics of a thematic map are depicted by an example which visualizes the land utilization of a part of a map.

Figure 3: Different polygons (simple polygon, non-simple polygon, simple polygon with holes)

Below, a formal definition of a thematic map M is presented, where ∩* denotes the regularized intersection [Til 80] and T is the set of values of the thematic attributes of M:

M := { t = (t.P, t.A) | t.P is a simple polygon with holes, t.A ∈ T }, where t1 ∈ M, t2 ∈ M, t1 ≠ t2 implies t1.P ∩* t2.P = ∅

Maps of different topics describing the same part of the world are called map layers.

3 Increasing the performance of a GIS using computational geometry

Efficient algorithms typically use general techniques such as divide-and-conquer or recursion. For algorithms solving computational geometry problems, the algorithmic technique called plane sweep has proven to be very efficient. In this section we apply this technique to operations in GISs and examine the performance and robustness of such an approach.

3.1 The plane-sweep technique

An algorithm working in the area of GIS should define and utilize an order relation on the objects in the plane to enable a spatial partition of the input maps. Plane sweep is a technique of computational geometry which fulfills this demand [PS 88]: significant points of the objects (event points) are projected onto the x-axis and are processed according to the order relation on this axis. Event points are stored in a queue called the event point schedule. If event points are computed during processing, the event point schedule must be able to insert event points after initialization. A vertical line sweeps the plane according to the event points from left to right. This line is called the sweep line. The state of the plane at the sweep line position is recorded in vertical order in a table called the sweep line status.
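The interplay of schedule and status can be sketched as follows. This Python fragment (our illustration, not the paper's overlay algorithm) keeps the event point schedule as a priority queue of segment start and end events, and the sweep line status as the set of segments the sweep line currently cuts. For brevity it merely reports pairs of segments that are active at the same time; the overlay algorithm additionally keeps the status in vertical order and computes actual intersections there.

```python
import heapq

# Minimal plane-sweep skeleton: event point schedule + sweep line status.

def sweep_candidates(segments):
    """segments: list of ((x1, y1), (x2, y2)). Returns index pairs of
    segments that are simultaneously cut by the sweep line."""
    events = []                              # event point schedule
    for i, (p, q) in enumerate(segments):
        (x1, _), (x2, _) = sorted([p, q])
        heapq.heappush(events, (x1, 0, i))   # kind 0: segment starts
        heapq.heappush(events, (x2, 1, i))   # kind 1: segment ends
    status, pairs = set(), []                # sweep line status
    while events:                            # sweep from left to right
        _, kind, i = heapq.heappop(events)
        if kind == 0:                        # segment enters the status
            pairs.extend((min(i, j), max(i, j)) for j in status)
            status.add(i)
        else:                                # segment leaves the status
            status.discard(i)
    return pairs

segs = [((0, 0), (4, 4)), ((1, 5), (3, 1)), ((6, 0), (8, 2))]
print(sweep_candidates(segs))                # -> [(0, 1)]
```

Segments 0 and 1 share an x-range and are reported as a candidate pair, while segment 2 enters the status only after the first two have left it.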
The sweep line status is updated when the sweep line reaches an event point. Event points which are passed by the sweep line are deleted from the event point schedule. Figure 4 depicts an example of the event point schedule and the sweep line status.

Figure 4: Example of a plane sweep (the event point schedule contains the start and end points of the line segments which are not yet passed, ordered by x-coordinate; the sweep line status contains the line segments which intersect the sweep line at its current position)

3.2 Applications of plane-sweep algorithms in a GIS

The map overlay

One of the most important operations in a GIS is the map overlay. It combines two or more input maps of different topics into a single new output map. The combination of the thematic attributes or of geometric or topological properties of the input areas is controlled by an overlay function f, where f is defined or selected by the user of the GIS. The goals are to derive new maps, to find correlations between the information encoded in maps, and to process complex queries. C. D. Tomlin's map analysis package (MAP) [Tom 90] is completely based on the map overlay operation.

We want to illustrate the overlay operation by an example. Figure 5 depicts two input maps, 'land utilization' and 'soil pollution'. In the output map all areas should be reported which are forests or agriculturally used land and where the degree of soil pollution is greater than 2.

Figure 5: Example of a map overlay and an overlay function (legend: degree of soil pollution = 2, degree of soil pollution = 3, agriculturally used land, forest and soil pollution ≥ 2, agriculturally used land and soil pollution ≥ 2)

In [KBS 91] we presented in detail an overlay algorithm which was based on the plane-sweep technique. This algorithm is called plane-sweep overlay.

The merge algorithm

Plane-sweep algorithms can be used for further problems in a GIS.
The merge operation is one of them, which is closely related to the map overlay [Fra 87]: Its purpose is to merge neighboring areas in one map representing the same thematic attribute (see figure 6). For example, such maps may result from a classification of the attributes or from an overlay with a non-injective overlay function. The neighboring areas with identical attributes can be merged by a plane-sweep algorithm similar to the plane-sweep overlay algorithm. The merge algorithm does not insert edges which separate areas with identical attributes into the sweep line status. Thus the resulting polygons describe the merged areas.

Figure 6: Merging neighboring areas with identical thematic attributes

Geometric computation of polygons from a set of line segments

Another application of plane-sweep algorithms is the following operation: Given a planar graph by a set of line segments, generate the areas limited by these line segments. This operation is needed, for example, to perform a geometric conversion of spatial data between different geographic information systems. Our implementation of this operation is based on the implementation of the plane-sweep overlay in [KBS 91]. Necessary modifications are an adaption of the intersection treatment and a new calculation of the thematic attributes.

3.3 Performance analysis

In this section we examine the performance of plane-sweep algorithms in an experimental framework. Because the map overlay is the most costly operation of the algorithms mentioned above, we investigated the plane-sweep overlay in the following. The principal results are also valid for the other operations if we consider that those algorithms need not compute intersections. Let n be the total number of edges of all polygons and k be the total number of intersection points of all edges.
In [KBS 91] we showed that the worst-case performance of the plane-sweep overlay is t(n,k) = O((n+k) * log(n)) (under the assumption that the number of edges attached to one event point is limited by a constant).

We implemented the plane-sweep overlay algorithm in Modula-2. To examine the performance experimentally, we ran tests on a SUN workstation 3/60 under UNIX. We used an 8-byte floating-point representation which was supported by the hardware and system software. The implementation was developed to demonstrate the properties of the plane-sweep algorithm, but it was not tuned for speed. Consequently, there is scope to speed up the overlay.

We performed four test series between two input maps. The maps consist of (1) a regular net of areas to get a constant proportion p of k / n, (2) areas covering the map completely which are generated by a tool, (3) tool-generated areas covering only 50 per cent of the map, and (4) real data. Test series 1 was performed with different proportions p. In test series 4a two maps of administrative divisions of Italy were overlaid where one was translated by an offset. In test series 4b the state frontier of Uganda and lakes nearby were overlaid. Typical input maps of the series are depicted in figure 7.

Figure 7: Input maps of the test series (test series 1, test series 2, test series 4a: Italy, test series 4b: Uganda state frontier and lakes)

The results of test series 1 are shown in table 1. t is the CPU time in sec which is needed to perform the overlay. Additionally, we want to determine the constant c of the map overlay algorithm hidden by the O-notation (c = t / (n * ln n)).
[Table 1 is garbled in the scan; it lists, for series 1a to 1c (with different proportions p, e.g. p = 0.033 for series 1c), the number of edges n, the CPU time t in sec, and the constant c in msec.]

Table 1: Results of the test series 1

The test series of table 1 demonstrate how the constant depends on the number of intersection points. An analysis of these tests results in the following function:

t(n,k) = c' * (n + 1.75 * k) * ln(n)

The value of c' in the test series 1a to 1c is approximately 1.05 msec. We would like to emphasize that this constant is very small with respect to performance criteria. In table 2 the test series 2 and 3 are depicted, where c' is calculated additionally.

[Table 2 is garbled in the scan; it lists, for series 2 and 3, the number of edges n, the proportion p, t in sec, c in msec, and c' in msec.]

Table 2: Results of the test series 2 and 3

Series 2 and 3 demonstrate another dependency of the running time: if a large number of edges of the polygons coincide (see figure 7), the running time decreases. The reason is that the algorithm detects such edges and combines them. Table 3 depicts the results of test series 4 with files of real data. In series 4a (Italy) varying administrative divisions and in series 4b (Uganda) varying resolutions are considered.

[Table 3 is garbled in the scan; it lists, for the administrative divisions of Italy (state, groups of regions, regions, provinces) and for Uganda (p < 0.004), n, p, t in sec, c in msec, and c' in msec.]

Table 3: Results of the test series 4

The results of test series 4 demonstrate the validity of the previous results for real data.
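The fitted cost function can be written down directly. The sketch below is illustrative Python, not the paper's Modula-2 implementation; the function name and the way the constant is passed are our own:

```python
import math

# Empirical cost model of the plane-sweep overlay fitted from test series
# 1a-1c:  t(n, k) = c' * (n + 1.75 * k) * ln(n),  with c' ~ 1.05 msec.
# The constant depends on the machine (here a SUN 3/60) and on the data.
C_PRIME_MSEC = 1.05

def overlay_cpu_time_sec(n, k, c_prime_msec=C_PRIME_MSEC):
    """Predicted CPU time in seconds for n edges and k intersection points."""
    return c_prime_msec * 1e-3 * (n + 1.75 * k) * math.log(n)

print(overlay_cpu_time_sec(25920, 0))   # a map with 25920 edges, no intersections
```

The factor 1.75 weights each intersection point against a plain edge; more intersections (a larger proportion p = k/n) therefore raise the predicted time linearly.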
3.4 A suitable coordinate representation for plane-sweep algorithms

An objection often raised against plane-sweep algorithms is their instability with respect to numerical errors. This reproach may be justified if a floating point representation is used to compute the intersection points. However, rational coordinates are a more suitable representation because they form a vector space. For example, a rational representation of coordinates is used in an implementation of a map overlay in [FW 87]. A more detailed analysis of such a representation leads to the following statements:

1. The coordinates in maps recorded by a GIS can be represented by pairs of integers. This assumption is realistic because both the described part of the world and the resolution are limited.

2. To compute intersection points, integer coordinates are insufficient [Fra 84]. But the computation of the intersection of line segments described by integer coordinates needs only a limited number of digits to represent the intersection points by rational numbers. Let n be the number of digits of the integers of the input map; then the number of digits of the numerator of the intersection points is smaller than 2*n+4 and the number of digits of the denominator is smaller than 3*n+4 (see [Bri 90]).

3. If the input maps of an overlay or of another operation producing intersections result from an analysis operation (thus containing rational coordinates), the same number of digits as in statement 2 is sufficient for the representation of the intersection points. This is due to the fact that no line segments connecting intersection points are introduced.

Under realistic assumptions, rational coordinates of finite precision are an easy, relatively efficient, and numerically exact coordinate representation for geographic information systems.
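Statement 2 can be illustrated with Python's arbitrary-precision rationals (an illustrative sketch, not the paper's Modula-2 code; the function name is our own): intersecting two line segments with integer endpoints yields exact rational coordinates.

```python
from fractions import Fraction

def segment_intersection(p1, p2, q1, q2):
    """Exact intersection of the two lines through the integer-endpoint
    segments (p1, p2) and (q1, q2), as a pair of Fractions, or None if the
    lines are parallel.  A caller would still check that the point lies on
    both segments."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, q1, q2
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:                     # parallel or collinear
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    x = Fraction(a * (x3 - x4) - (x1 - x2) * b, denom)
    y = Fraction(a * (y3 - y4) - (y1 - y2) * b, denom)
    return x, y

# Crossing diagonals of the unit square meet exactly at (1/2, 1/2):
print(segment_intersection((0, 0), (1, 1), (0, 1), (1, 0)))
```

Since numerator and denominator are built from products and sums of the input integers, their digit counts stay within the bounds of statement 2; fixed-size integers of those widths suffice, which is what makes the finite-precision rational representation practical.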
With this approach, plane-sweep algorithms are absolutely robust. For an efficient use of rational coordinates, adequate support by hardware and system software is desirable but lacking today.

4 Coupling spatial access methods and plane-sweep algorithms

The database system of a GIS must support efficient query processing as well as efficient manipulation and combination of maps. To fulfill these requirements we assume that the database system uses suitable spatial access methods (SAMs) for the management of the database. In particular, this allows extracting the relevant parts (maps) from the seamless database. In the following we assume that each map layer is organized and supported by its own SAM, because in GISs efficient access to a map of one topic, e.g. land utilization or soil pollution, is desirable.

An often-used technique to store areas with SAMs is to approximate them by minimal bounding boxes (MBBs). MBBs preserve the most essential properties of geometric objects, i.e. the location of the object and its extension in each axis. The query processing is carried out in two (or more) steps. MBBs are used as a first filter to reduce the set of candidates. The second step (refinement) examines those candidate polygons by decomposing them into simple spatial objects such as convex polygons, triangles, or trapezoids ([KHHSS 91a], [KS 91]). To test the polygons for intersection with a sweep line or query rectangle, MBBs are a sufficient first filter.

In the following we assume that the SAM organizes the access to the objects of a database using a tree-like directory. Such access methods are adequate for handling non-uniform spatial data [KSSS 89]. The inner nodes are called directory pages; the leaves of the tree are data pages. The data and directory pages of a SAM define a partition of the data space.
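The two-step filter/refinement idea can be sketched as follows (a minimal illustration with our own helper names; the refinement step on the exact polygon geometry is omitted):

```python
def mbb_intersects(a, b):
    """First filter: do two minimal bounding boxes (x1, y1, x2, y2) overlap?"""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def filter_candidates(query_mbb, mbbs):
    """Filter step of the two-step query processing: keep the indices of all
    MBBs intersecting the query rectangle.  The refinement step would then
    test the exact geometry of each candidate polygon."""
    return [i for i, mbb in enumerate(mbbs) if mbb_intersects(query_mbb, mbb)]
```

The filter is cheap (four comparisons per box) and never discards a true answer, which is exactly what makes MBBs a sufficient first filter for sweep-line and query-rectangle tests.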
In our case, a record in a data page consists at least of (a) an MBB, (b) the value of the thematic attribute, and (c) a polygon description or a pointer to such a description, depending on the size of the polygon; see [KHHSS 91b].

As mentioned in the introduction, the database and the maps in a GIS may be very large. Therefore it is not useful to keep all maps in main memory, especially not in multi-user systems. In systems with a virtual storage manager the efficiency could decline through a large number of page faults. Instead of processing the maps completely, it is more efficient to partition the maps and to carry out the plane-sweep algorithms on these partitions. One approach is to partition the map using a uniform grid as in [Fra 89]. Obviously, this is not the best way because a non-uniform data distribution is not adequately handled by this approach. We will partition the map by using SAMs and the plane-sweep technique. Another important reason to partition the maps is the running time of plane-sweep algorithms, which is often more than linear. By partitioning we reduce the number of polygons and edges which have to reside in main memory while performing the plane sweep. This reduces the running time for the complete plane sweep.

4.1 Sweep-line partition

For a plane-sweep algorithm only those polygons are relevant which intersect the sweep line. Thus we have a criterion for partitioning by the algorithm itself: only polygons intersecting the sweep line, or close to the sweep line, are kept in main memory. In terms of SAMs this means reading data pages from secondary storage as soon as the sweep line intersects them. We call this approach sweep-line partition.

Sweep-line partition and transformation

For example, the sweep-line partition can be realized by the transformation technique [SK 88]. This technique transforms the coordinates of a 2-dimensional MBB into a 4-dimensional point.
There are two representations of such points: (1) the center representation consists of the center of the rectangle (cx, cy) and the distances of the center to the sides of the rectangle (ex, ey); (2) the corner representation stores the lower left corner (x1, y1) and the upper right corner (x2, y2) of the box. The 4-dimensional points are stored by a suitable multidimensional point access method, e.g. the grid file [NHS 84], PLOP-hashing [KS 88], the BANG file [Fre 87], or the buddy tree [SK 90].

In the following we use the transformation technique with corner representation. The SAM uses its own sweep line. This sweep line is driven by the partition of the SAM. Performing a plane-sweep algorithm, we must synchronize the sweep line of the algorithm and the sweep line of the SAM. When the sweep line of the algorithm overtakes the sweep line of the SAM, new data pages must be read from secondary storage by the SAM. An example is depicted in fig. 8.

Figure 8: Sweep-line partition and realization by transformation (only the x-dimension is shown)

The sweep-line partition is also applicable to the other techniques (i.e. clipping and overlapping regions [SK 88]) and to access methods inherently using these techniques, e.g. the R-tree [Gut 84] and the R*-tree [BKSS 90] (overlapping regions), or the cell tree [Gün 89] (clipping).

Performance

Using the sweep-line partition reduces the number of page faults considerably because only those parts of the maps intersecting the sweep line reside in main memory. Minimizing the number of page faults during the algorithm improves the overall performance. This gain of efficiency is only slightly reduced by the following effect: without partition, every required page is accessed exactly once. The pass through the tree of the SAM according to the sweep line may cause several accesses to the same directory page. However, the number of accessed directory pages is, compared to the total number of read pages, generally very small.
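Under the corner representation, the set of boxes cut by a vertical sweep line reduces to a simple range condition on the 4-dimensional points. A small sketch (our own helper names; a linear scan stands in for the point access method):

```python
def corner_point(mbb):
    """Transformation technique: a 2-d MBB (x1, y1, x2, y2) is stored as the
    4-dimensional point (x1, y1, x2, y2) in a point access method."""
    x1, y1, x2, y2 = mbb
    return (x1, y1, x2, y2)

def cut_by_sweep_line(points, sweep_x):
    """The boxes intersecting the vertical sweep line at sweep_x are exactly
    the 4-d points with x1 <= sweep_x <= x2.  A point access method would
    evaluate this as a range query; here we scan for illustration only."""
    return [p for p in points if p[0] <= sweep_x <= p[2]]
```

Synchronization then amounts to issuing this range query whenever the algorithm's sweep line passes the SAM's sweep line, so that the data pages of newly cut boxes are fetched just in time.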
In table 4, the space requirements of real maps are listed (assuming an R*-tree with pages of 2 KB):

Map                          Data           Directory     Ratio
Africa (countries)           4679348 byte   3924 byte     0.084 %
Africa (topography)          5528816 byte   46332 byte    0.831 %
Latin America (countries)    5178440 byte   10332 byte    0.199 %
Latin America (topography)   3785440 byte   51480 byte    1.342 %
EC (regions)                 1126360 byte   29916 byte    2.587 %

Table 4: Space requirements of data and directory

4.2 Strip partition

Contrary to a partition driven by the sweep line, an orthogonal partition is possible. To support the plane sweep, it is sensible to divide the map into strips S_i which extend over the whole map (strip plane sweep). In the following we assume proceeding from the top strip S_1 to the bottom strip S_m (see figure 9). The strip partition shortens the length of the sweep line, which decreases the running time. The height of the strips may vary to adapt the partitions to non-uniform data distributions.

Figure 9: Strip partition

Some areas of a map may intersect more than one strip. One solution is to read all necessary areas for each strip. The consequence is that many data pages are accessed several times. Therefore, this procedure is too costly. Another way is to store such areas temporarily in a buffer. Those areas are an additional input to the next strip plane sweep. Thus, every area of a map can be assigned to exactly one strip S_i and needs to be read from secondary storage only once. As in section 4.1 we assume that each accessed data page is completely read. Areas not intersecting the current strip are buffered. The access to the areas of one strip corresponds to standard region queries which supply all polygons intersecting the strip. There is only one exception: data pages accessed by previous strips are not read again. We call this kind of query a modified region query. Such queries are performed very efficiently by the R*-tree [BKSS 90], a variant of the well-known R-tree [Gut 84].
This is caused by the minimization of area, margin, and overlap of directory rectangles in the R*-tree.

Generating an optimal strip partition

In the following, we describe how an optimal strip partition of the map is generated. An optimal strip partition adapts the strips to the distribution of the areas of the map, exploits the size of main memory, and avoids page faults. The strip partition is best supported by using an efficient SAM, such as the R*-tree.

As mentioned, the areas of each map, which are simple polygons with holes, are approximated by minimal bounding boxes, which preserve the location and the extension of the areas. The number of bytes representing an area is assigned to its MBB. This is necessary because we cannot expect in GIS applications that each area is described by the same number of bytes. Each data page represents a set of areas. Thus, for each data page the number of bytes can be calculated which is necessary to store the data of this page in main memory. This information is stored in the lowest level of the directory.

In a preprocessing step of a plane-sweep algorithm, we determine the data pages which intersect the map. Each data page of the SAM corresponds to a region of the data space; e.g. the regions of the R*-tree are rectangles. These regions are sorted in descending order according to the highest y-coordinate of the region. Initially, the buffer which stores areas not completely processed by a strip sweep line is empty. According to the order mentioned above, the first k regions are determined such that the sum of bytes represented by these k regions is smaller than the size of main memory minus the size of the buffer. Thus, the first strip S_1 is limited by the highest y-coordinate of the (k+1)st data page. The areas which are not handled completely in the first strip sweep line are stored in the buffer. The above procedure is iterated.
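The iteration just described can be sketched as a greedy pass over the sorted page regions (a simplified model with our own names: each region is a (top_y, size_bytes) pair already sorted by descending top_y, and the buffer size is taken as a fixed estimate rather than being updated per strip):

```python
def strip_partition(regions, memory_bytes, buffer_bytes=0):
    """Greedy strip partition.  regions: (top_y, size_bytes) pairs sorted by
    descending top_y (highest y first).  A strip collects regions as long as
    their total size fits into main memory minus the buffer; the first region
    that no longer fits bounds the strip with its top_y and opens a new one."""
    strips, current, used = [], [], 0
    for top_y, size in regions:
        if current and used + size > memory_bytes - buffer_bytes:
            strips.append(current)      # strip ends at this region's top_y
            current, used = [], 0
        current.append((top_y, size))
        used += size
    if current:
        strips.append(current)
    return strips
```

Only directory information (region bounds and byte counts) is consulted, mirroring the point that the partition can be generated without touching any data pages.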
To illustrate this approach we present the following example, where the size of main memory is restricted to 8 megabytes (see figure 10).

Figure 10: Example of generating an optimal strip partition

The numbers 1 to 9 of the data pages indicate the order mentioned above. The size of the data pages 1 to 4 amounts to 7 MB. With the next data page the size of main memory would be exceeded (7 MB + 2 MB > 8 MB) and page faults could occur. Therefore, the first strip ends at the highest y-coordinate of data page 5. Let us assume that after the first strip plane sweep 0.25 MB are stored in the buffer. Then the second strip can be extended as long as 7.75 MB are not exceeded. Thus, the data pages 5 to 7 are associated with the second strip. Finally, the data pages 8 and 9 and the regions stored in the buffer are accommodated in the third strip.

Generating the optimal strip partition is not time-intensive because only directory pages are read to get the necessary information, such as the size of the data pages and the bytes represented by the data pages. Data pages are only read from secondary storage when the plane-sweep algorithm is actually performed. The ratio of read directory pages to read data pages is very small when performing a plane-sweep algorithm (compare section 4.1).

5 Parallel processing of plane sweeps

In recent years there have been many efforts to design, manufacture, and utilize computer architectures with multiple, parallel central processing units (CPUs). Computers using such architectures are called multiprocessor systems or parallel computers. Their main objective is to increase the performance compared to one-processor systems. Future database systems, and particularly spatial database systems of GISs, will have to pursue using such architectures. This is especially important for time-consuming operations such as the map overlay or related GIS operations.
The use of parallel architectures necessitates the development of spatial access methods which support parallel access to the database and (if possible) utilize the parallel architecture. The second goal is the design of parallel algorithms which exploit the parallelism offered by the architectures and the parallelism hidden in the problem in the best possible way. In this section we demonstrate such an exploitation of parallelism for plane-sweep algorithms.

There exist different types of multiprocessor systems. We assume that each CPU has its own main memory (local memory). Parts of the memory may be shared. One important characteristic of multiprocessor systems is the communication between the processors. There exist many interconnection networks in such systems ([GM 89], [SH 87]), e.g. static networks such as rings, trees, hypercubes, or grids. Modern architectures allow dynamic routing between arbitrary processors. In the following, we assume a linear arrangement where each processor can communicate with its direct neighbor. Such a structure can be realized by most interconnection networks.

The strip partition seems to be the best candidate for a parallel execution of plane sweeps. A natural approach is to process the strips simultaneously and independently. But there are some problems: as mentioned in section 4.2, there exist areas which intersect more than one strip. If we perform the plane sweeps independently, many data pages must be read from secondary storage several times. This effect decreases the performance of the approach considerably. Another problem is that we may need the results of strip S_{i-1} for processing strip S_i, which is e.g. necessary for the plane-sweep overlay of thematic maps without complete cover by the areas. Therefore, we have to synchronize the strip processing. We introduce a second sweep line for each strip indicating the part of the map which is already processed completely.
The first sweep line of strip S_{i+1} is not allowed to overtake the second sweep line of S_i. The processing of S_{i+1} is suspended if necessary. An example of parallel strip processing is shown in figure 11.

Figure 11: Parallel strip processing

Parallel plane-sweep overlay

This approach can be realized for the plane-sweep overlay [KBS 91] with small extensions to the original algorithm: for maintaining the second sweep line, we need a new data structure L. The current x-position P of the first sweep line and an identification number of the region are inserted into L when a new region is starting. L is ordered by P and can be implemented using a balanced tree. The position P and the region ID are also stored, in addition to the edges, in the sweep line status. If the algorithm detects that two regions with different region IDs are identical, the entry with the P starting further to the right is deleted from L. When a region is closed, the associated entry is deleted from L. If this entry was the minimum entry, the second sweep line is set to the position of the new minimum entry of L. This processing is illustrated in figure 12. Other plane-sweep algorithms can be modified in a similar way.

Figure 12: Update of the data structure L

6 Conclusions

In this paper, we demonstrated the fruitfulness of combining spatial access methods and computational geometry concepts, in particular the plane-sweep paradigm, in order to increase the efficiency of geographic database systems. The marriage of these two areas was enabled by the property that the spatial access method supports the plane-sweep paradigm. Since plane-sweep processing generates results in sorted order, the spatial access method must be robust with respect to sorted insertions for storing these results.
As an example providing good performance, we presented the plane-sweep map overlay, which is a very important analysis operation in a geographic information system. Good analysis and retrieval performance are important factors for good user acceptance of a GIS. Thus, in our future work, we will design efficient algorithms based on spatial access methods and computational geometry for all retrieval operations. Performance improvements which exceed those realized in this paper by coupling spatial access methods and computational geometry are feasible by using processors suitable for rational numbers and by implementing parallel GIS algorithms on multiprocessor systems. These issues are important goals of future work.

Acknowledgement

We thankfully acknowledge receiving real data representing the national administrative divisions of the European countries from the Statistical Office of the European Communities. Further real data are taken from the World Data Bank II. Additionally, we would like to thank Holger Horn for making his map generator available to us.

References

[BKSS 90] Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data, 322-331, 1990
[Bri 90] Brinkhoff, T.: Map Overlay of Thematic Maps Supported by Spatial Access Methods. Master's thesis (in German), University of Bremen, 1990
[Bur 86] Burrough, P.A.: Principles of Geographical Information Systems for Land Resources Assessment. Oxford University Press, 1986
[Fra 84] Franklin, W.R.: Cartographic Errors Symptomatic of Underlying Algebra Problems. Proc. Int. Symp. on Spatial Data Handling, Vol. I, 190-208, 1984
[Fra 87] Frank, A.U.: Overlay Processing in Spatial Information Systems. Proc. 8th Int. Symp. on Computer-Assisted Cartography (Auto-Carto 8), 16-31, 1987
[Fra 89] Franklin, W.R. et al.: Uniform Grids: A Technique for Intersection Detection on Serial and Parallel Machines.
Proc. 9th Int. Symp. on Computer-Assisted Cartography (Auto-Carto 9), 100-109, 1989
[Fre 87] Freeston, M.: The BANG File: A New Kind of Grid File. Proc. ACM SIGMOD Int. Conf. on Management of Data, 260-269, 1987
[FW 87] Franklin, W.R., Wu, P.Y.F.: A Polygon Overlay System in Prolog. Proc. 8th Int. Symp. on Computer-Assisted Cartography (Auto-Carto 8), 97-106, 1987
[GM 89] Gonauser, M., Mrva, M. (eds.): Multiprozessor-Systeme: Architektur und Leistungsbewertung. Springer, 1989
[Gün 89] Günther, O.: The Design of the Cell Tree: An Object-Oriented Index Structure for Geometric Databases. Proc. IEEE 5th Int. Conf. on Data Engineering, 598-605, 1989
[Gut 84] Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. ACM SIGMOD Int. Conf. on Management of Data, 47-57, 1984
[KBS 91] Kriegel, H.-P., Brinkhoff, T., Schneider, R.: An Efficient Map Overlay Algorithm Based on Spatial Access Methods and Computational Geometry. Proc. Int. Workshop on DBMSs for Geographical Applications, Capri, May 16-17, 1991
[KHHSS 91a] Kriegel, H.-P., Heep, P., Heep, S., Schiwietz, M., Schneider, R.: An Access Method Based Query Processor for Spatial Database Systems. Proc. Int. Workshop on DBMSs for Geographical Applications, Capri, May 16-17, 1991
[KHHSS 91b] Kriegel, H.-P., Heep, P., Heep, S., Schiwietz, M., Schneider, R.: A Flexible and Extensible Index Manager for Spatial Database Systems. Proc. 2nd Int. Conf. on Database and Expert Systems Applications (DEXA '91), Berlin, August 21-23, 1991
[KS 88] Kriegel, H.-P., Seeger, B.: PLOP-Hashing: A Grid File without Directory. Proc. 4th Int. Conf. on Data Engineering, 369-376, 1988
[KS 91] Kriegel, H.-P., Schneider, R.: The TR*-tree: A New Representation of Polygonal Objects Supporting Spatial Queries and Operations.
Submitted for publication, 1991
[KSSS 89] Kriegel, H.-P., Schiwietz, M., Schneider, R., Seeger, B.: Performance Comparison of Point and Spatial Access Methods. Proc. 1st Symp. on the Design of Large Spatial Databases, 1989. In: Lecture Notes in Computer Science 409, Springer, 89-114, 1990
[NHS 84] Nievergelt, J., Hinterberger, H., Sevcik, K.C.: The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Trans. on Database Systems, Vol. 9, No. 1, 38-71, 1984
[Nie 89] Nievergelt, J.: 7±2 Criteria for Assessing and Comparing Spatial Data Structures. Proc. 1st Symp. on the Design of Large Spatial Databases, 1989. In: Lecture Notes in Computer Science 409, Springer, 3-28, 1990
[Oos 90] Oosterom, P.J.M.: Reactive Data Structures for Geographic Information Systems. PhD thesis, Department of Computer Science, Leiden University, 1990
[PS 88] Preparata, F.P., Shamos, M.I.: Computational Geometry. Springer, 1988
[Sam 89] Samet, H.: The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989
[SH 87] Siegel, H.J., Hsu, W.T.: Interconnection Networks. In: Milutinovic (ed.): Computer Architecture: Concepts and Systems. North-Holland, 225-264, 1987
[SK 88] Seeger, B., Kriegel, H.-P.: Techniques for Design and Implementation of Efficient Spatial Access Methods. Proc. 14th Int. Conf. on Very Large Data Bases, 360-371, 1988
[SK 90] Seeger, B., Kriegel, H.-P.: The Buddy-Tree: An Efficient and Robust Access Method for Spatial Database Systems. Proc. 16th Int. Conf. on Very Large Data Bases, 590-601, 1990
[Til 80] Tilove, R.B.: Set Membership Classification: A Unified Approach to Geometric Intersection Problems. IEEE Trans. on Computers, Vol. C-29, No. 10, 874-883, 1980
[Tom 90] Tomlin, C.D.: Geographic Information Systems and Cartographic Modeling. Prentice-Hall, 1990

FI-Quadtree: A New Data Structure for Content-Oriented Retrieval and Fuzzy Search

J.P. Cheiney, A.
Touir

Ecole Nationale Supérieure des Télécommunications
46, rue Barrault - 75013 Paris - FRANCE
e-mail: {cheiney, touir}@inf.enst.fr

Abstract: In this paper, we focus on the problem of content-oriented retrieval in an image database. This problem can be interpreted as the search and selection of images containing any pattern introduced beforehand by the user. We propose a data structure suited to this kind of manipulation, the Full Inverted Quadtree (FI-Quadtree). The structure represents a set of images within a single quadtree. We analyze the distribution of the data in the base and the operations of insertion and selection, and we report some experimental results.

1 Introduction

New multimedia database systems require efficient spatial data handling [Orenstein86, MeyerWegener89, Chang89]. The large amount of information and the unstructured characteristics of the data bring in new problems of storage and manipulation. The problem is how to store a large amount of unformatted data for efficient search and retrieval. A convenient data organization for multimedia databases is still a problem to be solved today. Data structures for storing images must have multiple and contradictory characteristics. In fact, there is a trade-off between retrieval capabilities and space occupation, and this is an important issue for image handling. In several approaches, the accesses to an image use alphanumeric or graphic descriptors [Tamura84, Woelk86, Chang88]. Thus, queries have to be composed with criteria that are present in these descriptors. The main problem of this type of image database is the limitation of the searching capabilities to a set of previously defined criteria. However, many applications need to manipulate the images directly. Spatial operations, editing facilities, and content-oriented retrievals require direct handling of the bitmap representation.
A convenient data structure has to be as compact as possible, and to support access paths corresponding to different types of manipulation. A bitmap-compatible structure seems to be an attractive solution. The quadtree [Samet90a,b] is such a structure, which provides an interesting technique to code images in a compact way [Cheiney90]. The quadtree is a hierarchical data structure used to organize an object space. An object can be either a point, a line segment, a quadrant or a rectangle. This data structure has been widely used in computer vision, geographic information systems and geometric modeling. Its main advantage is due to its compactness and regularity. Its principle of encoding is to partition each object (e.g. an image) into homogeneous quadrants and to label each of them. A homogeneous quadrant can be either white or black, and it is associated with a leaf (terminal) node of the quadtree. A nonhomogeneous quadrant is considered as a gray quadrant and is associated with a nonterminal node of the quadtree. A recursive cutting is applied to the binary image: a quadrant is cut into four equal parts until a homogeneous quadrant or a pixel is reached. Each quadtree node is labelled. This label, which we call a prefix, is usually obtained by using Morton's code [Morton66]. The Morton's code of a node is built by interleaving the bits of the x and y coordinates of the upper left corner of the quadrant that corresponds to the node. A prefix may have several representations. In this paper, we suppose that a prefix can be defined either by its binary label, or by its length (the number of bits that compose it) and its decimal value. The length of the binary labels depends on the image definition. Within a 2^N x 2^N bitmap, the maximum length is 2N. Some varieties of quadtrees have been proposed. Each one is more or less adapted to manipulate specific data. Linear quadtrees [Gargantini82] are used to code binary images, where each node corresponds to a black quadrant.
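The bit interleaving behind Morton's code can be sketched as follows (the order in which an x-bit or a y-bit comes first is a convention not fixed by the text above; the sketch picks x first):

```python
def morton_code(x, y, n_bits):
    """Interleave the bits of the x and y coordinates of a quadrant's upper
    left corner, most significant bits first, yielding the node's prefix."""
    code = 0
    for i in range(n_bits - 1, -1, -1):
        code = (code << 1) | ((x >> i) & 1)   # next bit of x
        code = (code << 1) | ((y >> i) & 1)   # next bit of y
    return code
```

Within a 2^N x 2^N bitmap, n_bits = N and the resulting binary label has length 2N, matching the maximum prefix length stated above; shorter prefixes label the coarser quadrants.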
Nodes of the PM Quadtree [Samet84] represent line segment data. The PR Quadtree [Ang89] is used to code point and region data. The principle of the PR Quadtree is that a node is split into four nodes one level deeper only if the number of points in its corresponding quadrant reaches a given limit, called the node capacity. Another kind of quadtree is the MX-CIF Quadtree [Kedem82]. It is used to represent rectangle data. The principle of the MX-CIF Quadtree is to associate with each rectangle the quadtree node corresponding to the smallest quadrant which contains it. The decomposition of quadrants is done recursively until no quadrant contains a rectangle. These latter two types of quadtrees accept more than one data item in a node. Moreover, using the MX-CIF Quadtree, data can be in any node (root, terminal or nonterminal nodes), whereas the use of the other kinds of quadtrees allows data to be stored only in leaves (terminal nodes). The essential characteristic of all these types of quadtrees is to associate a whole quadtree with each complex object (image, plan, map, ...). Unfortunately, none of these kinds of quadtree is well adapted to pattern searching. A query like "for any given pattern, select the images that contain it" is difficult and expensive to process. In this paper, we focus on this type of problem and we propose an original method that copes with this kind of query. The word "contain" used in such a query means "contain, as near as makes no difference". Therefore, the problem is to execute a fuzzy search and to select the images that contain any given pattern. To solve this problem, we propose in this paper a new type of quadtree, called the FI-Quadtree (F = full and I = inverted). In this structure, a set of images is encoded into a single quadtree. The main idea is to invert the representation. A classical representation associates with each Image Identifier the set of prefixes encoding the corresponding bitmap.
In the inverted approach, we associate with each possible prefix the set of Image Identifiers of the images including the considered prefix. The paper is organized as follows: in Section 2, we define the FI-Quadtree and specify its characteristics; Section 3 gives some details about pattern matching within the proposed structure; Section 4 focuses on the fuzzy search capability provided by the FI-Quadtree; Section 5 reports some experimental results. Finally, we conclude in Section 6.

2 The FI-Quadtree data structure

The coding of an image database using the quadtree structure generates an image space. A quadtree is built and associated with each inserted image. Thus, if we execute a query such as the one mentioned above, we have to check the existence of the pattern in each quadtree. It is clear that the execution of such operations on a set of images is very expensive. In this section, we show how to minimize the data manipulation for this kind of query.

2.1 The main idea

In a classical quadtree approach, each image is coded into a set of prefixes. A searched pattern can also be expressed as a set of prefixes. Thus, the search consists in scanning each image in order to detect a pattern matching between the image and the searched pattern. The aim of the query processing is to select the identifiers of valid images. A large amount of the complexity of the operation is due to the sequential characteristic of the adopted strategy. A set of images is composed of a large set of prefixes (each image has its own set of prefixes). In fact, many of these prefixes are redundant. The greater the number of images, the larger the redundancy. By contrast, we suggest processing all the images at a time. The main idea is to invert the search. If we link each prefix to the set of images where it occurs, we can see that the data redundancy is minimized.
Moreover, the search uses the data structure directly to determine the identifiers of the images where a particular prefix occurs. The search for a pattern comprising a set of prefixes leads us to intersect several lists of identifiers to obtain the relevant identifiers. This is the basis of the proposed data structure, the FI-Quadtree (Full and Inverted Quadtree). The structure is Full, because all the possible prefixes must be built as entries. The quadtree is Inverted, because the nodes' data are the identifiers. There are three main advantages to this technique: first, each possible matching is investigated only once; second, the space occupied by the set of images is bounded; third, the fuzzy search is not very costly. The image space is coded with a single quadtree called the FI-Quadtree. A node field of the FI-Quadtree is composed of a list of Image Identifiers (ImIds). A particular quadrant corresponds to each node; this quadrant is black in every image whose identifier is in the ImIds list. Consequently, the database is composed of only one special image. We can see this image as a meta-image (illustrated in Figure 1) which is structured as an FI-Quadtree. The inserted images make up the meta-image.

2.2 Characteristics of the FI-Quadtree

In this section, we give some characteristics of the FI-Quadtree and we show how to build and use it. We suppose that all the inserted images have a size of 2^N x 2^N pixels.

Definition: We define the FI-Quadtree as a complete quaternary tree, where: (1) each node field is composed of a list of ImIds; an ImId present in a given node is the identifier of an image in which the quadrant corresponding to this node is black; if the field of a node is empty, its corresponding quadrant is white in all the inserted images; (2) all the node fields of the FI-Quadtree have the same size, corresponding to a maximum number of Image Identifiers.
However, if one field is full and data cannot be inserted into it, its size is augmented, and so are all the others.

Properties: (1) this particular tree is full, so all the nodes (terminal and non-terminal) of an FI-Quadtree are considered black ones; consequently, the FI-Quadtree is composed of exactly (4^(N+1) - 1)/3 nodes; (2) contrary to a classic quadtree, where all the represented quadrants are disjoint, those of the FI-Quadtree are either disjoint or embedded.

Figure 1: Building an FI-Quadtree
Figure 2: Evolution of the FI-Quadtree corresponding to the sequence of insertions

When no image is in the database, all the ImIds lists of the FI-Quadtree nodes are empty (Figure 2.i). The insertion of a given image consists in computing the prefixes of all its quadrants, and in inserting the identifier of this image in all the FI-Quadtree nodes that correspond to these prefixes. Figure 2.ii represents the FI-Quadtree after the insertion of the first image of Figure 1.d, whereas Figure 2.iii represents its state after the insertion of the second image of Figure 1.e; the evolution of the meta-image is represented in Figures 1.a, 1.b and 1.c respectively. Thus, for all the images in the database, accessing any node of the FI-Quadtree tells us all the images that contain the corresponding quadrant of this node. According to the definition of the FI-Quadtree, the search for a given pattern is realized by accessing the FI-Quadtree nodes, where special treatment is done. We analyze this treatment in more detail in the next section.

3. Pattern Matching within an FI-Quadtree

In this section, we discuss the search within an FI-Quadtree. The problem is to build an access function providing direct access to the pertinent node from the searched pattern. However, the nodes are stored in disk pages; it is useful to maintain this storage order according to the node prefixes (i.e. according to the quadtree order).
A hashing function will be investigated for this purpose.

3.1 Defining an Order

The chosen function has to be well adapted to the quadtree order. This criterion is important in order to minimize the I/O complexity when searching ordered patterns. The pattern to be searched is indeed coded as a linear quadtree, and the prefixes are represented in base two. Moreover, these prefixes are sorted in increasing (quadtree) order. Therefore, efficiency of the matching operation requires that the quadtree of the searched pattern and the addresses of the FI-Quadtree nodes (on disk) follow the same order. Any prefix P = p1 p2 ... pLp, with pi in {0, 1}, is characterized by the pair (Lp, Vp), where Lp is its number of bits and Vp = sum_{i=1..Lp} pi 2^(Lp-i) is its decimal value. For any couple of prefixes P and Q, defined respectively by (Lp, Vp) and (Lq, Vq), the quadtree order is the preorder of the underlying tree: P precedes Q if P is a prefix of Q, or if, at the first bit position where they differ, P carries a 0 and Q carries a 1.

The hashing function used is defined as follows: for any prefix P = p1 p2 ... pLp,

ℋ(P) = Lp/2 + sum_{i=1..Lp} pi * Qi,

where Qi = (4^(N - i/2 + 1) - 1)/3 if i is even, and Qi = 2 * (4^(N - (i+1)/2 + 1) - 1)/3 if i is odd. For example, in Figure 3, where N = 3, the prefix P = 010110 = (Lp = 6, Vp = 22) is associated with quadrant number 31: the bits set in P are p2, p4 and p5, so ℋ(P) = Q2 + Q4 + Q5 + Lp/2 = 21 + 5 + 2 + 3 = 31.

The quadtree can be implemented as a list, which allows sequential access, or as a tree, which permits direct access to a specific node. The hashing function described above permits implementing the FI-Quadtree as a list of prefixes while still allowing direct access. Thus, for any node, the access is performed in just one logical disk access.

3.2 Insertion Algorithm

As the FI-Quadtree is a full structure, it is not necessary to store the prefixes explicitly.
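The addressing scheme can be sketched as follows. This is a reconstruction (the function and helper names are ours): it numbers the nodes of a complete quaternary tree of depth N in preorder, which reproduces both the node-count property (4^(N+1) - 1)/3 and the paper's worked example ℋ(010110) = 31 for N = 3:

```python
# Sketch of quadtree-order (preorder) node addressing in a complete
# quaternary tree of depth N. Reconstruction consistent with the paper's
# worked example; names are illustrative.

def subtree_size(levels):
    # Number of nodes in a complete quaternary tree with `levels` levels:
    # 1 + 4 + ... + 4^(levels-1) = (4^levels - 1) / 3.
    return (4 ** levels - 1) // 3

def address(prefix, N):
    """Preorder address of the node reached by `prefix` (a bit string,
    two bits per quadrant digit); the root has address 0."""
    k = len(prefix) // 2          # depth of the node (Lp / 2)
    addr = k                      # one step down per level
    for j in range(k):
        d = int(prefix[2 * j : 2 * j + 2], 2)   # quadrant digit 0..3
        addr += d * subtree_size(N - j)         # skip d full sibling subtrees
    return addr

# Total number of nodes for N = 3: (4^(N+1) - 1)/3 = 85.
print(subtree_size(3 + 1))    # 85
# The paper's example: P = 010110 (Lp = 6, Vp = 22), N = 3 -> node 31.
print(address("010110", 3))   # 31
```

Because the numbering is preorder, consecutive addresses follow the quadtree order of the prefixes, which is what lets the linear quadtree of a pattern and the on-disk node layout be scanned in step.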
The prefixes are only used to compute the node addresses on disk, where the corresponding ImIds fields are stored. In an ImIds field, a single bit is associated with each ImId. For instance, using one byte to code the ImIds field allows us to insert eight images into the meta-image. If we suppose that one field is composed of α bytes, then the number of images in the meta-image can range from 1 up to a bound of 8α. It would be penalizing to fix in advance the number of images that the meta-image can store; therefore, we consider the meta-image to be dynamic. When it is empty or contains fewer than 8 images, just one byte is associated with each field. When there are 8 images, the FI-Quadtree is reorganized and two bytes are associated with each field. In the general case, when there are 8α images, the FI-Quadtree is reorganized and (α+1) bytes are associated with each ImIds field. To insert an image, we code it as a linear quadtree and, for each prefix P of the quadtree, the identifier of the image is inserted at ℋ(P). We note that the coding is performed using the notion of active nodes [Shaffer87].

4 Fuzzy Search

In this section, we expose and analyze the fuzzy search capability provided by the structure. First, we give an idea of the complexity of displacing the pattern through the FI-Quadtree; then we expose and analyze the principle of the fuzzy search. We consider a pattern composed of M_pref prefixes. These prefixes address a set of quadrants having a total of M_pix pixels. The pattern is contained in a minimal rectangle of L x l pixels. It is coded as a linear quadtree of M_pref prefixes. We have to find the image that contains it.

4.1 Moving in the FI-Quadtree

For a pattern included in an L x l minimal rectangle, the total number of translations through the meta-image is (2^N - L + 1) * (2^N - l + 1). Therefore, the larger this rectangle is, the smaller the number of translations.
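The position count can be checked with a one-line helper (illustrative; `n_translations` is our name, and the 100 x 100 pattern is an invented example):

```python
# Counting pattern positions in a 2^N x 2^N meta-image: a pattern bounded
# by an L x l rectangle can be placed at (2^N - L + 1) * (2^N - l + 1)
# distinct translations.

def n_translations(N, L, l):
    side = 2 ** N
    return (side - L + 1) * (side - l + 1)

# For 1024 x 1024 images (N = 10), a 100 x 100 pattern:
print(n_translations(10, 100, 100))       # 855625 positions
# Restricting the search to a user-given R x r region shrinks this to
# (R - L + 1) * (r - l + 1), e.g. for a 200 x 200 region:
print((200 - 100 + 1) * (200 - 100 + 1))  # 10201
```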
This operation is very costly [Walsh88, Touir90a]. However, note that the translation can be done simultaneously with the filtering [Touir90b]. Such a process allows the filtering to be stopped at a given position if it turns out that the pattern cannot exist at that position of the FI-Quadtree; this means that no image contains the pattern at such a position. The translation is then interrupted and performed again at another position. To reduce the number of translations, the user can indicate a particular region that could contain the pattern. If this region is R x r pixels, the number of translations is reduced to (R - L + 1) * (r - l + 1).

4.2 "Microscopic" Filtering

In this section, we present the principle of the filtering. First we give the definition of a naive global matching between a pattern and an image; then we introduce a new definition which is more precise. An image I contains a pattern M if all the prefixes of M match some prefixes of I. This definition is not sufficient and not precise. Indeed, suppose that the pattern is represented by (0000001010, 0000011000, 0000100, 0000111) and that there exists an image having a prefix P = 0000: all the prefixes of M match P, and consequently I would be said to contain M. The detection is thus not precise. To improve the detection and the filtering, we introduce two matching criteria corresponding to two filtering levels. The first is a microscopic level (or prefix level), represented by Definition 1, and the second is a macroscopic level (or pattern level), represented by Definition 2.

Definition 1: Let Q = q1 q2 ... qLq = (Lq, Vq) and P = p1 p2 ... pLp = (Lp, Vp) be two prefixes of M and I respectively; we say that Q matches P to within a factor δ (match_δ(Q, P)) if Lq = Lp + δ, δ ≥ 0, and qi = pi for all i ≤ Lp. We note that: if δ = 0, Q and P represent the same quadrant; if δ = 2, the quadrant associated with Q is included in that of P and has a quarter of its size; and so forth.
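The prefix-level test of Definition 1 reduces to a prefix check on bit strings (a sketch; the function name and the example prefixes are ours):

```python
# "Microscopic" matching: Q matches P to within a factor delta iff
# P is a prefix of Q and len(Q) = len(P) + delta (Definition 1).

def match_delta(Q, P, delta):
    return delta >= 0 and len(Q) == len(P) + delta and Q.startswith(P)

# delta = 0: same quadrant; delta = 2: Q is one level deeper, i.e. a
# quarter-size sub-quadrant of P's quadrant.
print(match_delta("0001", "0001", 0))    # True  (same quadrant)
print(match_delta("000110", "0001", 2))  # True  (quarter-size sub-quadrant)
print(match_delta("0011", "0001", 0))    # False (different quadrant)
```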
The larger δ is, the less precise the matching.

4.3 Filtering Distance

In a second definition, we take two new parameters into consideration. N_pref defines the number of prefixes of an image that match the pattern, and N_pix defines the number of pixels addressed by the set of N_pref prefixes.

Definition 2: We define the filtering ratio (or filtering distance) between M and I by:

d(M, I) = 1/2 * (N_pref / M_pref + N_pix / M_pix).

Note that if d(M, I) = 1, the match is exact. Furthermore, because for any image I, M_pref ≥ N_pref and M_pix ≥ N_pix, we have d(M, I) ≤ 1. Thus, the precision of the search can be defined by the user: the user gives a precision coefficient K (0 < K ≤ 1), and the system searches for any image I that verifies d(M, I) ≥ K. According to these definitions, we can compare the filtering distances of two images I1, I2 to a given pattern M: d(M, I1) > d(M, I2) if (N_pref1 / M_pref + N_pix1 / M_pix) > (N_pref2 / M_pref + N_pix2 / M_pix).

4.4 Fuzzy Filtering Algorithm

According to these two definitions, and to the definition of the hashing function ℋ, the principle of the search algorithm is as follows. For a given position of the pattern, two variables N_pref and N_pix are associated with each image and initialized to zero.

a- for each prefix P of the pattern resulting from the translation, access ℋ(P) and update (N_pref, N_pix) for each Image Identifier contained in the ℋ(P) field;
b- select the candidate image CI where d(M, CI) = MAX{d(M, I); I belongs to the set of inserted images};
c- insert CI into the set of candidate images;
d- select a new position of the pattern and go to step a;
e- select the image SI (the solution of the problem) where d(M, SI) = MAX{d(M, CI); CI belongs to the set of candidate images}.

4.5 Improvement of the Fuzzy Search

As introduced above (Section 2), the fuzzy search consists in displacing the pattern over the FI-Quadtree and, for each position, in performing a filtering and selecting a candidate image.
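The per-position selection relies on the filtering distance of Definition 2, which translates into a one-line function (a sketch; the argument names and the example counts are ours):

```python
# Filtering distance of Definition 2:
# d(M, I) = 1/2 * (N_pref / M_pref + N_pix / M_pix), so d(M, I) <= 1,
# with equality exactly for a perfect match.

def filtering_distance(n_pref, n_pix, m_pref, m_pix):
    return 0.5 * (n_pref / m_pref + n_pix / m_pix)

# A pattern with M_pref = 4 prefixes covering M_pix = 40 pixels:
exact = filtering_distance(4, 40, 4, 40)    # all prefixes and pixels match
partial = filtering_distance(3, 30, 4, 40)  # 3 prefixes / 30 pixels match
print(exact)    # 1.0  -> exact match
print(partial)  # 0.75 -> kept only if the user's precision K <= 0.75
```

Averaging the prefix ratio and the pixel ratio keeps either count alone from dominating: matching many tiny quadrants and matching one huge one both contribute.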
Thus, before beginning the search at a new position, the filtering corresponding to the previous position has to be completed and a candidate image selected. This method is not well adapted to the operation and generates a high I/O complexity. Indeed, if the searched pattern has almost the same size as the main-memory buffer associated with the FI-Quadtree, the number of disk accesses is nearly equal to the number of positions where the pattern has to be searched. To solve this problem of an excessive number of disk accesses, we execute the filtering in several steps. We filter the pattern at a set of positions, but only for the FI-Quadtree nodes that are in main memory; if a filtering is not complete and needs a disk access, it is temporarily interrupted and a new filtering is performed at another position of the pattern, so as to finish the partial filtering later.

5 Experimental Results

The FI-Quadtree has been implemented as part of the development of an Intelligent Search System in Graphics and Images [Cheiney90]. This implementation is realized on a Sun SPARC workstation. All the inserted images are 1024 x 1024 pixels. We checked the performance of the insertion, deletion and search operations. The measurements consist in computing the execution time and the number of disk accesses for each operation. In order to obtain more reliable estimates of the processing time, (1) each operation is applied many times and the average of all the obtained results is taken; (2) the inserted images are very disparate, and the number of prefixes that compose each of them varies between 25 and 500,000.

5.1 Insertion

The experimental results for this operation show that the insertion of a complex image (composed of more than 100,000 prefixes) requires a substantial processing time. This complexity is due to (1) the reading of the linear quadtree of the image to be inserted and (2) the processing and updating of the FI-Quadtree.
We checked this operation with different sizes of the buffer assigned to the FI-Quadtree. Figures 4 and 5 (where the buffer size is successively equal to 2048, 8192 and 16384 bytes) show that the processing time of this operation varies linearly with the complexity of the image, whereas the I/O complexity varies with the complexity of the image and with the distribution of the quadrants within the image.

Figure 4: Variation of the time with the number of prefixes
Figure 5: Variation of the I/O with the number of prefixes

5.2 Fuzzy search

It is difficult to give an exact idea of the complexity of this operation from the experimental results. Indeed, (1) the complexity of the search varies with the number of positions where the pattern has to be searched before reaching the solution: the searched pattern could be found immediately, or checked at every possible position without finding any solution. (2) The complexity of the search depends on the number of prefixes that compose the searched pattern: searching for two patterns composed respectively of 30 and 500,000 prefixes does not have the same complexity. Thus, we test this operation with patterns composed of various numbers of prefixes, and after searching at various positions noted P_i, i = 1, 2 and 3. Figure 6 shows the general behavior of this operation with the size of the searched pattern and P_i.

Figure 6: Variation of the complexity of the search with the number of prefixes

6. Conclusions

The main contribution of this paper is the proposal of a new quadtree-based data structure allowing a fuzzy search for patterns in an image database. We have investigated different types of manipulations (insertion, searching and data organization) within this structure, and we have shown that it is well adapted for content-oriented retrieval and fuzzy search.
We have supposed that the inserted images are binary ones and that the processing is sequential. In forthcoming work, we will investigate parallel processing for the fuzzy search using this data structure, and analyze its behavior.

7. References

[Ang89] C.H. Ang, H. Samet, "Node Distribution in a PR Quadtree", In Proceedings 1st International Symposium on Large Spatial Databases, Santa Barbara, USA, July 1989.
[Chang88] S.K. Chang, C.W. Yan, T. Arndt, D. Dimitroff, "An Intelligent Image Database System", IEEE Transactions on Software Engineering, Vol. 14, No. 5, 1988.
[Chang89] S.K. Chang, E. Jungert, Y. Li, "The Design of Pictorial Databases Based Upon the Theory of Symbolic Projections", In Proceedings 1st International Symposium on Large Spatial Databases, Santa Barbara, USA, July 1989.
[Cheiney90] J.P. Cheiney, B. Kerhervé, "Image Data Storage and Manipulations for Multimedia Database Systems", In Proceedings 4th International Conference on Spatial Data Handling, Zurich, Switzerland, July 1990.
[Gargantini82] I. Gargantini, "An Effective Way to Represent Quadtrees", In Communications of the ACM, Vol. 25, No. 12, 1982.
[Meyer-Wegener89] K. Meyer-Wegener, V.Y. Lum, C.T. Wu, "Image Management in a Multimedia Database System", In Proceedings IFIP WG 2.6 Working Conference on Visual Database Systems, Tokyo, Japan, April 1989.
[Kedem82] G. Kedem, "The Quad-CIF Tree: A Data Structure for Hierarchical On-Line Algorithms", In Proceedings 19th Design Automation Conference, Las Vegas, USA, June 1982.
[Morton66] G.M. Morton, "A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing", IBM Ltd., Ottawa, Canada, 1966.
[Orenstein86] J. Orenstein, "Spatial Query Processing in an Object-Oriented Database System", In Proceedings ACM SIGMOD '86 International Conference on Management of Data, Washington, USA, May 1986.
[Samet84] H. Samet, "The Quadtree and Related Hierarchical Data Structures", In ACM Computing Surveys, Vol. 16, No. 2, 1984.
[Samet90a] H. Samet, "Applications of Spatial Data Structures", Addison-Wesley, 1990.
[Samet90b] H. Samet, "The Design and Analysis of Spatial Data Structures", Addison-Wesley, 1990.
[Shaffer87] C.A. Shaffer, H. Samet, "Optimal Quadtree Construction Algorithms", In Computer Vision, Graphics and Image Processing, Vol. 37, No. 3, 1987.
[Tamoura84] H. Tamura, N. Yokoya, "Image Database Systems: A Survey", In Pattern Recognition, Vol. 17, No. 1, 1984.
[Touir90a] A. Touir, B. Kerhervé, "Shape Translation in Images Encoded by Linear Quadtree", In Proceedings IFIP TC 5.10 Working Conference on Modeling in Computer Graphics, Tokyo, Japan, April 1991.
[Touir90b] A. Touir, "Search Algorithms in Image Databases", Internal Report ENST/INF/BD/90_10.
[Walsh88] T.R. Walsh, "Efficient Axis-Translation of Binary Digital Pictures by Blocks in Linear Quadtree Representation", In Computer Vision, Graphics and Image Processing, Vol. 41, No. 3, 1988.
[Woelk86] D. Woelk, W. Kim, W. Luther, "An Object-Oriented Approach to Multimedia Databases", In Proceedings ACM SIGMOD '86 International Conference on Management of Data, Washington, USA, May 1986.

Meta-Knowledge and Data Models

THE IMPORTANCE OF METAKNOWLEDGE FOR ENVIRONMENTAL INFORMATION SYSTEMS¹

F. J. Radermacher
Forschungsinstitut für anwendungsorientierte Wissensverarbeitung (FAW)
Helmholtzstr. 16, D-7900 Ulm

Abstract

This paper deals with the importance of metaknowledge as a key topic for new and challenging applications of information systems. To make this point tangible, a number of interesting applications, particularly in the field of environmental information systems, are discussed. With regard to these applications, difficulties resulting from the distribution of information, from the implicit usage of knowledge by the people involved, and difficulties concerning the transformation of data into information are addressed. The insights given here essentially build on a number of projects and workshops on this topic at the FAW in Ulm.
1. The problem framework

Typical modern information systems process an abundance of data available from many sources, but metaknowledge about that data is usually either not available at all or not available in an explicitly given form. Often, enormous amounts of low-level data must somehow be aggregated to obtain meaningful insights. This task generally entails the application of methods within a particular model framework as a means of creating new data from available data. New data may address new questions, the clarification of ambiguities, or even the treatment of inconsistencies between different sources of (low-level) data. Alternatively, information may be aggregated as a basis for meaningful answers to overarching questions posed, for example, by high-level decision makers. Such questions occur frequently in connection with modern information systems [29]. They are addressed in a number of FAW projects, particularly topics in the area of environmental information systems [9, 11, 19, 28]. This is an economically and politically important field, relevant to both the public and government [2, 16, 19]. In the environmental area, the FAW projects ZEUS [13, 14, 18, 20] and WINHEDA [4, 28] deal with the integration of different data sources, using a number of particular models. Applications in ZEUS deal with water management and particularly with the identification of sites for ground-water monitoring stations [18]. The field of GIS [10] is also relevant for these projects and for the FAW project RESEDA [32], which deals with remote sensing. Remote sensing will eventually produce terabytes per day of valuable image data that cannot reasonably be processed with present techniques [27]. But in the long run, remote sensing constitutes one of the few realistic hopes for organizing regular monitoring of the state of the environment worldwide.

¹ This paper strongly builds on the articles [15, 28] given in the references.
This holds similarly for automation in chemical water analysis, as pursued, e.g., in the FAW project WANDA [35, 37]. The environmental area is also a field where the availability of metaknowledge will be of crucial importance. Data from many sources and quite different modeling frameworks will have to be integrated, and any broad automation in this respect will work only if formalized metaknowledge is available for all data bases and model bases involved. This is also particularly true for any natural language access to data bases, as done in the FAW project NAUDA. In all of these areas, the FAW is active in pursuing ways to help bring about the formulation of organizational and standardization frameworks that in the long run will make such metaknowledge available [11, 15].

2. The crucial importance of environmental monitoring and encountered problems with heterogeneity

Given the dangers to the state of the earth, due, above all, to overpopulation, increasing global consumption, and increasingly dangerous types of waste, one of the most urgent problems we must address is environmental monitoring on both a local and a global scale [16, 36]. A prominent example of such an effort is the environmental information system (UIS) of the State of Baden-Württemberg [2, 11, 21, 26], which addresses the (semi-)automatic access to the enormous bodies of information available on the status of the environment and tries to integrate this information with knowledge concerning administrative processes and responsibilities in this domain. Similar advances have also been made on a worldwide scale. For instance, the United Nations has initiated the UN Environment Programme, which includes the Earth-Watch programs Global Environment Monitoring System (GEMS), the Decentralized Environmental Information System (INFOTERRA), and the International Register of Potentially Toxic Chemicals (IRPTC). These UN activities are complemented by similar German and other European programs.
Of particular importance to the topics addressed here is the recent HEM initiative [22] within GEMS, whose focus is the harmonization of environmental measurements. In fact, this initiative is the first attempt on an international scale to fully address the topic of metaknowledge management for dissimilar sources of information on the environment. The HEM initiative reflects the negative experiences over the last decade with uncoordinated, non-standardized approaches to data collection: whenever a clear model framework was missing, graveyards of incomparable data have resulted more often than new knowledge sources.

3. Integration of distributed heterogeneous data bases

The next fundamental step in the integration of data base systems as part of information systems technology is the integration of distributed heterogeneous data bases (which is precisely the topic of the FAW project WINHEDA, referred to later). In fact, in a recent report to the National Science Foundation (NSF) [3], the US data base community stated that this task will be one of the greatest challenges in the data base field for the next decade. It is a field in which the US data base community is trying to maintain its present strong position in data base technology, and many activities are going on worldwide [1, 3, 5, 10, 31, 33, 34, 38]. Important topics discussed concern non-standard data base systems, autonomy, extensibility, cooperation, and federation of data base systems (for a number of clarifying examples from applications, cf. [34]). The paramount importance of the integration of distributed heterogeneous data bases results directly from the nature of many applications. That it now comes so strongly into the view of research programs is due to the progress in data base technology and, even more, in communication facilities, particularly computer networks.
Major technical problems result from different data structures, conceptual schemata, query languages, and network protocols [34]. But the hardest problems with integration are not technical in nature; rather, they are due to the semantics of the concepts and attributes used. This aspect also includes the possibility of inconsistent or contradictory data in the different data sources [1, 3, 34]. Addressing semantic differences is difficult and quite different from the more technical issues. The problem requires knowledge about the nature of the data stored in different places. Future automated solutions will require the representation and availability of particular knowledge that then must be appropriately used. One of the fundamental questions that will arise in this framework is whether concepts and attributes can be translated into a standardized frame of reference of a modest, manageable size [15, 23, 28], or whether an almost complete computer understanding of language, coupled with an efficient representation of general and common-sense knowledge of the world, will be required. The latter approach is followed in the CYC project [25] at the Microelectronics and Computer Technology Corporation (MCC) in Austin: a bold undertaking to achieve the described goal via the integration of up to 100 million pieces of knowledge in a huge integrated knowledge base and processing framework.

4. Dealing with metaknowledge

When different information sources have to be integrated and accessed automatically, the availability and proper use of metaknowledge concerning the different sources is essential [8, 15, 24]. This aspect is particularly evident in attempts to integrate inconsistent or even contradictory data, as is often the case with the integration of heterogeneous distributed data bases.
In this context, metaknowledge concerns the precise definition and classification of the data involved, in a kind of self-explanatory way; here again the necessity for a proper reference framework, or a system such as the one being developed in the CYC project at MCC, should be mentioned. Relevant aspects of the metaknowledge needed may be the quality of the data, its origin, its forecasting potential, its updating sequence, and so on. Note that such metaknowledge is also essential in providing any advanced natural language access to such systems, which is the topic of the FAW project NAUDA. This results from the need to answer questions more complicated than the information stored directly in the data base itself [11, 12, 33, 39]. Examples of such questions might be: What kind of information is stored in the data base? What is the quality of the data? What kinds of questions can, or cannot, be answered with a certain precision based on particular data? Furthermore, metaknowledge will have to include information on physical access paths to information systems, and also on access rights to knowledge sources. Within the HEM meta data base design project [22], mentioned above, which has been initiated by the UN, aspects of metaknowledge will include: the name of the data base, geographic scope, data content, keywords, date of inception, update frequency, measurement techniques, classification standards, accuracy, quality control, level of detail, geographic referencing, responsible organization, contact name, and conditions of access.

5. Data and information

In many public discussions, it has become common to emphasize the difference between data and information [15]. We often speak of "graveyards of data" which we cannot handle, while on the other side we identify a huge lack of real information. It is hard to formalize this difference, though. Generally, by data we mean quite simple and elementary aspects of some context.
Usually, the nature of basic data is quite clear, and such data is available in great quantity. Typically, one might think of certain values for chemical substances or the number of citizens living in particular areas of a town. Usually, with regard to a fact, there exists a clear method for identifying whether such a basic value is correct or not, be it the concentration of a chemical substance or whether persons live somewhere or not. This also means that we have the feeling that data is something solid, something clear, something not questionable, that can be obtained or inferred in a standardized way. Contrary to that, information seems to be something on a higher level, something that cannot be easily measured or obtained, but rather something that has to be compiled in a difficult and not so obvious way from many items of data with respect to particular questions. For the most part, decision makers need information to make their decisions, not plain data. For instance, they might seek information concerning the quality of the air in a particular area, or information concerning the development of the social structure of some residential area over time. Notions like quality, social structure, development, and so on, present the hardest problems. Usually, such relevant information is closely related to problems and tasks which must be addressed in a particular context. Certainly, it is often difficult to distinguish whether something is data or information. But if one looks into the historical context of gaining information from data in the area of environmental information systems, it has been the case that there were experts with great and detailed insight into the nature of the particular data involved in that process, the way this data was generated, and the way this data was used for providing answers. The intuition and personal insight of those users of the data and of the methods have guaranteed reasonable results to a considerable extent.
From that point of view, the strong involvement of experts from the environmental administration in any formulation of answers to questions from the political realm is well justified, though this procedure usually requires a lot of time. Recently, with all the increasing technical options and the expanding requirements from society, new developments have become unavoidable.

6. Greater distances between data and information

These new applications seek to integrate ever larger and more distributed sets of data to deal with ever more ambitious applications, in situations that are characterized by a high degree of expectation and personal involvement in society and politics. It is therefore necessary to involve the data in processes of aggregation, where the data come from vastly different places and islands of data. Unfortunately, given this distributed character, there is ever less pre-knowledge available concerning the origin of the data, the implicitly used models, and the potential of available algorithms. Information is missing both with the people and in the data itself. This results in severe limits on a sound integration of the data. Actually, it is a disturbing observation for our society that often it is impossible to obtain needed additional information concerning old data at all: the information providers are no longer there or cannot remember. Given this characteristic situation, the quest for meta-information is the need to store additional information with every data item, such as information concerning data sources, how to find related data, how to connect it with other data, physical access methods, and what the access rights are. The point of view taken here with respect to a solid use of data comes from philosophical epistemology [30] and sees the semantics of data as coming from a modeling context that tells us how algorithms might process the data.
Following such a recursive process, the information necessary for applications should eventually be obtained from higher levels of the data hierarchy and be composed of sound applications of algorithms to recursively obtained data. The essential aspects are therefore developments in the field of model management, which should lead to environments for the right kind of integration of models, algorithms, and data, in the sense of a model-driven application of algorithms to data in order to generate and further process such new data. This is certainly not a topic to be dealt with in the database context alone [34]. Other contributions will have to come, for instance, from model management approaches [6, 7, 16, 17, 30].

7. Information for high-level decision makers in the environmental field

In the field of the environment, it seems particularly difficult, from the data acquisition point of view, to be prepared even for some of the most important topics that might come up in the future [2, 21, 26]. Also, the algorithmic methods used and the background models involved are particularly difficult. We know, for instance, that distribution models for groundwater pollution that use simple linear interpolation methods will usually lead to strongly incorrect results. One also has to take into account that there is high political involvement in this topic and that many kinds of responsibility are involved. Consequently, there is much controversy surrounding the interpretation of data. On the other hand, it is a field that many private citizens and politicians want to be engaged in and where they want to find a proper course of action. So we should offer them, whenever possible, the ability to exploit huge data sets with the kind of methods capable of extracting answers to interesting questions; and the answers will often lead to follow-up questions.
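The model-driven application of algorithms to data described above can be sketched minimally as follows: a model binds an algorithm to input data and records that binding in the derived result, so the modeling context stays attached to the new data. All names here are illustrative assumptions, not an actual model management system:

```python
# Hypothetical sketch of model-driven derivation of new data.
# The derived item carries its modeling context as meta-information,
# so it can in turn serve as input at a higher level of the hierarchy.

def mean(values):
    """The algorithm applied by the model below (a simple area mean)."""
    return sum(values) / len(values)

class Model:
    def __init__(self, name, algorithm):
        self.name = name
        self.algorithm = algorithm

    def apply(self, inputs):
        result = self.algorithm(inputs)
        # Record which model derived the value and from how many inputs.
        return {"value": result, "derived_by": self.name, "n_inputs": len(inputs)}

# Example: aggregating station measurements into an area-level value.
air_quality_model = Model("area mean NO2", mean)
derived = air_quality_model.apply([21.0, 24.5, 19.5])
print(derived["value"], derived["derived_by"])
```

The derived record can recursively be fed into further models, which is exactly the composition of "sound applications of algorithms to recursively obtained data" that the text calls for.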
If we do not have this basis, we must take into account that a number of questions will never be asked, reducing the quality of the decision making [19, 29]. In this sense, this is an area where we have to be particularly concerned with handling these questions. This means that we have to deal with distributed heterogeneous data sources, and also that models, algorithms, and data have somehow to be integrated. It is therefore a particularly challenging field in which to study the questions of metaknowledge handling. Certainly, the work on the Baden-Württemberg Environmental Information System is particularly innovative in this respect. The FAW is especially glad to be involved in such basic and challenging work, and to bring the topics presented here into this framework. In particular, projects such as ZEUS, RESEDA, WANDA, WINHEDA, and NAUDA at the FAW provide many insights and practical achievements, and continually offer valuable feedback on what can, and cannot, be accomplished in this area.

Summary

(1) The really tough modeling and algorithmic tasks in environmental information processing cannot be automated at present. From that point of view, personal information systems, such as the issue-based information system approach (IBIS), are of importance [14].

(2) Chemical information can be represented but is very special in character [35].

(3) Remote sensing constitutes a great hope for systematic monitoring but has particularly challenging requirements. The coupling with GIS is necessary in that respect.

(4) The topic of distributed heterogeneous databases has to be addressed with much more emphasis. In doing so, the metaknowledge aspect is of crucial importance, as it is in natural language access systems. A major insight is that the right kind of database design in the first place makes it much easier to deal with both the heterogeneity and the natural language access.
(5) Object-oriented geoinformation systems coupled with tools from AI, statistics, and decision theory will result in considerable steps forward in the field of environmental information systems. Such progress will underline that by now mathematics and computer science have a great potential to offer when research for the environment has to be organized.

Acknowledgments

I would like to thank the many friends, colleagues, co-workers, and project partners who contributed to the ideas given in this text, through many discussions and through intensive project work.

References

1. Alonso, R.; Garcia-Molina, H.; Salem, K.: Concurrency Control and Recovery for Global Procedures in Federated Database Systems, IEEE Data Engineering 10, No. 3, 5-11, September 1987
2. Baumhauer, W.: Umweltpolitik in Baden-Württemberg am Beispiel des Umweltinformationssystems, BDVI-Forum 3/1989
3. Brodie, M. et al.: Database Systems: Achievements and Opportunities, Report of the NSF Invitational Workshop on Future Directions in DBMS Research, 1990
4. Endrikat, A.; Michalski, R.: The WINHEDA Prototype: Knowledge-Based Access to Distributed Heterogeneous Knowledge Sources, FAW Technical Report, FAW-TR-91012, 1991
5. Garcia-Molina, H.; Wiederhold, G.; Lindsay, B.: Research Directions for Distributed Databases, in: ACM SIGMOD Record - Special Issue on Future Directions for Database Research, W. Kim (Ed.), Vol. 19, No. 4, 1990
6. Gaul, W.; Schader, M. (Eds.): Data, Expert Knowledge and Decisions, Springer-Verlag, Berlin-Heidelberg-New York, 1988
7. Geoffrion, A.: Structured Modeling, UCLA Graduate School of Management, Los Angeles, 1988
8. Greenberg, B.V.: Developing an Expert System for Edit and Imputation, ECSC-EEC-EAEC, Brüssel, 1989
9. Günther, O.: Data Management in Environmental Information Systems, Proceedings 5. Symposium für den Umweltschutz, Informatik-Fachberichte, Springer-Verlag, Berlin-Heidelberg-New York, 1990
10.
Günther, O.; Buchmann, A.: Future Trends in Spatial Databases, in: IEEE Data Engineering Bulletin - Special Issue on Directions for Future DBMS Research and Development, W. Kim (Ed.), Vol. 13, No. 4, 1990
