Spatial Data mining

**M.Tech Seminar Report Submitted by:**

Subhasmita Mahalik 113050073 CSE,M.Tech1

Under the guidance of: Prof.N.L. Sarda

Department of Computer Science and Engineering Indian Institute of Technology, Bombay Mumbai

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to my guide, Prof. N. L. Sarda, for his constant encouragement and guidance. He has been my primary source of motivation and advice throughout my seminar study. I would also like to thank my friends for the support, suggestions and feedback they have given me.

Subhasmita Mahalik 113050073 CSE, M.Tech IIT Bombay

Contents

1. Introduction
2. Database primitives
2.1 Neighbourhood Relations
2.2 Neighbourhood graphs and their operation
2.3 Filter predicates
3. Knowledge Discovery
3.1 Classification
3.2 Clustering Methods
3.2.1 Complete Spatial Randomness, Cluster, and Decluster
3.2.2 DECODE : a new method of discovering cluster
3.3 Association Rules
3.4 Outlier Detection
4. Efficient polygon amalgamation method
4.1 Adjacency-based method
4.2 Occupancy-based method
5. Challenges and Applications
5.1 Challenges
5.1 Landslide-monitoring data mining
5.2 In visual data mining
6. Conclusion

ABSTRACT

Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful patterns from large spatial datasets. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. The requirements of mining spatial databases are different from those of mining classical relational databases. The spatial data mining techniques are often derived from spatial statistics, spatial analysis, machine learning, and databases, and are customized to analyze massive data sets. In this report some of the spatial data mining techniques are discussed, along with some real-world applications.

Chapter 1 Introduction

The explosive growth of spatial data and widespread use of spatial databases emphasize the need for the automated discovery of spatial knowledge. Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful patterns from spatial databases. The complexity of spatial data and intrinsic spatial relationships limits the usefulness of conventional data mining techniques for extracting spatial patterns. Efficient tools for extracting information from geo-spatial data are crucial to organizations which make decisions based on large spatial datasets, including NASA, the National Imagery and Mapping Agency (NIMA), the National Cancer Institute (NCI), and the United States Department of Transportation (USDOT). These organizations are spread across many application domains including ecology and environmental management, public safety, transportation, Earth science, epidemiology, and climatology[2].

General purpose data mining tools, such as Clementine and Enterprise Miner, are designed to analyze large commercial databases. Although these tools were primarily designed to identify customer-buying patterns in market basket data, they have also been used in analyzing scientific and engineering data, astronomical data, multi-media data, genomic data, and web data. Extracting interesting and useful patterns from spatial data sets is more difficult than extracting corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. Specific features of geographical data that preclude the use of general purpose data mining algorithms are: i) rich data types (e.g., extended spatial objects), ii) implicit spatial relationships among the variables, iii) observations that are not independent, and iv) spatial autocorrelation among the features.

Spatial data mining can be used to understand spatial data, discover the relation between spatial and non-spatial data, set up the spatial knowledge base, optimize queries, reorganize the spatial database, and obtain concise overall characteristics. The system structure of spatial data mining can mostly be divided into three layers, as Fig 1 shows[1]. The customer interface layer is mainly used for input and output. The miner layer is mainly used to manage data, select algorithms and store the mined knowledge. The data source layer, which mainly includes the spatial database (camalig) and other related data and knowledge bases, holds the original data of the spatial data mining.

Fig 1: The systematic structure of spatial data mining.

The data inputs of spatial data mining are more complex than the inputs of classical data mining because they include extended objects such as points, lines, and polygons. The data inputs of spatial data mining have two distinct types of attributes: non-spatial attributes and spatial attributes. Non-spatial attributes are used to characterize non-spatial features of objects, such as name, population, and unemployment rate for a city. They are the same as the attributes used in the data inputs of classical data mining. Spatial attributes are used to define the spatial location and extent of spatial objects. The spatial attributes of a spatial object most often include information related to spatial locations, e.g., longitude, latitude and elevation, as well as shape.

Relationships among non-spatial objects are explicit in data inputs, e.g., arithmetic relation, ordering, is instance of, subclass of, and membership of. In contrast, relationships among spatial objects are often implicit, such as overlap, intersect, and behind. One possible way to deal with implicit spatial relationships is to materialize the relationships into traditional data input columns and then apply classical data mining techniques. However, the materialization can result in loss of information. Another way to capture implicit spatial relationships is to develop models or techniques to incorporate spatial information into the spatial data mining process.

One of the fundamental assumptions of statistical analysis is that the data samples are independently generated, like successive tosses of a coin or the rolling of a die. However, in the analysis of spatial data, the assumption about the independence of samples is generally false. In fact, spatial data tends to be highly self correlated. For example, people with similar characteristics, occupation and background tend to cluster together in the same neighborhoods. The economies of a region tend to be similar. Changes in natural resources, wildlife, and temperature vary gradually over space. The property of like things to cluster in space is so fundamental that geographers have elevated it to the status of the first law of geography: “Everything is related to everything else but nearby things are more related than distant things”. In spatial statistics, an area within statistics devoted to the analysis of spatial data, this property is called spatial autocorrelation. Knowledge discovery techniques which ignore spatial autocorrelation typically perform poorly in the presence of spatial data.

Often the spatial dependencies arise due to the inherent characteristics of the phenomena under study, but in particular they arise due to the fact that the spatial resolution of imaging sensors is finer than the size of the object being observed. For example, remote sensing satellites have resolutions ranging from 30 meters (e.g., the Enhanced Thematic Mapper of the Landsat 7 satellite of NASA) to one meter (e.g., the IKONOS satellite from SpaceImaging), while the objects under study (e.g., Urban, Forest, Water) are often much larger than 30 meters. As a result, per-pixel-based classifiers, which do not take spatial context into account, often produce classified images with salt and pepper noise. These classifiers also suffer in terms of classification accuracy.

The spatial relationship among locations in a spatial framework is often modeled via a contiguity matrix. A simple contiguity matrix may represent a neighborhood relationship defined using adjacency, Euclidean distance, etc. Example definitions of neighborhood using adjacency include a four-neighborhood and an eight-neighborhood. Given a gridded spatial framework, a four-neighborhood assumes that a pair of locations influence each other if they share an edge. An eight-neighborhood assumes that a pair of locations influence each other if they share either an edge or a vertex.

Fig 2: A Spatial Framework and Its Four-neighborhood Contiguity Matrix.
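The four-neighborhood definition above can be illustrated with a short sketch. It assumes a hypothetical 2 x 2 grid whose cells are numbered row by row, and it row-normalizes the binary matrix so that each row sums to one; the function names are illustrative only and do not come from the report.

```python
def four_neighborhood_matrix(rows, cols):
    """Binary matrix w where w[i][j] = 1 iff grid cells i and j share an edge."""
    n = rows * cols
    w = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            # Only the four edge-sharing neighbors are considered.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i][rr * cols + cc] = 1
    return w


def row_normalize(w):
    """Divide every row by its sum so that each non-empty row sums to 1."""
    out = []
    for row in w:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


# Hypothetical 2 x 2 framework: cell 0 is top-left, cell 3 is bottom-right.
binary = four_neighborhood_matrix(2, 2)
contiguity = row_normalize(binary)
for row in contiguity:
    print(row)
```

Replacing the four offsets with all eight surrounding offsets would give the eight-neighborhood variant.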

Figure 2(a) shows a gridded spatial framework with four locations, A, B, C, and D. A binary matrix representation of a four-neighborhood relationship is shown in Fig 2(b). The row normalized representation of this matrix is called a contiguity matrix, as shown in Fig 2(c).

The prediction of events occurring at particular geographic locations is very important in several application domains. Examples of problems which require location prediction include crime analysis, cellular networking, and natural disasters such as fires, floods, droughts, vegetation diseases, and earthquakes. Two spatial data mining techniques for predicting locations are the Spatial Autoregressive Model (SAR) and Markov Random Fields (MRF).

An Application Domain

We begin by introducing an example to illustrate the different concepts related to location prediction in spatial data mining. We are given data about two wetlands, named Darr and Stubble, on the shores of Lake Erie in Ohio, USA, in order to predict the spatial distribution of a marsh breeding bird, the red-winged blackbird (Agelaius phoeniceus). The data was collected from April to June in two successive years, 1995 and 1996.

Chapter 2 Database primitives

In this section, we introduce a small set of database primitives for spatial data mining. The major difference between mining in relational databases and mining in spatial databases is that attributes of the neighbors of some object of interest may have an influence on the object itself. Therefore, our database primitives are based on the concept of spatial neighborhood relations.

2.1 Neighborhood Relations

The mutual influence between two objects depends on factors such as the topology, the distance or the direction between the objects. There are three basic types of spatial relations: topological, distance and direction relations, which are binary relations, i.e. relations between pairs of objects. Spatial objects may be either points or spatially extended objects such as lines, polygons or polyhedrons. In general, the points p = (p1, p2, . . ., pd) are elements of a d-dimensional Euclidean vector space called Points[1]. Topological relations are those relations which are invariant under topological transformations, i.e. they are preserved if both objects are rotated, translated or scaled simultaneously.

Definition 1: (topological relations) The topological relations between two objects A and B are derived from the nine intersections of the interiors, the boundaries and the complements of A and B with each other. The relations are: A disjoint B, A meets B, A overlaps B, A equals B, A covers B, A covered-by B, A contains B, A inside B.

Definition 2: (distance relations) Distance relations are those relations comparing the distance of two objects with a given constant using one of the arithmetic operators. The distance dist between two objects, i.e. sets of points, can then simply be defined by the minimum distance between their points.

Definition 3: (direction relations) To define direction relations O2 R O1, we distinguish between the source object O1 and the destination object O2 of the direction relation R. There are several possibilities to define direction relations depending on the number of points they consider in the source and the destination object. We define the direction relation of two spatially extended objects using one representative point rep(O1) of the source object O1 and all points of the destination object O2. The representative point of a source object may, e.g., be the center of the object. This representative point is used as the origin of a virtual coordinate system and its quadrants define the directions.

Fig 3: Illustrates some of the topological, distance and direction relations using 2D polygons.

Definition 4: (complex neighborhood relations) If r1 and r2 are neighborhood relations, then r1 ∧ r2 and r1 ∨ r2 are also neighborhood relations, called complex neighborhood relations.

2.2 Neighborhood Graphs and Their Operations

Based on the neighborhood relations, we introduce the concepts of neighborhood graphs and neighborhood paths and some basic operations for their manipulation.

Definition 5: (neighborhood graphs and paths) Let neighbor be a neighborhood relation and DB ⊆ 2^Points be a database of objects.

a) A neighborhood graph G^DB_neighbor = (N, E) is a graph with the set of nodes N, which we identify with the objects o ∈ DB, and the set of edges E ⊆ N × N, where two nodes n1 and n2 ∈ N are connected via some edge of E iff neighbor(n1, n2) holds. Let n denote the cardinality of N and let e denote the cardinality of E. Then f := e / n denotes the average number of edges of a node, i.e. f is called the “fan out” of the graph.

b) A neighborhood path is a sequence of nodes [n1, n2, . . ., nk], where neighbor(n_i, n_{i+1}) holds for all n_i ∈ N, 1 ≤ i < k. The number k of nodes is called the length of the neighborhood path.

c) A neighborhood path [n1, n2, . . ., nk] is valid iff ∀ i ≤ k, j < k: i ≠ j ⇔ n_i ≠ n_j.

2.3 Filter Predicates for Neighborhood Paths

Neighborhood graphs will in general contain many paths which are irrelevant if not “misleading” for spatial data mining algorithms. For finding significant spatial patterns, we have to consider only certain classes of paths which are “leading away” from the starting object in some straightforward sense. The task of spatial trend analysis, i.e. finding patterns of systematic change of some non-spatial attributes in the neighborhood of certain database objects, can be considered as a typical example. Such spatial patterns are most often the effect of some kind of influence of an object on other objects in its neighborhood. Furthermore, this influence typically decreases or increases continuously with increasing or decreasing distance. Detecting such trends would be impossible if we do not restrict the pattern space in a way that paths changing direction in arbitrary ways or containing cycles are eliminated[1].

Fig 4: Filter starlike and filter variable starlike

Definition 6: (filter starlike and filter variable starlike) Let p = [n1, n2, . . ., nk] be a neighborhood path and let rel_i be the exact direction for n_i and n_{i+1}, i.e. n_{i+1} rel_i n_i holds. The predicates starlike and variable-starlike for paths p are defined as follows:

starlike(p) :⇔ (∃ j < k: ∀ i > j: n_{i+1} rel_i n_i ⇔ rel_i ⊆ rel_j), if k > 1; TRUE, if k = 1.

variable-starlike(p) :⇔ (∃ j < k: ∀ i > j: n_{i+1} rel_i n_i ⇔ rel_i ⊆ rel_1), if k > 1; TRUE, if k = 1.
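As a rough illustration of Definition 5, the sketch below builds a neighborhood graph over a few hypothetical point objects, computes the fan out f = e / n, and checks whether a path is valid (no node visited twice). The distance-based relation `close`, the sample coordinates and the exclusion of self-pairs are assumptions made for the example and are not taken from the report.

```python
import math


def build_neighborhood_graph(db, neighbor):
    """Map each object id to the set of other ids it is related to via `neighbor`."""
    return {
        a: {b for b in db if b != a and neighbor(db[a], db[b])}
        for a in db
    }


def fan_out(edges):
    """f = e / n, the average number of (directed) edges per node."""
    n = len(edges)
    e = sum(len(targets) for targets in edges.values())
    return e / n


def is_neighborhood_path(edges, path):
    """Every consecutive pair of nodes must be connected by an edge."""
    return all(b in edges[a] for a, b in zip(path, path[1:]))


def is_valid_path(edges, path):
    """A valid path is a neighborhood path that visits no node twice."""
    return is_neighborhood_path(edges, path) and len(set(path)) == len(path)


# Hypothetical database of point objects and a distance relation (dist <= 1.5).
db = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (5, 5)}
close = lambda p, q: math.dist(p, q) <= 1.5

graph = build_neighborhood_graph(db, close)
print(fan_out(graph))
print(is_valid_path(graph, ["a", "b", "c"]))
print(is_valid_path(graph, ["a", "b", "a"]))
```

A complex relation in the sense of Definition 4 could be passed in the same way, for example a lambda that combines a distance test and a direction test with `and`.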

Chapter 3 Knowledge Discovery

Knowledge discovery in databases (KDD) has been defined as the non-trivial process of discovering valid, novel, potentially useful, and ultimately understandable patterns from data. The process of KDD is interactive and iterative, involving several steps such as the following ones:

• Selection: selecting a subset of all attributes and a subset of all data from which the knowledge should be discovered.
• Data reduction: using dimensionality reduction or transformation techniques to reduce the effective number of attributes to be considered.
• Data mining: the application of appropriate algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.
• Evaluation: interpreting and evaluating the discovered patterns with respect to their usefulness in the given application.

Spatial Database Systems (SDBS) are database systems for the management of spatial data. To find implicit regularities, rules or patterns hidden in large spatial databases, e.g. for geomarketing, traffic control or environmental studies, spatial data mining algorithms are very important. Most existing data mining algorithms run on separate and specially prepared files, but integrating them with a database management system (DBMS) has the following advantages. Redundant storage and potential inconsistencies can be avoided. Furthermore, commercial database systems offer various index structures to support different types of database queries. This functionality can be used without extra implementation effort to speed up the execution of data mining algorithms (which, in general, have to perform many database queries)[4]. Similar to the relational standard language SQL, the use of standard primitives will speed up the development of new data mining algorithms and will also make them more portable.

3.1 Spatial Classification

The task of classification is to assign an object to a class from a given set of classes based on the attribute values of this object. In spatial classification the attribute values of neighboring objects are also considered.

In spatial classification the attribute values of neighbouring objects may also be relevant for the membership of objects and therefore have to be considered as well. The extension to spatial attributes is to consider also the attributes of objects on a neighbourhood path starting from the current object. Thus, we define generalized attributes for a neighbourhood path p = [o1, . . ., ok] as tuples (attribute-name, index), where index is a valid position in p, representing the attribute with name attribute-name of object o_index. The generalized attribute (economic-power, 2), e.g., represents the attribute economic-power of some (direct) neighbour of object o1. Because it is reasonable to assume that the influence of neighbouring objects and their attributes decreases with increasing distance, we can limit the length of the relevant neighbourhood paths by an input parameter max-length. Furthermore, the classification algorithm allows the input of a predicate to focus the search for classification rules on the objects of the database fulfilling this predicate. Figure 5 depicts a sample decision tree and two rules derived from it. Economic power has been chosen as the class attribute and the focus is on all objects of type city.

The classification algorithm works as follows. The relevant attributes are extracted by comparing the attribute values of the target objects with the attribute values of their nearest neighbors. The determination of relevant attributes is based on the concepts of the nearest hit (the nearest neighbor belonging to the same class) and the nearest miss (the nearest neighbor belonging to a different class). In the construction of the decision tree, the neighbors of target objects are not considered individually. Instead, so-called buffers are created around the target objects and the non-spatial attribute values are aggregated over all objects contained in the buffer. For instance, in the case of shopping malls a buffer may represent the area where its customers live or work[4]. The size of the buffer yielding the maximum information gain is chosen and this size is applied to compute the aggregates for all relevant attributes.
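The buffer step described above can be sketched as follows. This is only an illustration under stated assumptions: buffers are taken to be circles of a given radius, the aggregate is an average, the target object itself is left out, and the attribute name `income` and the sample cities are hypothetical. The choice of the buffer size by information gain is not shown.

```python
import math


def buffer_aggregate(target, objects, radius, attribute):
    """Average `attribute` over all other objects within `radius` of `target`.

    Returns None when the buffer contains no other object.
    """
    values = [
        o[attribute]
        for o in objects
        if o is not target and math.dist(o["xy"], target["xy"]) <= radius
    ]
    return sum(values) / len(values) if values else None


# Hypothetical objects with one non-spatial attribute.
cities = [
    {"name": "A", "xy": (0, 0), "income": 10.0},
    {"name": "B", "xy": (1, 0), "income": 20.0},
    {"name": "C", "xy": (0, 2), "income": 30.0},
    {"name": "D", "xy": (10, 10), "income": 90.0},
]

# Aggregates for the same target under several candidate buffer sizes.
for radius in (0.5, 1.5, 2.5):
    print(radius, buffer_aggregate(cities[0], cities, radius, "income"))
```

Each candidate radius yields one aggregated attribute per target object; a decision-tree learner would then compare the candidates.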

Fig 5: Sample decision tree and rules discovered by the classification algorithm

3.2 Spatial Clustering

Spatial clustering is a process of grouping a set of spatial objects into clusters so that objects within a cluster have high similarity in comparison to one another, but are dissimilar to objects in other clusters. For example, clustering is used to determine the “hot spots” in crime analysis and disease tracking. Hot spot analysis is the process of finding unusually dense event clusters across time and space. Many criminal justice agencies are exploring the benefits provided by computer technologies to identify crime hot spots in order to take preventive strategies such as deploying saturation patrols in hot spot areas[6].

Spatial clustering can be applied to group similar spatial objects together; the implicit assumption is that patterns in space tend to be grouped rather than randomly located. However, the statistical significance of spatial clusters should be measured by testing the assumption in the data. The test is critical before proceeding with any serious clustering analyses.

3.2.1 Complete Spatial Randomness, Cluster, and Decluster

In spatial statistics, the standard against which spatial point patterns are often compared is a completely spatially random point process, and departures indicate that the pattern is not distributed randomly in space. Complete spatial randomness (CSR) [Cressie1993] is synonymous with a homogeneous Poisson process. The patterns of the process are independently and uniformly distributed over space, i.e., the patterns are equally likely to occur anywhere and do not interact with each other. However, patterns generated by a non-random process can be either cluster patterns (aggregated patterns) or decluster patterns (uniformly spaced patterns).

Fig 6: Illustration of CSR, Cluster, and Decluster Patterns

Several statistical methods can be applied to quantify deviations of patterns from a complete spatial randomness point pattern. One type of descriptive statistics is based on quadrats (i.e., well defined areas, often rectangular in shape)[6]. Usually quadrats of random location and orientation are used, the patterns in the quadrats are counted, and statistics derived from the counts are computed. Another type of statistics is based on distances between patterns; one such statistic is Ripley’s K-function. After the verification of the statistical significance of the spatial clustering, classical clustering algorithms can be used to discover interesting clusters.

3.2.2 DECODE (DiscovEring Clusters Of Different dEnsities)

Discovering clusters in complex spatial data, in which clusters of different densities are superposed, severely challenges existing data mining methods. For instance, in seismic research, foreshocks (which indicate forthcoming strong earthquakes) or aftershocks (which may help to elucidate the mechanism of major earthquakes) are often interfered with by background earthquakes. In this context, data are presumed to consist of various spatial point processes, in each of which points are distributed at a constant, but different, intensity. When clusters with different densities and noise coexist in a data set, few existing methods can determine the number of processes and precisely estimate the parameters. DECODE addresses this problem. It is a new density-based clustering method to discover clusters of different densities in spatial data[8]. The novelties of DECODE are two-fold: (1) it can identify the number of point processes with little prior knowledge; (2) it can automatically estimate the thresholds for separating point processes and clusters.

Fig 7: Flowchart of the method for discovering clusters of different densities in spatial data

Two strategies have been adopted for finding density-homogeneous clusters in density-based methods:

(1) Grid-based clustering method: map data into a mesh grid and identify dense regions according to the density in cells.
(2) Distance-based clustering method: often based on the mth nearest neighbor distance.

The main advantages of density-based methods are their capability of finding arbitrarily shaped clusters and their high efficiency in dealing with complex data sets, which are characterized by large amounts of data, high dimensionality and multiple densities.
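The quadrat-based statistics mentioned in Section 3.2.1 can be illustrated with the variance-to-mean ratio of quadrat counts. This particular statistic is not named in the report; it is used here as one common choice, and the regular grid of quadrats (rather than randomly placed ones) and the sample points are assumptions. Under CSR the ratio is expected to be close to 1, values well above 1 suggest clustering, and values well below 1 suggest uniform spacing.

```python
def quadrat_counts(points, width, height, nx, ny):
    """Count points in each cell of an nx-by-ny grid over [0, width] x [0, height]."""
    counts = [0] * (nx * ny)
    for x, y in points:
        ix = min(int(x / width * nx), nx - 1)   # clamp points on the far edge
        iy = min(int(y / height * ny), ny - 1)
        counts[iy * nx + ix] += 1
    return counts


def variance_to_mean_ratio(counts):
    """Sample variance of the quadrat counts divided by their mean."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


# Hypothetical patterns in the unit square, 16 points each.
clustered = [(0.1 + 0.01 * i, 0.1) for i in range(16)]
regular = [(0.125 + 0.25 * i, 0.125 + 0.25 * j) for i in range(4) for j in range(4)]

vmr_clustered = variance_to_mean_ratio(quadrat_counts(clustered, 1.0, 1.0, 2, 2))
vmr_regular = variance_to_mean_ratio(quadrat_counts(regular, 1.0, 1.0, 2, 2))
print(vmr_clustered, vmr_regular)
```

A formal test would compare the statistic against its distribution under CSR, which this sketch leaves out.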

DECODE is based upon a reversible jump Markov Chain Monte Carlo (MCMC) strategy and is divided into three steps:

(1) Map each point in the data to its mth nearest distance.
(2) Classification thresholds are determined via a reversible jump MCMC strategy.
(3) Clusters are formed by spatially connecting the points whose mth nearest distances fall into a particular bin defined by the thresholds.

Some aspects of the model

1) Analysis of sensitivity to prior specification

Two strategies can be applied to the selection of β: a. to fix β; b. to update β constantly during the process. Here λmax is the intensity corresponding to min(Xm).

Fig 8: The cumulative occupancy fractions of j at different fb: (a) fb = 2, (b) fb = 10, (c) fb = 50, (d) fb = 200, (e) fb = 1,000, (f) fb = 5,000, (g) fb = 20,000, (h) fb = 250,000, (i) updating β (with δ = 1, α = 1, g = 0.2, h = 10/(max(Xm) − min(Xm)))
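Steps (1) and (3) of DECODE can be sketched in a few lines. The sketch below does not implement the reversible jump MCMC of step (2): the threshold is simply supplied by hand, the spatial connection of points within a bin is omitted, and the sample points are hypothetical.

```python
import bisect
import math


def mth_nearest_distances(points, m):
    """Step (1): map every point to the distance of its m-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        result.append(distances[m - 1])
    return result


def assign_bins(distances, thresholds):
    """Part of step (3): bin index of each distance, given sorted thresholds."""
    return [bisect.bisect_right(thresholds, d) for d in distances]


# Hypothetical data: a dense group (spacing 1) and a sparse group (spacing 10).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (20, 0), (30, 0), (20, 10), (30, 10)]
xm = mth_nearest_distances(points, m=1)

# Assumed threshold; in DECODE it would be estimated by the MCMC step.
bins = assign_bins(xm, thresholds=[5.0])
print(xm)
print(bins)
```

Points sharing a bin belong to the same estimated density level; forming clusters would additionally require connecting neighboring points within each bin.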

From the results, it is observed that updating β tends to produce a model with more point processes, and that fb is the parameter which significantly influences the results when fixing β. As shown above, we find that updating β produces the highest posterior probability. The total complexity of the algorithm is O(T(Mk + (k−1)k!)), where T is the number of sweeps, M is the number of points of the data set, and k is the number of point processes that the algorithm determines. DECODE is a robust and automatic clustering method which needs less prior knowledge of the target data.

3.3 Association rules

The extraction and comprehension of the knowledge implied by the huge amount of spatial data, though highly desirable, pose great challenges to currently available spatial database technologies. Traditional data organization and retrieval tools can only handle the storage and retrieval of explicitly stored data.

A spatial characteristic rule is a general description of a set of spatial-related data. For example, the description of the general weather patterns in a set of geographic regions is a spatial characteristic rule. A spatial discriminant rule is the general description of the contrasting or discriminating features of a class of spatial-related data from other class(es). For example, the comparison of the weather patterns in two geographic regions is a spatial discriminant rule. A spatial association rule is a rule indicating a certain association relationship among a set of spatial and possibly some non-spatial predicates, i.e. it describes the implication of one or a set of features by another set of features in spatial databases. For example, a rule like "most big cities in Canada are close to the Canada-U.S. border" is a spatial association rule.

A spatial association rule is a rule in the form of

P1 ∧ … ∧ Pm → Q1 ∧ … ∧ Qn (c%)

where at least one of the predicates P1, …, Pm, Q1, …, Qn is a spatial predicate, and c% is the confidence of the rule. A rule “P → Q/S” is strong if predicate “P ∧ Q” is large in set S and the confidence of “P → Q/S” is high. A strong rule indicates that the patterns in the rule have relatively frequent occurrences in the database and strong implication relationships.
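The notions of a large predicate and of confidence can be made concrete with a small sketch. It assumes that each object in S is described by the set of predicates it satisfies, that "large" means the support reaches a minimum threshold, and that the predicate names, sample objects and thresholds are hypothetical.

```python
def count_matching(objects, predicates):
    """Number of objects whose fact set contains every predicate."""
    return sum(1 for facts in objects if predicates <= facts)


def support(objects, predicates):
    """Fraction of the objects in S that satisfy all given predicates."""
    return count_matching(objects, predicates) / len(objects)


def confidence(objects, antecedent, consequent):
    """Fraction of objects satisfying P that also satisfy Q."""
    base = count_matching(objects, antecedent)
    both = count_matching(objects, antecedent | consequent)
    return both / base if base else 0.0


def is_strong(objects, antecedent, consequent, min_support, min_confidence):
    """P -> Q is strong if P and Q together are large and the confidence is high."""
    return (
        support(objects, antecedent | consequent) >= min_support
        and confidence(objects, antecedent, consequent) >= min_confidence
    )


# Hypothetical set S: each object is the set of predicates that hold for it.
S = [
    {"big_city", "close_to_border"},
    {"big_city", "close_to_border"},
    {"big_city", "close_to_border"},
    {"big_city"},
    {"close_to_border"},
]
P, Q = {"big_city"}, {"close_to_border"}
print(support(S, P | Q), confidence(S, P, Q))
print(is_strong(S, P, Q, min_support=0.5, min_confidence=0.7))
```

Here `close_to_border` plays the role of the spatial predicate, which in practice would be evaluated by a spatial computation rather than stored as a flag.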

Steps for extracting the association rules:

STEP 1: Task_relevant_DB := extract_task_relevant_objects(SDB, RDB). (Relevant objects are collected into one database.)
STEP 2: Coarse_predicate_DB := coarse_spatial_computation(Task_relevant_DB). (The spatial algorithm is executed at the coarse resolution level.)
STEP 3: Large_Coarse_predicate_DB := filtering_with_minimum_support(Coarse_predicate_DB). (Computes the support for each predicate in Coarse_predicate_DB, and filters out those entries whose support is below the minimum support threshold at the top level.)
STEP 4: Fine_predicate_DB := refined_spatial_computation(Large_Coarse_predicate_DB). (The algorithm is executed at the fine resolution level.)
STEP 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB).

3.4 Outlier Detection

Spatial outliers represent locations which are significantly different from their neighborhoods even though they may not be significantly different from the entire population. A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from those of other spatially referenced objects in its spatial neighborhood. Informally, a spatial outlier is a local instability, or a spatially referenced object whose non-spatial attributes are extreme relative to its neighbors. For example, a new house in an old neighborhood of a growing metropolitan area is a spatial outlier based on the non-spatial attribute house age. Identification of spatial outliers can lead to the discovery of unexpected, interesting and implicit knowledge, such as local instability[3]. So the problem is: given the components of the S-outlier definition, design a computationally efficient algorithm to detect the S-outliers.

Techniques of outlier detection: graphical methods include variogram clouds and Moran scatterplots. The algebraic methods include Scatterplot and Z(S(x)).

Global outlier detection methods ignore the spatial location of each data point and fit the distribution model to the values of the non-spatial attribute. In Fig 9(a), the X-axis is the location of data points in one-dimensional space; the Y-axis is the attribute value for each data point. The outlier detected using the global approach is the data point G, which has an extremely high attribute value 7.9, exceeding the threshold of μ + 2σ = 4.49 + 2 ∗ 1.61 = 7.71, as shown in Figure 9(b). This test assumes a normal distribution for attribute values. On the other hand, S is a spatial outlier whose observed value is significantly different from its neighbors P and Q[3].

Fig 9: A Dataset for Outlier Detection.

A variogram cloud displays data points related by neighborhood relationships. For each pair of locations, the square root of the absolute difference between attribute values at the locations is plotted against the Euclidean distance between the locations. In data sets exhibiting strong spatial dependence, the variance in the attribute differences will increase with increasing distance between locations. Locations that are near to one another, but with large attribute differences, might indicate a spatial outlier, even though the values at both locations may appear to be reasonable when examining the dataset non-spatially.

Fig 10: Variogram cloud
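The variogram cloud described above can be computed directly from the definition. The sketch below assumes one-dimensional locations, keeps only pairs within a chosen maximum distance, and uses hypothetical location names and attribute values rather than the data of Fig 9.

```python
import math
from itertools import combinations


def variogram_cloud(locations, values, max_distance):
    """Return (a, b, distance, sqrt(|value difference|)) for each nearby pair."""
    cloud = []
    for (a, xa), (b, xb) in combinations(locations.items(), 2):
        distance = abs(xa - xb)
        if distance <= max_distance:
            spread = math.sqrt(abs(values[a] - values[b]))
            cloud.append((a, b, distance, spread))
    return cloud


# Hypothetical 1-D data: S differs strongly from its neighbors P and Q.
locations = {"P": 1.0, "S": 2.0, "Q": 3.0, "R": 4.0}
values = {"P": 2.0, "S": 8.0, "Q": 2.5, "R": 3.0}

cloud = variogram_cloud(locations, values, max_distance=1.0)
# Pairs that are close in space but far apart in value come first.
top_pairs = sorted(cloud, key=lambda entry: entry[3], reverse=True)[:2]
print(top_pairs)
```

A location that appears in several of the highest pairs is a candidate spatial outlier, although separating it from its neighbors still needs further inspection.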

This plot shows that two pairs, (P, S) and (Q, S), on the left hand side lie above the main group of pairs, and are possibly related to outliers. The point S may be identified as a spatial outlier since it occurs in both pairs (Q, S) and (P, S). This technique requires non-trivial post-processing of highlighted pairs to separate spatial outliers from their neighbors, particularly when multiple outliers are present or density varies greatly.

Fig 11: Scatter plot

A scatterplot shows the attribute values on the X-axis plotted against the average of the attribute values in the neighborhood on the Y-axis for the given dataset. A least square regression line is used to identify spatial outliers. A scatter sloping upward to the right indicates a positive spatial autocorrelation; a scatter sloping upward to the left indicates a negative spatial autocorrelation. The residual is defined as the vertical distance (Y-axis) between a point P with location (Xp, Yp) and the regression line Y = mX + b, that is, residual ε = Yp − (mXp + b).
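The residual ε = Yp − (mXp + b) can be computed with an ordinary least-squares fit. The sketch below assumes a one-dimensional chain of locations in which the neighbors of a point are the adjacent positions; the attribute values are hypothetical, and no threshold for flagging outliers is applied.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b of Y = mX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def residuals(xs, ys):
    """Vertical distance of each point from the fitted regression line."""
    m, b = fit_line(xs, ys)
    return [y - (m * x + b) for x, y in zip(xs, ys)]


def neighbor_averages(values):
    """Average attribute value of the adjacent positions in a 1-D chain."""
    averages = []
    for i in range(len(values)):
        nearby = [values[j] for j in (i - 1, i + 1) if 0 <= j < len(values)]
        averages.append(sum(nearby) / len(nearby))
    return averages


# Hypothetical attribute values along a line of locations.
values = [1.0, 2.0, 3.0, 4.0, 12.0, 6.0, 7.0, 8.0, 9.0]
eps = residuals(values, neighbor_averages(values))
for x, e in zip(values, eps):
    print(x, round(e, 3))
```

Points with a large absolute residual lie far from the regression line and are candidates for closer inspection; both an extreme point and its immediate neighbors can show up this way.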

Chapter 4 Efficient polygon amalgamation method

The polygon amalgamation operation computes the boundary of the union of a set of polygons. It is a fundamental operation for emerging new applications such as spatial OLAP and spatial data mining. There are two efficient methods for identifying internal polygons without retrieving them from databases[7]; hence, most of the internal polygons will not be retrieved:

1) Adjacency-based method
2) Occupancy-based method

4.1 Adjacency-based approach

Two polygons are adjacent if they have at least one pair of identical line segments. The adjacency table of a set of polygons P is defined as a two-column table ADJACENCY(p, p’), where p and p’ are identifiers of polygons in P. A tuple (p, p’) is in the table ADJACENCY if and only if p and p’ are adjacent. A polygon is on the boundary of S if it is adjacent to some polygons which don’t belong to S[7]. The basic idea is to use the adjacency table to identify boundary polygons, and then remove identical line segments in boundary polygons to get t(S).

Fig 12: An example of shadow ring
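The identification of boundary polygons from the adjacency table can be sketched as a simple set computation. The table below is a hypothetical example with four polygons, and the marker `D` is an assumed stand-in for "outside of P"; the removal of identical line segments is not shown.

```python
def neighbors(adjacency, p):
    """All polygons that share a tuple with p in the (symmetric) adjacency table."""
    found = set()
    for a, b in adjacency:
        if a == p:
            found.add(b)
        elif b == p:
            found.add(a)
    return found


def boundary_polygons(adjacency, s):
    """Polygons of S that are adjacent to at least one polygon outside S."""
    return {p for p in s if any(q not in s for q in neighbors(adjacency, p))}


# Hypothetical table: p1 is surrounded by p2, p3 and p4, which touch the outside D.
ADJACENCY = {
    ("p1", "p2"), ("p1", "p3"), ("p1", "p4"),
    ("p2", "p3"), ("p2", "p4"), ("p3", "p4"),
    ("p2", "D"), ("p3", "D"), ("p4", "D"),
}
S = {"p1", "p2", "p3", "p4"}

boundary = boundary_polygons(ADJACENCY, S)
internal = S - boundary
print(sorted(boundary), sorted(internal))
```

Only the polygons in `boundary` would need to be fetched from the database; those in `internal` can be skipped.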

The adjacency table for the four polygons in Figure 12(a), where P = {p1, p2, p3, p4}, contains (listing each adjacent pair once): {(p1, p2), (p1, p3), (p1, p4), (p2, p3), (p2, p4), (p3, p4), (p2, D), (p3, D), (p4, D)}, where (p, D) means p has at least one line segment adjacent to no P polygon (in this case we say p is adjacent to a dummy polygon, labeled D).

For S ⊆ P, by definition we have δS = { p | p ∈ S, ∃(p, p’) ∈ ADJACENCY, p’ ∉ S }. So δS = {p2, p3, p4}: p2, p3 and p4 are boundary polygons, and p1 will not be involved in the computation. The line segments of p2, p3 and p4 which were originally shared with p1 have therefore lost their counterparts. These line segments cannot be removed, and they form a shadow ring.

To eliminate the shadow ring, identify δS+, the set of sub-boundary polygons, which are internal polygons adjacent to boundary polygons. We extract both δS and δS+ and remove all identical line segments in them. We get the target polygon and possibly a shadow ring, but this time we know the shadow ring is formed by line segments of the sub-boundary polygons δS+. So we then remove the line segments belonging to δS+. The remaining line segments form the boundary polygon.

4.2 Occupancy-based approach

Z-values are a spatial data access method which establishes a relationship between the data space and spatial objects or their approximating bounding rectangles. They are a commonly used spatial indexing mechanism [7]. The method approximates a given object’s shape by recursively decomposing the embedding data space into smaller data spaces known as Peano cells. The occupancy-based approach is built on top of a simple extension of Z-values.

Fig 13: Z-order and object approximation using z-values
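For the adjacency-based method of Section 4.1, the definitions of δS and δS+ translate directly into set operations on the adjacency table. The following Python sketch assumes the table is held in memory as a set of (p, p') pairs stored in both directions; the helper names and the string label "D" for the dummy polygon are illustrative choices, and the example data encodes the four-polygon case discussed above (p1 surrounded by p2, p3, p4).

```python
DUMMY = "D"  # label for the dummy polygon standing for "no polygon of P"


def symmetric(pairs):
    """Store every adjacency in both directions, so (p, q) implies (q, p)."""
    return set(pairs) | {(q, p) for p, q in pairs}


def boundary_polygons(adjacency, s):
    """delta S: polygons of S adjacent to something that is not in S."""
    s = set(s)
    return {p for p, q in adjacency if p in s and q not in s}


def sub_boundary_polygons(adjacency, s):
    """delta S+: internal polygons of S adjacent to a boundary polygon."""
    s = set(s)
    boundary = boundary_polygons(adjacency, s)
    internal = s - boundary
    return {p for p, q in adjacency if p in internal and q in boundary}


# Four-polygon example: p1 is surrounded by p2, p3 and p4.
ADJACENCY = symmetric({
    ("p1", "p2"), ("p1", "p3"), ("p1", "p4"),
    ("p2", "p3"), ("p2", "p4"), ("p3", "p4"),
    ("p2", DUMMY), ("p3", DUMMY), ("p4", DUMMY),
})
S = {"p1", "p2", "p3", "p4"}
print(sorted(boundary_polygons(ADJACENCY, S)))      # boundary polygons of S
print(sorted(sub_boundary_polygons(ADJACENCY, S)))  # sub-boundary polygons of S
```

Only the polygons in these two sets need to be retrieved: identical segments are cancelled among them, and any remaining segment that belongs to a sub-boundary polygon is then discarded as part of the shadow ring.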

One way to assign Z-values is as follows. The z-value of the initial space is 1. The z-values of the four quadrants of a Peano cell whose z-value is z = z1...zn, 1 ≤ zi ≤ 4, 1 ≤ i ≤ n, are z1, z2, z3 and z4 respectively, following the z-order (that is, z with the digit 1, 2, 3 or 4 appended). The accuracy of approximation can be improved by assigning multiple Z-values to a polygon.

The spatial indices using z-values associate objects with Peano cells. That is, each index entry is of the form (z, p), stating that object p overlaps with Peano cell z. There is no information about what percentage of the cell is occupied by p, thus it is not sufficient to determine if a Peano cell is completely occupied by a set of objects. Therefore, we extend the index entry to the form (z, p, α), where α is the occupancy ratio. Let p ∩ z be the polygon produced from clipping polygon p by the Peano cell z; then α = area(p ∩ z) / area(z).

Let C be the set of all Peano cells with which S polygons overlap. If c ∈ C is not completely occupied by S polygons, we call c a boundary cell. An S polygon overlapping with a boundary cell is likely to be a boundary polygon. The occupancy-based approach will not produce shadow rings, because:
1. Any line segment which does not overlap with boundary cells is not part of the target polygon, and thus can be discarded.
2. Any line segment inside a boundary cell either is part of the target polygon, or its counterpart from another polygon must also be inside this boundary cell.
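The extended index entries (z, p, α) make boundary cells easy to find. The Python sketch below is an illustration under stated assumptions: z-values are represented as digit strings, the polygons of P do not overlap one another (so the occupancy ratios of different polygons in one cell can simply be added), and the function names and the tolerance parameter are hypothetical.

```python
from collections import defaultdict


def child_z_values(z):
    """z-values of the four quadrants of cell z: z with 1, 2, 3 or 4 appended."""
    return [z + digit for digit in "1234"]


def boundary_cells(entries, s, tol=1e-9):
    """Cells overlapped by polygons of S but not completely occupied by them.

    entries: iterable of (z, p, alpha) index entries, alpha = area(p ∩ z) / area(z)
    s:       identifiers of the polygons to amalgamate
    """
    s = set(s)
    occupied = defaultdict(float)
    for z, p, alpha in entries:
        if p in s:
            occupied[z] += alpha  # valid because polygons are assumed disjoint
    return {z for z, total in occupied.items() if total < 1.0 - tol}


def candidate_boundary_polygons(entries, s):
    """Polygons of S that overlap at least one boundary cell."""
    s = set(s)
    cells = boundary_cells(entries, s)
    return {p for z, p, _ in entries if p in s and z in cells}


# Cell "11" is fully covered by a and b; cell "12" is shared by a and c.
entries = [("11", "a", 0.5), ("11", "b", 0.5), ("12", "a", 0.25), ("12", "c", 0.75)]
print(child_z_values("1"))
print(boundary_cells(entries, {"a", "b"}))
print(candidate_boundary_polygons(entries, {"a", "b"}))
```

With S = {a, b}, cell "11" is completely occupied and can be ignored, while cell "12" is only a quarter occupied by S polygons and is therefore a boundary cell, which makes polygon a a candidate boundary polygon.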

Chapter 5 Challenges and Applications

5.1 Challenges

A noteworthy trend is the increasing size of data sets in common use, such as records of business transactions, environmental data and census demographics. These data sets often contain millions of records, or even far more. This situation creates new challenges in coping with scale. The challenges and impacts can be classified into three main areas, namely, geographic information in knowledge discovery, geographic knowledge discovery in geographic information science, and geographic knowledge discovery in geographic research [2]. The main difficulties are:
• The spatial data types and structures are complex.
• Spatial processing operations are expensive.
• The volume of spatial data is huge, so retrieval and storage are difficult.

5.2 Application in land use dynamic monitoring

The mass data stored in a spatial database includes spatial topological properties, non-spatial properties, and the changes of objects over time. The main knowledge types that can be discovered in the spatial database are: general geometric knowledge, spatial distribution rules, spatial association rules, spatial clustering rules, spatial characteristic rules, spatial discriminate rules, spatial evolution rules, etc. For land use dynamic monitoring, there are the following applications [4]:

a. Predicting changes in the land. According to the geographic location, geologic circumstances, soil characteristics, flood prevention and control, transportation circumstances, etc. of the land, an analysis using spatial distribution rules, spatial clustering rules, spatial characteristic rules and spatial discriminate rules can reveal the distribution and the future development of the land.

b. Providing decision support for city planning. According to the knowledge mined in the spatial database, spatial data mining makes use of general geometric knowledge, spatial distribution rules, spatial association rules, spatial characteristic rules and spatial evolution rules to obtain many factors about terrain, flood prevention and control, and pollution prevention during city planning, providing a good data environment for city construction.

c. Valid management and analysis of remote sensing monitoring results. Using spatial data mining algorithms and based on knowledge discovery, the system can output various statistical charts, images, query results and analysis results for decision making.

5.3 Application in Visual Data Mining

For data mining of large data sets to be effective, it is also important to include humans in the data exploration process and combine their flexibility, creativity, and general knowledge with the enormous storage capacity and computational power of today’s computers. Visual data mining applies human visual perception to the exploration of large data sets. Presenting data in an interactive, graphical form often fosters new insights, encouraging the formation and validation of new hypotheses to the end of better problem-solving and gaining deeper domain knowledge. Visual data mining often follows a three-step process: overview first, zoom and filter, and then details-on-demand [5].

Some of the key advantages of visual data exploration over automatic data mining techniques alone are that it:
• yields results more quickly, with a higher degree of user satisfaction and confidence in findings;
• is especially useful when little is known about the data and exploration goals are vague, because the analyst guides the search and can shift or adjust goals on the fly;
• can deal with highly non-homogeneous and noisy data;
• can be intuitive and requires less understanding of complex mathematical or statistical algorithms or parameters;
• can provide a qualitative overview of the data, allowing unexpected phenomena to be noticed.

Chapter 6 Conclusion

Spatial data mining extends relational data mining with respect to special features of spatial data, like the mutual influence of neighboring objects by certain factors (topology, distance, direction). It is based on techniques like generalization, clustering and mining association rules. Some algorithms require further expert knowledge that cannot be mined from the data, like concept hierarchies. Spatial data mining is a niche area within data mining for the rapid analysis of spatial data. Its distinguishing characteristics can be neatly summarized by the first law of geography: all things are related, but nearby things are more related than distant things. Spatial data mining is being used in various fields, such as satellite remote sensing and visual data mining. Spatial data can potentially influence major scientific challenges, including the study of global climate change and genomics.

Bibliography

[1] Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, Jörg Sander. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. "Integration of Data Mining with Database Technology", Data Mining and Knowledge Discovery, an International Journal, Kluwer Academic Publishers, 1999.

[2] Shashi Shekhar, Pusheng Zhang, Yan Huang, Ranga Raju Vatsavai. Trends in Spatial Data Mining. Department of Computer Science and Engineering, University of Minnesota, 4-192, 200 Union ST SE, Minneapolis, MN 55455.

[3] Shashi Shekhar, Chang-Tien Lu and Pusheng Zhang. A Unified Approach to Detecting Spatial Outliers. Published in: Journal GeoInformatica, Volume 7 Issue 2, June 2003.

[4] Zhong Yong, Zhang Jixian, Yan Qin. Research on spatial data mining technique applied in land use dynamic monitoring. Proceedings of International Symposium on Spatio-temporal Modeling, Spatial Reasoning, Analysis, Data Mining and Data Fusion, Volume XXXVI-2/W25, 2005.

[5] Daniel A. Keim, Christian Panse, Mike Sips. Pixel Based Visual Mining of Geo-Spatial Data. Published in: Journal IEEE Computer Graphics and Applications, Volume 24 Issue 5, September 2004.

[6] Krzysztof Koperski and Jiawei Han. Discovery of Spatial Association Rules in Geographic Information Databases. Proceeding SSD '95, Proceedings of the 4th International Symposium on Advances in Spatial Databases, Springer-Verlag London, UK ©1995.

[7] Xiaofang Zhou, David Truffet, and Jiawei Han. Efficient Polygon Amalgamation Methods for Spatial OLAP and Spatial Data Mining. Published in: Proceeding SSD '99, Proceedings of the 6th International Symposium on Advances in Spatial Databases, Springer-Verlag London, UK ©1999.

[8] Pei T, Jasra A, Hand DJ, et al. DECODE: A new method for discovering clusters of different densities in spatial data. Springer Science+Business Media, LLC 2008.

[9] Raymond T. Ng, Jiawei Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Published in: Proceedings VLDB '94, Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA ©1994.
