
FAST NEAREST NEIGHBOR SEARCH WITH KEYWORDS

ABSTRACT:

Conventional spatial queries, such as range search and nearest neighbor retrieval, involve
only conditions on objects’ geometric properties. Today, many modern applications call for
novel forms of queries that aim to find objects satisfying both a spatial predicate, and a predicate
on their associated texts. For example, instead of considering all the restaurants, a nearest
neighbor query would instead ask for the restaurant that is the closest among those whose menus
contain “steak, spaghetti, brandy” all at the same time. Currently the best solution to such queries
is based on the IR2-tree, which, as shown in this paper, has a few deficiencies that seriously
impact its efficiency. Motivated by this, we develop a new access method called the spatial
inverted index that extends the conventional inverted index to cope with multidimensional data,
and comes with algorithms that can answer nearest neighbor queries with keywords in real time.
As verified by experiments, the proposed techniques outperform the IR2-tree in query response
time significantly, often by a factor of orders of magnitude.

INTRODUCTION

A spatial database manages multidimensional objects (such as points, rectangles, etc.),


and provides fast access to those objects based on different selection criteria. The importance of
spatial databases is reflected by the convenience of modeling entities of reality in a geometric
manner. For example, locations of restaurants, hotels, hospitals and so on are often represented
as points in a map, while larger extents such as parks, lakes, and landscapes often as a
combination of rectangles. Many functionalities of a spatial database are useful in various ways
in specific contexts. For instance, in a geography information system, range search can be
deployed to find all restaurants in a certain area, while nearest neighbor retrieval can discover the
restaurant closest to a given address.

Today, the widespread use of search engines has made it realistic to write spatial queries
in a brand-new way. Conventionally, queries focus on objects’ geometric properties only, such as
whether a point is in a rectangle, or how close two points are to each other. We have seen
some modern applications that call for the ability to select objects based on both of their
geometric coordinates and their associated texts. For example, it would be fairly useful if a
search engine can be used to find the nearest restaurant that offers “steak, spaghetti, and brandy”
all at the same time. Note that this is not the “globally” nearest restaurant (which would have
been returned by a traditional nearest neighbor query), but the nearest restaurant among only
those providing all the demanded foods and drinks.

There are easy ways to support queries that combine spatial and text features. For
example, for the above query, we could first fetch all the restaurants whose menus contain the set
of keywords {steak, spaghetti, brandy}, and then from the retrieved restaurants, find the nearest
one. Similarly, one could also proceed in the reverse order by targeting the spatial conditions first – browse all
the restaurants in ascending order of their distances to the query point until encountering one
whose menu has all the keywords. The major drawback of these straightforward approaches is
that they will fail to provide real time answers on difficult inputs.

A typical example is that the real nearest neighbor lies quite far away from the query
point, while all the closer neighbors are missing at least one of the query keywords. Spatial
queries with keywords have not been extensively explored. In past years, the community focused
its enthusiasm on keyword search in relational databases; only recently has
attention been diverted to multidimensional data. The best method to date for nearest neighbor
search with keywords is due to Felipe et al.

They nicely integrate two well-known concepts: R-tree, a popular spatial index, and
signature file, an effective method for keyword-based document retrieval. By doing so they
develop a structure called the IR2-tree, which has the strengths of both R-trees and signature
files. Like R-trees, the IR2-tree preserves objects’ spatial proximity, which is the key to solving
spatial queries efficiently. On the other hand, like signature files, the IR2-tree is able to filter a
considerable portion of the objects that do not contain all the query keywords, thus significantly
reducing the number of objects to be examined. The IR2-tree, however, also inherits a drawback
of signature files: false hits. That is, a signature file, due to its conservative nature, may still
direct the search to some objects, even though they do not have all the keywords.

The penalty thus caused is the need to verify an object whose satisfaction of a query
cannot be resolved using only its signature, but requires loading its full text description, which is
expensive due to the resulting random accesses. It is noteworthy that the false hit problem is not
specific only to signature files, but also exists in other methods for approximate set membership
tests with compact storage. Therefore, the problem cannot be remedied by simply replacing the
signature file with any of those methods.

In this paper, we design a variant of inverted index that is optimized for


multidimensional points, and is thus named the spatial inverted index (SI-index). This access
method successfully incorporates point coordinates into a conventional inverted index with small
extra space, owing to a delicate compact storage scheme.

Fig. 1. (a) shows the locations of points and (b) gives their associated texts.

Meanwhile, an SI-index preserves the spatial locality of data points, and comes with an
R-tree built on every inverted list at little space overhead. As a result, it offers two competing
ways for query processing. We can (sequentially) merge multiple lists very
much like merging traditional inverted lists by ids. Alternatively, we can also leverage the R-
trees to browse the points of all relevant lists in ascending order of their distances to the query
point. As demonstrated by experiments, the SI-index significantly outperforms the IR2-tree in
query efficiency, often by a factor of orders of magnitude.

PROBLEM DEFINITIONS

Let P be a set of multidimensional points. As our goal is to combine keyword search with
the existing location-finding services on facilities such as hospitals, restaurants, hotels, etc., we
will focus on dimensionality 2, but our technique can be extended to arbitrary dimensionalities
with no technical obstacle. We will assume that the points in P have integer coordinates, such that
each coordinate ranges in [0, t], where t is a large integer. This is not as restrictive as it may
seem, because even if one would like to insist on real-valued coordinates, the set of different
coordinates representable under a space limit is still finite and enumerable; therefore, we could
as well convert everything to integers with proper scaling. Each point p ∈ P is associated
with a set of words, which is denoted as Wp and termed the document of p. For example, if p
stands for a restaurant, Wp can be its menu, or if p is a hotel, Wp can be the description of its
services and facilities, or if p is a hospital, Wp can be the list of its out-patient specialities.

It is clear that Wp may potentially contain numerous words. Traditional nearest neighbor
search returns the data point closest to a query point. Here, we extend the problem to
include predicates on objects’ texts. Formally, in our context, a nearest neighbor (NN) query
specifies a point q and a set Wq of keywords (we refer to Wq as the document of the query). It
returns the point in Pq that is the nearest to q, where Pq is defined as Pq = {p ∈ P | Wq ⊆ Wp}. (1)
In other words, Pq is the set of objects in P whose documents contain all the keywords in Wq. In
the case where Pq is empty, the query returns nothing.

The problem definition can be generalized to k nearest neighbor (kNN) search, which
finds the k points in Pq closest to q; if Pq has fewer than k points, the entire Pq should be returned.
For example, assume that P consists of 8 points whose locations are as shown in Figure 1a (the
black dots), and their documents are given in Figure 1b. Consider a query point q at the white dot
of Figure 1a with the set of keywords Wq = {c, d}. Nearest neighbor search finds p6, noticing
that all points closer to q than p6 are missing either the query keyword c or d. If k = 2 nearest
neighbors are wanted, p8 is also returned in addition.

The result is still {p6, p8} even if k increases to 3 or higher, because only 2 objects have
the keywords c and d at the same time. We consider that the dataset does not fit in memory,
and needs to be indexed by efficient access methods in order to minimize the number of I/Os in
answering a query.
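To make the boolean semantics above concrete, the following minimal sketch answers a kNN query with keywords by brute force; the coordinates and documents below are illustrative stand-ins for Figure 1, and the helper name knn_with_keywords is ours, not part of the proposed index:

import math

def knn_with_keywords(points, docs, q, Wq, k):
    """Brute-force kNN with keywords: among points whose documents contain
    ALL keywords in Wq, return the (up to) k points closest to q."""
    # Pq: the candidate set defined by the boolean predicate "Wq is a subset of Wp"
    Pq = [pid for pid in points if Wq <= docs[pid]]
    # Rank the candidates by Euclidean distance to q and keep the first k
    Pq.sort(key=lambda pid: math.dist(points[pid], q))
    return Pq[:k]

# Toy data standing in for Figure 1 (coordinates and documents are made up)
points = {'p2': (1, 5), 'p6': (4, 4), 'p8': (7, 6)}
docs   = {'p2': {'b', 'd'}, 'p6': {'c', 'd'}, 'p8': {'c', 'd', 'e'}}
print(knn_with_keywords(points, docs, q=(5, 5), Wq={'c', 'd'}, k=2))
# -> ['p6', 'p8']; p2 lacks keyword c, so it is never a candidate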

The IR2-tree

As mentioned before, the IR2-tree combines the R-tree with signature files. Next, we will
review what a signature file is before explaining the details of IR2-trees. Our discussion assumes
the knowledge of R-trees and the best-first algorithm for NN search, both of which are well-
known techniques in spatial databases. Signature file in general refers to a hashing-based
framework, whose instantiation in our context is superimposed coding (SC), which has been shown to be
more effective than other instantiations. It is designed to perform membership tests: determine
whether a query word w exists in a set W of words. SC is conservative, in the sense that if it says
“no”, then w is definitely not in W. If, on the other hand, SC returns “yes”, the true answer can
be either way, in which case the whole W must be scanned to avoid a false hit.

In this context, SC works in the same way as the classic technique of the bloom filter. In
preprocessing, it builds a bit signature of length l from W by hashing each word in W to a string
of l bits, and then taking the disjunction of all bit strings. To illustrate, denote by h(w) the bit
string of a word w. First, all the l bits of h(w) are initialized to 0. Then, SC repeats the following
m times: randomly choose a bit and set it to 1. Very importantly, randomization must use w as its
seed to ensure that the same w always ends up with an identical h(w). Furthermore, the m
choices are mutually independent, and may even happen to be the same bit.

The concrete values of l and m affect the space cost and false hit probability, as will be
discussed later. Figure 2 gives an example to illustrate the above process, assuming l = 5 and m =
2. For example, in the bit string h(a) of a, the 3rd and 5th (counting from left) bits are set to 1. As
mentioned earlier, the bit signature of a set W of words simply ORs the bit strings of all the
members of W. For instance, the signature of a set {a, b} equals 01101, while that of {b, d}
equals 01111.

Fig. 2. Example of bit string computation with l = 5 and m = 2

Given a query keyword w, SC performs the membership test in W by checking whether
all the 1’s of h(w) appear at the same positions in the signature of W. If not, it is guaranteed that
w cannot belong to W. Otherwise, the test cannot be resolved using only the signature, and a
scan of W follows. A false hit occurs if the scan reveals that W actually does not contain w. For
example, assume that we want to test whether word c is a member of set {a, b} using only the
set’s signature 01101. Since the 4th bit of h(c) = 00011 is 1 but that of 01101 is 0, SC
immediately reports “no”. As another example, consider the membership test of c in {b, d}
whose signature is 01111. This time, SC returns “yes” because 01111 has 1’s at all the bits where
h(c) is set to 1; as a result, a full scan of the set is required to verify that this is a false hit.
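The following is a small sketch of superimposed coding as just described: it builds an l-bit signature with m bits set per word (using the word itself as the random seed) and performs the conservative membership test. The hash scheme is only an illustration and will not reproduce the exact bit strings of Figure 2:

import random

L_BITS, M = 5, 2     # signature length l and bits set per word m, as in Figure 2

def word_signature(word, l=L_BITS, m=M):
    """Hash a word to an l-bit signature with m (not necessarily distinct) bits set.
    Seeding the generator with the word makes the mapping deterministic."""
    rng = random.Random(word)
    sig = 0
    for _ in range(m):
        sig |= 1 << rng.randrange(l)
    return sig

def set_signature(words):
    """Signature of a word set W: the bitwise OR of its members' bit strings."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(set_sig, word):
    """Conservative membership test: 'False' is always correct; 'True' may be a
    false hit, in which case the actual word set must be scanned to verify."""
    ws = word_signature(word)
    return (set_sig & ws) == ws

W = {'a', 'b'}
print(may_contain(set_signature(W), 'a'))   # True (a genuine member)
print(may_contain(set_signature(W), 'c'))   # a True here would be a false hit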

The IR2-tree is an R-tree where each (leaf or nonleaf) entry E is augmented with a
signature that summarizes the union of the texts of the objects in the subtree of E. Figure 3
demonstrates an example based on the dataset of Figure 1 and the hash values in Figure 2. The
string 01111 in the leaf entry p2, for example, is the signature of Wp2 = {b, d} (which is the
document of p2; see Figure 1b). The string 11111 in the nonleaf entry E3 is the signature of Wp2
∪ Wp6, namely, the set of all words describing p2 and p6. Notice that, in general, the signature
of a nonleaf entry E can be conveniently obtained simply as the disjunction of all the signatures
in the child node of E.

A nonleaf signature may allow a query algorithm to realize that a certain word cannot
exist in the subtree. For example, as the 2nd bit of h(b) is 1, we know that no object in the
subtrees of E4 and E6 can have word b in its texts – notice that the signatures of E4 and E6 have
0 as their 2nd bits. In general, the signatures in an IR2-tree may have different lengths at various
levels. On conventional R-trees, the best-first algorithm is a well-known solution to NN search.
It is straightforward to adapt it to IR2-trees. Specifically, given a query point q and a keyword set
Wq, the adapted algorithm accesses the entries of an IR2-tree in ascending order of the distances
of their MBRs to q (the MBR of a leaf entry is just the point itself), pruning those entries whose signatures indicate that at least one query keyword is missing from their subtrees.

Fig. 3. Example of an IR2-tree. (a) shows the MBRs of the underlying R-tree and (b) gives the
signatures of the entries.

Fig. 4. Example of an inverted index

DRAWBACKS OF THE IR2-TREE

The IR2-tree is the first access method for answering NN queries with keywords. As with
many pioneering solutions, the IR2-tree also has a few drawbacks that affect its efficiency. The
most serious one of all is that the number of false hits can be very large when the object of the
final result is far away from the query point, or when the result is simply empty. In these cases, the
query algorithm would need to load the documents of many objects, incurring expensive
overhead as each loading necessitates a random access.

To explain the details, we need to first discuss some properties of SC (the variant of
signature file used in the IR2-tree). Recall that, at first glance, SC has two parameters: the length
l of a signature, and the number m of bits chosen to set to 1 in hashing a word. There is, in fact,
really just a single parameter l, because the optimal m (which minimizes the probability of a
false hit) has been solved by Stiassny as mopt = l · ln(2)/g, (2) where g is the number of distinct
words in the set W on which the signature is being created. Even with such an optimal choice of
m, Faloutsos and Christodoulakis show that the false hit probability equals Pfalse = (1/2)^mopt.
(3) Put in a different way, given any word w that does not belong to W, SC will still report “yes”
with probability Pfalse, and demand a full scan of W.

It is easy to see that Pfalse can be made smaller by adopting a larger l (note that g is fixed
as it is decided by W). In particular, asymptotically speaking, to make sure Pfalse is at most a
constant, l must be Ω(g), i.e., the signature should have Ω(1) bits for every distinct word of W.
Indeed, for the IR2-tree, Felipe et al. adopt a value of l that is approximately equivalent to 4g in
their experiments (g here is the average number of distinct words a data point has in its text
description). It thus follows that Pfalse = (1/2)^(4 ln 2) ≈ 0.15. (4) The above result takes a heavy
toll on the efficiency of the IR2-tree.
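The arithmetic behind Equations 2 to 4 can be checked directly, as in this small sketch that simply plugs l = 4g into the formulas:

import math

def false_hit_probability(l_over_g):
    """Pfalse = (1/2)^mopt with the optimal mopt = l * ln(2) / g, so only the
    ratio l/g matters."""
    m_opt = l_over_g * math.log(2)
    return 0.5 ** m_opt

print(false_hit_probability(4))   # ~0.146, i.e., roughly 15% as in Equation 4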

For simplicity, let us first assume that the query keyword set Wq has only a single
keyword w (i.e., |Wq | = 1). Without loss of generality, let p be the object of the query result, and
S be the set of data points that are closer to the query point q than p. In other words, none of the
points in S has w in their text documents (otherwise, p cannot have been the final result). By
Equation 4, roughly 15% of the points in S cannot be pruned using their signatures, and thus, will
become false hits. This also means that the NN algorithm is expected to perform at least 0.15|S|
random I/Os. So far we have considered |Wq| = 1, but the discussion extends to arbitrary |Wq | in
a straightforward manner. It is easy to observe (based on Equation 4) that, in general, the false hit
probability satisfies Pfalse _ 0.15|Wq|. (5) When |Wq| > 1, there is another negative fact that adds
to the deficiency of the IR2-tree: for a greater |Wq|, the expected size of S increases dramatically,
because fewer and fewer objects will contain all the query keywords.

The effect is so severe that the number of random accesses, given by Pfalse|S|, may
escalate as |Wq| grows (even with the decrease of Pfalse). In fact, as long as |Wq| > 1, S can
easily be the entire dataset when the user tries out an uncommon combination of keywords that
does not exist in any object. In this case, the number of random I/Os would be so prohibitive that
the IR2-tree would not be able to give real time responses.

MERGING AND DISTANCE BROWSING

Since verification is the performance bottleneck, we should try to avoid it. There is a
simple way to do so in an I-index: one only needs to store the coordinates of each point together
with each of its appearances in the inverted lists. The presence of coordinates in the inverted lists
naturally motivates the creation of an R-tree on each list indexing the points therein (a structure
reminiscent of ones proposed in earlier work). Next, we discuss how to perform keyword-based nearest neighbor
search with such a combined structure.
The R-trees allow us to remedy an awkwardness in the way NN queries are processed
with an I-index. Recall that, to answer a query, currently we have to first get all the points
carrying all the query words in Wq by merging several lists (one for each word in Wq). This
appears to be unreasonable if the point, say p, of the final result lies fairly close to the query
point q. It would be great if we could discover p very soon in all the relevant lists so that the
algorithm can terminate right away. This would become a reality if we could browse the lists
synchronously by distances as opposed to by ids.

In particular, as long as we could access the points of all lists in ascending order of their
distances to q (breaking ties by ids), such a p would be easily discovered as its copies in all the
lists would definitely emerge consecutively in our access order. So all we have to do is to keep
counting how many copies of the same point have popped up continuously, and terminate by
reporting the point once the count reaches |Wq|. At any moment, it is enough to remember only
one count, because whenever a new point emerges, it is safe to forget about the previous one.
As an example, assume that we want to perform NN search whose query point q is as shown in
Figure 1, and whose Wq equals {c, d}. Hence, we will be using the lists of words c and d in
Figure 4. Instead of expanding these lists by ids, the new access order is by distance to q,
namely, p2, p3, p6, p6, p5, p8, p8.

The processing finishes as soon as the second p6 comes out, without reading the
remaining points. Apparently, if k nearest neighbors are wanted, termination happens after
having reported k points in the same fashion. Distance browsing is easy with R-trees. In fact, the
best-first algorithm is exactly designed to output data points in ascending order of their distances
to q. However, we must coordinate the execution of best-first on |Wq| R-trees to obtain a global
access order. This can be easily achieved by, for example, at each step taking a “peek” at the
next point to be returned from each tree, and outputting the one that should come next globally. This
algorithm is expected to work well if the query keyword set Wq is small.
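The following sketch illustrates this coordinated distance browsing under the assumption that each per-keyword stream already yields its points in ascending distance order (each stream standing in for the best-first traversal of one list's R-tree); the coordinates and list contents are hypothetical, since Figures 1 and 4 are not reproduced here:

import heapq, math

def distance_browse(lists, coords, q, k=1):
    """Browse several per-keyword streams in ascending distance to q and report a
    point as soon as its copies have appeared in all of them consecutively.
    Each sorted stream stands in for the best-first traversal of one list's R-tree."""
    streams = [iter(sorted(lst, key=lambda p: (math.dist(coords[p], q), p)))
               for lst in lists]
    heap = []                                     # entries: (distance, point id, stream index)
    for i, it in enumerate(streams):
        p = next(it, None)
        if p is not None:
            heapq.heappush(heap, (math.dist(coords[p], q), p, i))
    result, current, count = [], None, 0
    while heap and len(result) < k:
        d, p, i = heapq.heappop(heap)
        count = count + 1 if p == current else 1  # consecutive copies of the same point
        current = p
        if count == len(lists):                   # p occurs in every relevant list
            result.append(p)
        nxt = next(streams[i], None)
        if nxt is not None:
            heapq.heappush(heap, (math.dist(coords[nxt], q), nxt, i))
    return result

# Hypothetical coordinates and lists (Figures 1 and 4 are not reproduced here)
coords = {'p2': (4, 6), 'p3': (2, 5), 'p6': (4, 2), 'p5': (8, 5), 'p8': (4, 10)}
list_c = ['p5', 'p6', 'p8']             # inverted list of word c
list_d = ['p2', 'p3', 'p6', 'p8']       # inverted list of word d
print(distance_browse([list_c, list_d], coords, q=(4, 5), k=1))   # -> ['p6']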

For sizable Wq, the large number of random accesses it performs may overwhelm all the
gains over the sequential algorithm with merging. A serious drawback of the R-tree approach is
its space cost. Notice that a point needs to be duplicated once for every word in its text
description, resulting in very expensive space consumption. In the next section, we will
overcome the problem by designing a variant of the inverted index that supports compressed
coordinate embedding.

SOLUTIONS BASED ON INVERTED INDEXES

Inverted indexes (I-index) have proved to be an effective access method for keyword-
based document retrieval. In the spatial context, nothing prevents us from treating the text
description Wp of a point p as a document, and then, building an I-index. Figure 4 illustrates the
index for the dataset of Figure 1. Each word in the vocabulary has an inverted list, enumerating
the ids of the points that have the word in their documents. Note that the list of each word
maintains a sorted order of point ids, which provides considerable convenience in query
processing by allowing an efficient merge step. For example, assume that we want to find the
points that have words c and d. This is essentially to compute the intersection of the two words’
inverted lists.

As both lists are sorted in the same order, we can do so by merging them, whose I/O and
CPU times are both linear to the total length of the lists. Recall that, in NN processing with IR2-
tree, a point retrieved from the index must be verified (i.e., having its text description loaded and
checked). Verification is also necessary with I-index, but for exactly the opposite reason. For
IR2-tree, verification is because we do not have the detailed texts of a point, while for I-index, it
is because we do not have the coordinates. Specifically, given an NN query q with keyword set
Wq, the query algorithm of I-index first retrieves (by merging) the set Pq of all points that have
all the keywords of Wq, and then, performs |Pq| random I/Os to get the coordinates of each point
in Pq in order to evaluate its distance to q.
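As a minimal illustration of the merge step, the sketch below intersects two id-sorted inverted lists in time linear to their total length; the list contents are hypothetical:

def intersect_sorted(list_a, list_b):
    """Merge two inverted lists sorted by point id and return their intersection.
    The cost is linear in the total length of the two lists."""
    i, j, out = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical id-sorted lists for words c and d (Figure 4 is not reproduced here)
print(intersect_sorted([5, 6, 8], [2, 3, 6, 8]))   # -> [6, 8]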

According to the experiments, when Wq has only a single word, the performance of I-
index is very bad, which is expected because everything in the inverted list of that word must be
verified. Interestingly, as the size of Wq increases, the performance gap between I-index and IR2-
tree keeps narrowing such that I-index even starts to outperform IR2-tree at |Wq| = 4. This is not
as surprising as it may seem. As |Wq| grows large, not many objects need to be verified because
the number of objects carrying all the query keywords drops rapidly. On the other hand, at this
point an advantage of I-index starts to pay off. That is, scanning an inverted list is relatively cheap
because it involves only sequential I/Os, as opposed to the random nature of accessing the
nodes of an IR2-tree.

SPATIAL INVERTED LIST

The spatial inverted list (SI-index) is essentially a compressed version of an I-index with
embedded coordinates as described in Section 5. Query processing with an SI-index can be done
either by merging, or together with R-trees in a distance browsing manner. Furthermore, the
compression eliminates the defect of a conventional I-index such that an SI-index consumes much
less space.

THE COMPRESSION SCHEME

Compression is already widely used to reduce the size of an inverted index in the
conventional context where each inverted list contains only ids. In that case, an effective
approach is to record the gaps between consecutive ids, as opposed to the precise ids. For
example, given a set S of integers {2, 3, 6, 8}, the gap-keeping approach will store {2, 1, 3, 2}
instead, where the i-th value (i ≥ 2) is the difference between the i-th and (i − 1)-th values in the
original S. As the original S can be precisely reconstructed, no information is lost. The only
overhead is that decompression incurs extra computation cost, but such cost is negligible
compared to the overhead of I/Os.
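A minimal sketch of gap-keeping, reproducing the {2, 3, 6, 8} example above:

def gap_encode(values):
    """Keep the first value exactly; store every later value as its gap from the previous one."""
    return [values[0]] + [values[i] - values[i - 1] for i in range(1, len(values))]

def gap_decode(gaps):
    """Rebuild the original non-descending values by accumulating the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(gap_encode([2, 3, 6, 8]))   # -> [2, 1, 3, 2]
print(gap_decode([2, 1, 3, 2]))   # -> [2, 3, 6, 8]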

Note that gap-keeping on the Z-values (the positions of the points on a space filling curve, here
the Z-curve) is effective only if each list is sorted by those Z-values, whereas, within the index,
the ids serve no purpose other than providing a base for merging. Therefore, nothing prevents us
from using a pseudo-id internally. Specifically, let us forget about the “real” ids, and instead,
assign to each point a pseudo-id that equals its sequence number in the ordering of Z-values. For
example, p6 gets a pseudo-id 0, p2 gets a 1, and so on.
Obviously, these pseudo-ids can co-exist with the “real” ids, which can still be kept along with
objects’ details. The benefit we get from pseudo-ids is that sorting them gives the same ordering
as sorting the Z-values of the points. This means that gap-keeping will work at the same time on
both the pseudo-ids and Z-values. As an example that gives the full picture, consider the inverted
list of word d in Figure 4 that contains p2, p3, p6, p8, whose Z-values are 15, 52, 12, 23
respectively, with pseudo-ids being 1, 6, 0, 2 respectively.

Sorting the Z-values automatically also puts the pseudo-ids in ascending order. With gap-
keeping, the Z-values are recorded as 12, 3, 8, 29 and the pseudo-ids as 0, 1, 1, 4. So we can
precisely capture the 4 points with 4 pairs: {(0, 12), (1, 3), (1, 8), (4, 29)}. Since a space filling
curve applies to any dimensionality, it is straightforward to extend our compression scheme to any
dimensional space. As a remark, we are aware that the ideas of space filling curves and internal
ids have also been mentioned in earlier work (but not for the purpose of compression). In what
follows, we will analyze the space of the SI-index and discuss how to build a good R-tree on each
inverted list. None of these issues has been addressed previously.
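The full picture can be illustrated with a short sketch that applies gap-keeping to the (pseudo-id, Z-value) pairs of word d's list; the pairs are taken from the example above, and the helper name encode_list is ours:

def encode_list(entries):
    """entries: (pseudo-id, Z-value) pairs of one inverted list. Sorting by Z-value
    also sorts the pseudo-ids, so gap-keeping is applied to both components at once."""
    entries = sorted(entries, key=lambda e: e[1])
    out, prev_id, prev_z = [], 0, 0
    for pid, z in entries:
        out.append((pid - prev_id, z - prev_z))
        prev_id, prev_z = pid, z
    return out

# Word d's list: p2, p3, p6, p8 with Z-values 15, 52, 12, 23 and pseudo-ids 1, 6, 0, 2
print(encode_list([(1, 15), (6, 52), (0, 12), (2, 23)]))
# -> [(0, 12), (1, 3), (1, 8), (4, 29)], the four pairs given in the text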

THEORETICAL ANALYSIS

Next we will argue from a theoretical perspective that our compression scheme has a low
space complexity. As the handling of each inverted list is the same, it suffices to focus on only
one of them, denoted as L. Let us assume that the whole dataset has n ≫ 1 points, and r of them
appear in L. To make our analysis general, we also take the dimensionality d into account. Also,
recall that each coordinate ranges from 0 to t, where t is a large integer. Naively, each pseudo-id
can be represented with O(log n) bits, and each coordinate with O(log t) bits. Therefore, without
any compression, we can represent the whole L in O(r(log n + d log t)) bits.

Now we start to discuss the space needed to compress L with our solution. First of all, we
give a useful fact on gap-keeping in general. LEMMA 1: Let v1, v2, ..., vr be r non-descending
integers in the range from 1 to U, where U ≥ 1. Gap-keeping requires at most O(r log(U/r)) bits to
encode all of them.

BLOCKING

The SI-index described up to now applies gap-keeping to capture all points continuously
in a row. In decompressing, we must scan an inverted list from its beginning even though the
point of our interest lies deep down the list (remember that a point cannot be restored without all
the gaps preceding it being accumulated). This is not a problem for a query algorithm that
performs sequential scan on the list. But in some scenarios (e.g., when we would like to build an
R-tree on the list, as in the next subsection), it is very helpful to restore a point anywhere in the
list much faster than reading from the beginning every time.

The above concern motivates the design of the blocked SI-index, which differs only in
that each list is cut into blocks, each of which holds Θ(B) points, where B is a parameter to be
specified later. For example, given a list of {p1, p2, p3, p4, p5, p6}, we would store it in 2 blocks
{p1, p2, p3} and {p4, p5, p6} if the block size is 3. Gap-keeping is now enforced within each
block separately. For example, in block {p1, p2, p3}, we will store the exact pseudo-id and Z-
value of p1, the gaps of p2 (from p1) in its pseudo-id and Z-value respectively, and similarly, the
gaps of p3 from p2. Apparently, blocking allows us to restore all the points in a block locally, as
long as the starting address of the block is available.
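A small sketch of block-local decoding under this layout; the block contents are hypothetical, and the point details that would accompany each entry are omitted:

def decode_block(block):
    """block: [(pseudo-id, Z-value), (id gap, Z gap), ...] where only the first
    entry is stored exactly. Decoding needs this block alone, not the part of
    the inverted list that precedes it."""
    pid, z = block[0]
    points = [(pid, z)]
    for id_gap, z_gap in block[1:]:
        pid += id_gap
        z += z_gap
        points.append((pid, z))
    return points

# A hypothetical block of three entries: one exact pair followed by two gap pairs
print(decode_block([(0, 12), (1, 3), (1, 8)]))   # -> [(0, 12), (1, 15), (2, 23)]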

It is no longer necessary to always scan from the beginning. Since we need to keep the
exact values of Θ(r/B) points, the space cost increases by an additive factor of Θ((r/B)(log n + d
log t)). This, however, does not affect the overall space complexity in Lemma 2 if we choose B
as a polynomial function of r, i.e., B = r^c for any positive c < 1. In our experiments, the size of B
is roughly √r. Note that the value of B can even vary for different inverted lists (i.e., a block may
occupy a different number of disk pages). Finally, in a blocked SI-index, each inverted list can
also be sequentially accessed from the beginning, as long as we put all its blocks at consecutive
pages.

BUILDING R-TREES

Remember that an SI-index is no more than a compressed version of an ordinary inverted


index with coordinates embedded, and hence, can be queried in the same way as described, i.e.,
by merging several inverted lists. In the sequel, we will explore the option of indexing each
inverted list with an R-tree. As explained, these trees allow us to process a query by distance
browsing, which is efficient when the query keyword set Wq is small. Our goal is to let each
block of an inverted list be directly a leaf node in the R-tree. This is in contrast to the alternative
approach of building an R-tree that shares nothing with the inverted list, which wastes space by
duplicating each point in the inverted list. Furthermore, our goal is to offer two search strategies
simultaneously: merging and distance browsing.

As before, merging demands that points of all lists should be ordered following the same
principle. This is not a problem because our design in the previous subsection has laid down such
a principle: ascending order of Z-values. Moreover, this ordering has a crucial property that
conventional id-based ordering lacks: preservation of spatial proximity. The property makes
it possible to build good R-trees without destroying the Z-value ordering of any list. Specifically,
we can (carefully) group consecutive points of a list into MBRs, and incorporate all MBRs into
an R-tree. The proximity-preserving nature of the Z-curve will ensure that the MBRs are
reasonably small when the dimensionality is low. For example, assume that an inverted list
includes all the points of the dataset, sorted in ascending order of their Z-values.

To build an R-tree, we may cut the list into 4 blocks {p6, p2}, {p8, p4}, {p7, p1}, and
{p3, p5}. Treating each block as a leaf node results in an R-tree over the list. Linking all
blocks from left to right preserves the ascending order of the points’ Z-values. Creating an R-tree
from a space filling curve has been considered by Kamel and Faloutsos. Different from their
work, we will look at the problem in a more rigorous manner, and attempt to obtain the optimal
solution. Formally, the underlying problem is as follows. There is an inverted list L with, say, r
points p1, p2, ..., pr, sorted in ascending order of Z-values. We want to divide L into a number of
disjoint blocks such that (i) the number of points in each block is between B and 2B − 1, where B
is the block size, and (ii) the points of a block must be consecutive in the original ordering of
L.

The goal is to make the resulting MBRs of the blocks as small as possible. How “small”
an MBR is can be quantified in a number of ways. For example, we can take its area, perimeter,
or a function of both. Our solution, presented below, can be applied to any quantifying metric,
but our discussion will use area as an example. The cost of a dividing scheme of L is thus
defined as the sum of the areas of all MBRs. For notational convenience, given any 1 ≤ i ≤ j ≤ r,
we will use C[i, j] to denote the cost of the optimal division of the subsequence pi, pi+1, ..., pj .
The aim of the above problem is thus to find C[1, r]. We also denote by A[i, j] the area of the
MBR enclosing pi, pi+1, ..., pj.

Now we will discuss the properties of C[i, j]. There are j − i + 1 points from pi to pj . So
C[i, j] is undefined if j − i + 1 < B, because we will never create a block with less than B points.
Furthermore, in the case where j − i + 1 ∈ [B, 2B − 1], C[i, j] can be immediately solved as the area
of the MBR enclosing all the j − i + 1 points. Hence, next we will focus on the case j − i + 1 ≥ 2B.
Notice that when we try to divide the set of points {pi, pi+1, ..., pj}, there are at most B − 1 ways
to decide which points should be in the same block together with the first point pi. Specifically, a
block of size B must include, besides pi, also pi+1, pi+2, all the way to pi+B−1. If the block size
goes to B +1, then the additional point will have to be pi+B; similarly, to get a block size of B +
2, we must also put in pi+B+1 and so on, until the block size reaches the maximum 2B − 1.
Regardless of the block size, the remaining points (that are not together with pi) constitute a
smaller set on which the division problem needs to be solved recursively.

The total number of choices may be less than B − 1 because care must be taken to make
sure that the number of those remaining points is at least B. In any case, C[i, j] equals the lowest
cost of all the permissible choices, or formally: C[i, j] = min over k ∈ [i + B − 1, min{i + 2B − 2,
j + 1 − B}] of (A[i, k] + C[k + 1, j]). (7) The equation indicates the existence of solutions based on
dynamic programming. One can easily design an algorithm that runs in O(Br^2) time: it suffices to
derive C[i, j] in ascending order of the value of j − i, namely, starting with those with j − i = 2B − 1,
followed by those with j − i = 2B, and so on until finishing at j − i = r − 1. We can significantly
improve the computation time to O(Br), by the observation that j can be fixed to r throughout the
computation in order to obtain C[1, r] eventually.
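The sketch below is one way to realize such a dynamic program, using MBR area as the cost and enforcing, as described above, that the suffix left after the first block always keeps at least B points; the sample points are hypothetical and assumed to be listed in their Z-value order:

def optimal_blocks(points, B):
    """Divide points (assumed already in Z-value order) into blocks of B to 2B-1
    consecutive points so that the total MBR area is minimized.
    C[i] mirrors C[i, r]: the optimal cost of dividing the suffix points[i:]."""
    r = len(points)

    def area(i, k):
        # Area of the MBR enclosing points[i..k] (inclusive)
        xs = [p[0] for p in points[i:k + 1]]
        ys = [p[1] for p in points[i:k + 1]]
        return (max(xs) - min(xs)) * (max(ys) - min(ys))

    INF = float('inf')
    C = [INF] * (r + 1)
    C[r] = 0                       # sentinel for the empty suffix (never selected below)
    cut = [None] * (r + 1)         # cut[i]: last index of the first block of points[i:]
    for i in range(r - 1, -1, -1):
        size = r - i
        if size < B:
            continue               # undefined: no block may have fewer than B points
        if size <= 2 * B - 1:
            C[i], cut[i] = area(i, r - 1), r - 1
            continue
        # First block covers points[i..k]; the remaining suffix must keep >= B points
        for k in range(i + B - 1, min(i + 2 * B - 2, r - 1 - B) + 1):
            cost = area(i, k) + C[k + 1]
            if cost < C[i]:
                C[i], cut[i] = cost, k
    blocks, i = [], 0
    while i < r:                   # recover the block boundaries from the cuts
        blocks.append((i, cut[i]))
        i = cut[i] + 1
    return C[0], blocks

pts = [(0, 0), (1, 1), (2, 0), (9, 9), (10, 8), (11, 9), (20, 0), (21, 1)]
print(optimal_blocks(pts, B=2))    # -> (5, [(0, 2), (3, 5), (6, 7)])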

We have finished explaining how to build the leaf nodes of an R-tree on an inverted list.
As each leaf is a block, all the leaves can be stored in a blocked SI-index as described in Section
6.1. Building the nonleaf levels is trivial, because they are invisible to the merging-based query
algorithms, and hence, do not need to preserve any common ordering. We are free to apply any
of the existing R-tree construction algorithms. It is noteworthy that the nonleaf levels add only a
small amount to the overall space overhead because, in an R-tree, the number of nonleaf nodes is
by far lower than that of leaf nodes.

APPLICATIONS:

A data mining application is an implementation of data mining technology that solves a
particular business or analysis problem. Example application areas include:

• A pharmaceutical company can analyze its recent sales force activity and its results to
improve targeting of high-value physicians and determine which marketing activities will have
the greatest impact in the next few months. The data needs to include competitor market activity
as well as information about the local health care systems. The results can be distributed to the
sales force via a wide-area network that enables the representatives to review the
recommendations from the perspective of the key attributes in the decision process. The ongoing,
dynamic analysis of the data warehouse allows best practices from throughout the organization to
be applied in specific sales situations.

• A credit card company can leverage its vast warehouse of customer transaction data to
identify the customers most likely to be interested in a new credit product. Using a small test
mailing, the attributes of customers with an affinity for the product can be identified. Recent
projects have indicated more than a twenty-fold decrease in cost for targeted mailing campaigns
over conventional approaches.

• A diversified company with a large direct sales force can apply data mining to identify
the best prospects for its services. Using data mining to analyze its own customer experience, this
company can build a unique segmentation identifying the attributes of high-value prospects.

• A large consumer packaged goods company can apply data mining to improve its sales
process to retailers. Data from consumer panels, shipments, and competitor activity can be
applied to understand the reasons for brand and store switching. Through this analysis, the
manufacturer can select promotional strategies that best reach their target customer segments.

LITERATURE SURVEY

PREVIOUS RESEARCH WORK:

Many applications require finding objects nearest to a given location that contain a
given set of keywords. An increasing variety of applications require the efficient execution of
nearest neighbour (NN) queries constrained by the properties of the spatial objects. Owing to the
popularity of keyword search, notably on the web, many of these applications allow the user
to provide a list of keywords that the spatial objects should contain, in their description
or other attributes. For example, real estate websites allow users to search for
properties with specific keywords in their description and rank them according to their distance
from a given location. We call such queries spatial keyword queries. A spatial
keyword query consists of a query location and a set of keywords.

The answer is a list of objects ranked according to a combination of their distance to
the query location and the relevance of their text description to the query keywords. A simple
yet popular variant is the distance-first spatial keyword query,
where objects are ranked by distance and keywords are applied
as a conjunctive filter to eliminate objects that do not contain them. Unfortunately, there is no
efficient support for top-k spatial keyword queries, where a prefix of the result list is
required. Instead, current systems use ad-hoc combinations of nearest neighbour (NN) and keyword
search techniques to tackle the problem. For example, an R-Tree is used to
find the nearest neighbours, and for every neighbour an inverted index is
used to check whether the query keywords are contained.

The efficient method to answer top-k spatial keyword queries is based on
the combination of data structures and algorithms used in spatial information search
and Information Retrieval (IR). In particular, the method consists of building an Information
Retrieval R-Tree (IR2-Tree), which is a structure based on the R-Tree. At query time an
incremental algorithm is employed that uses the IR2-Tree to efficiently produce the top results of
the query. The IR2-Tree is an R-Tree where a signature is added to every node v of
the IR2-Tree to summarize the textual content of all spatial objects in the subtree rooted
at v. The top-k spatial keyword search algorithm, which is inspired by the work of Hjaltason and
Samet, exploits this information to locate the top query results by accessing only a small portion of the
IR2-Tree. This work has the following contributions:

• The problem of top-k spatial keyword search is defined.

• The IR2-Tree is presented as an efficient indexing structure to store spatial and textual
data for a set of objects. Efficient algorithms are also presented to maintain the IR2-
Tree, that is, to insert and delete objects.

• An efficient incremental algorithm is presented to answer top-k spatial keyword
queries using the IR2-Tree.

The IR2-Tree is a combination of an R-Tree and signature files. Specifically, every
node of an IR2-Tree contains both spatial and keyword information; the former in the
form of a minimum bounding rectangle and the latter in the form of a signature. An
IR2-Tree facilitates both top-k spatial queries and top-k spatial keyword queries. It combines the R-tree,
a popular spatial index, with the signature file, an effective technique for keyword-based document
retrieval. The resulting IR2-Tree (Information Retrieval R-Tree) structure has the strengths of
both R-trees and signature files. Like R-trees, the IR2-Tree preserves objects’ spatial proximity,
which is the key to solving spatial queries efficiently. Like signature files, the IR2-Tree is
able to filter a considerable portion of the objects that do not contain all the query
keywords, thus significantly reducing the number of objects to be examined.

Our treatment of nearest neighbor search falls in the general topic of spatial keyword
search, which has also given rise to several alternative problems. A complete survey of all those
problems goes beyond the scope of this paper. Below we mention several representatives, but
interested readers can refer to the literature for a nice survey. Cong et al. considered a form of keyword-based
nearest neighbor queries that is similar to our formulation, but differs in how objects’ texts play a
role in determining the query result. Specifically, aiming at an IR flavor, the approach computes
the relevance between the documents of an object p and a query q. This relevance score is then
integrated with the Euclidean distance between p and q to calculate an overall similarity of p to
q.

The few objects with the highest similarity are returned. In this way, an object may still
be in the query result, even though its document does not contain all the query keywords. In our
method, in contrast, object texts are utilized in evaluating a boolean predicate, i.e., if any query
keyword is missing in an object’s document, it must not be returned. Neither approach subsumes
the other, and both make sense in different applications. As an application in our favor, consider
the scenario where we want to find a close restaurant serving “steak, spaghetti and brandy”, and
do not accept any restaurant that fails to serve even one of these three items. In this case, a restaurant’s
document either fully satisfies our requirement, or does not satisfy it at all. There is no “partial
satisfaction”, which is precisely the rationale behind the relevance-based approach discussed above.

In geographic web search, each webpage is assigned a geographic region that is pertinent
to the webpage’s contents. In web search, such regions are taken into account so that higher
rankings are given to the pages in the same area as the location of the computer issuing the query
(as can be inferred from the computer’s IP address). The underpinning problem that needs to be
solved is different from keyword-based nearest neighbor search, but can be regarded as the
combination of keyword search and range queries.

Zhang et al. dealt with the so-called m-closest keywords problem. Specifically, let P be
a set of points each of which carries a single keyword. Given a set Wq of query keywords (note:
no query point q is needed), the goal is to find m = |Wq| points from P such that (i) each point
has a distinct keyword in Wq, and (ii) the maximum mutual distance of these points is minimized
(among all subsets of m points in P fulfilling the previous condition). In other words, the
problem has a “collaborative” nature in that the resulting m points should cover the query
keywords together. This is fundamentally different from our work where there is no sense of
collaboration at all, and instead the quality of each individual point with respect to a query can be
quantified into a concrete value.

Cao et al. proposed collective spatial keyword querying, which is based on similar ideas,
but aims at optimizing different objective functions. Cong et al. proposed the concept of prestige-
based spatial keyword search. The central idea is to evaluate the similarity of an object p to a
query by taking also into account the objects in the neighborhood of p.

Lu et al. recently combined the notion of keyword search with reverse nearest neighbor
queries. Although keyword search has only started to receive attention in spatial databases, it is
already thoroughly studied in relational databases, where the objective is to enable a querying
interface that is similar to that of search engines, and can be easily used by naive users without
knowledge of SQL. Well known systems with such mechanisms include DBXplorer, Discover,
Banks, and so on. Interested readers may refer to that literature for additional references.

SYSTEM ANALYSIS

FEASIBILITY STUDY

Feasibility study is a process which defines exactly what a project is and what strategic issues
need to be considered to assess its feasibility, or likelihood of succeeding. Feasibility studies are
useful both when starting a new business, and identifying a new opportunity for an existing
business. Ideally, the feasibility study process involves making rational decisions about a number
of enduring characteristics of a project, including:

 Technical feasibility - do we have the technology? If not, can we get it?
 Operational feasibility- do we have the resources to build the system? Will the system be
acceptable? Will people use it?
 Economic feasibility - are the benefits greater than the costs?

TECHNICAL FEASIBILITY

Technical feasibility is concerned with the existing computer system (Hardware,


Software etc.) and to what extent it can support the proposed addition. For example, if particular
software will work only on a computer with a higher configuration, additional hardware is
required. This involves financial considerations and if the budget is a serious constraint, then the
proposal will be considered not feasible.

OPERATIONAL FEASIBILITY

Operational feasibility is a measure of how well a proposed system solves the problems,
and takes advantage of the opportunities identified during scope definition, and how it satisfies
the requirements identified in the requirements analysis phase of system development.

ECONOMIC FEASIBILITY

Economic analysis is the most frequently used method for evaluating the effectiveness of
a candidate system. More commonly known as cost/ benefit analysis, the procedure is to
determine the benefits and savings that are expected from a candidate system and compare them
with costs. If benefits outweigh costs, then the decision is made to design and implement the
system.

EXISTING SYSTEM:

 Existing works mainly focus on finding top-k nearest neighbours, where each node has
to match the whole set of query keywords. Spatial queries with keywords have not been
extensively explored.
 They do not consider the density of data objects in the spatial space. Also, these methods
are inefficient for incremental queries.
 The best method to date for nearest neighbor search with keywords is due to Felipe et al.
They nicely integrate two well-known concepts: R-tree, a popular spatial index, and
signature file, an effective method for keyword-based document retrieval.
 It develops a structure called the IR2-tree, which has the strengths of both R-trees and
signature files.
 Like R-trees, the IR2-tree preserves objects’ spatial proximity, which is the key to
solving spatial queries efficiently.
 Like signature files, the IR2-tree is able to filter a considerable portion of the objects that
do not contain all the query keywords, thus significantly reducing the number of objects
to be examined.

DISADVANTAGES OF EXISTING SYSTEM:

1. Fails to provide real-time answers on difficult inputs

2. The real nearest neighbor lies quite far away from the query point, while all the closer
neighbors are missing at least one of the query keywords

PROPOSED SYSTEM:

 Many functionalities of a spatial database are useful in various ways in specific contexts.
 For instance, in a geography information system, range search can be deployed to find all
restaurants in a certain area, while nearest neighbor retrieval can discover the restaurant closest
to a given address.
 A spatial database manages multidimensional objects and provides fast access to those
objects based on different selection criteria.
 The importance of spatial databases is reflected by the convenience of modeling entities
of reality in a geometric manner.
 For example, locations of restaurants, hotels, hospitals and so on are often represented as
points in a map, while larger extents such as parks, lakes, and landscapes often as a combination
of rectangles.
 In this project, we design a variant of inverted index that is optimized for
multidimensional points, and is thus named the spatial inverted index (SI-index).
 This access method successfully incorporates point coordinates into a conventional
inverted index with small extra space, owing to a delicate compact storage scheme.
 SI-index preserves the spatial locality of data points, and comes with an R-tree built on
every inverted list at little space overhead. As a result, it offers two competing ways for query
processing.
 We can merge multiple lists very much like merging traditional inverted lists by ids.
Alternatively, we can also leverage the R-trees to browse the points of all relevant lists in
ascending order of their distances to the query point.
 The SI-index significantly outperforms the IR2-tree in query efficiency, often by
orders of magnitude.

ADVANTAGES OF PROPOSED SYSTEM:

 Distance browsing is easy with R-trees. In fact, the best-first algorithm is exactly
designed to output data points in ascending order of their distances

 It is straightforward to extend our compression scheme to any dimensional space

System Configuration:-

H/W System Configuration:-

Processor - Dual core


Speed - 1.1 GHz

RAM - 1 GB (min)

Hard Disk - 80 GB

S/W System Configuration:-

Operating System : Windows95/98/2000/XP

Application Server : Tomcat5.0/6.X

Front End : ASP.NET

Database : SQL Server 2008

SOFTWARE DESCRIPTION

ABOUT THE SOFTWARE

FEATURES OF VISUAL BASIC .NET

THE .NET FRAMEWORK

The .NET Framework is a new computing platform that simplifies application


development in the highly distributed environment of the Internet.

OBJECTIVES OF .NET FRAMEWORK:

1. To provide a consistent object-oriented programming environment whether object code is
stored and executed locally, executed locally but Internet-distributed, or executed remotely.

2. To provide a code-execution environment that minimizes software deployment conflicts and
guarantees safe execution of code.

3. To eliminate the performance problems of scripted or interpreted environments.

The .NET Framework supports different types of applications, such as Windows-based
applications and Web-based applications, and builds all communication on industry standards to
ensure that code based on the .NET Framework can integrate with any other code.

COMPONENTS OF .NET FRAMEWORK

1. THE COMMON LANGUAGE RUNTIME (CLR):

The common language runtime is the foundation of the .NET Framework. It manages
code at execution time, providing important services such as memory management, thread
management, and remoting, and it also ensures security and robustness. The concept of code
management is a fundamental principle of the runtime. Code that targets the runtime is known as
managed code, while code that does not target the runtime is known as unmanaged code.

THE .NET FRAME WORK CLASS LIBRARY:

It is a comprehensive, object-oriented collection of reusable types used to develop


applications ranging from traditional command-line or graphical user interface (GUI)
applications to applications based on the latest innovations provided by ASP.NET, such as Web
Forms and XML Web services.

The .NET Framework can be hosted by unmanaged components that load the common
language runtime into their processes and initiate the execution of managed code, thereby
creating a software environment that can exploit both managed and unmanaged features.

The .NET Framework not only provides several runtime hosts, but also supports the
development of third-party runtime hosts.

Internet Explorer is an example of an unmanaged application that hosts the runtime (in the
form of a MIME type extension). Using Internet Explorer to host the runtime enables you to
embed managed components or Windows Forms controls in HTML documents.

FEATURES OF THE COMMON LANGUAGE RUNTIME:

The common language runtime manages memory, thread execution, code
execution, code safety verification, compilation, and other system services; these are all handled
by the CLR.

Security.

Robustness.

Productivity.

Performance.

SECURITY

The runtime enforces code access security. The security features of the runtime thus
enable legitimate Internet-deployed software to be exceptionally feature rich. With regard to
security, managed components are awarded varying degrees of trust, depending on a number of
factors that include their origin; this determines whether they are allowed to perform file-access
operations, registry-access operations, or other sensitive functions.

ROBUSTNESS:

The runtime also enforces code robustness by implementing a strict type- and code-
verification infrastructure called the common type system (CTS). The CTS ensures that all
managed code is self-describing. The managed environment of the runtime eliminates many
common software issues.

PRODUCTIVITY:

The runtime also accelerates developer productivity. For example, programmers can write
applications in their development language of choice, yet take full advantage of the runtime, the
class library, and components written in other languages by other developers.

PERFORMANCE:

The runtime is designed to enhance performance. Although the common language


runtime provides many standard runtime services, managed code is never interpreted. A feature
called just-in-time (JIT) compiling enables all managed code to run in the native machine
language of the system on which it is executing. Finally, the runtime can be hosted by high-
performance, server-side applications, such as Microsoft® SQL Server™ and Internet
Information Services (IIS).

ASP.NET

ASP.NET is the next version of Active Server Pages (ASP); it is a unified Web
development platform that provides the services necessary for developers to build enterprise-
class Web applications. While ASP.NET is largely syntax compatible with ASP, it also provides a new
programming model and infrastructure for more secure, scalable, and stable applications.

ASP.NET is a compiled, .NET-based environment; we can author applications in


any .NET compatible language, including Visual Basic .NET, C#, and JScript .NET.
Additionally, the entire .NET Framework is available to any ASP.NET application. Developers
can easily access the benefits of these technologies, which include the managed common
language runtime environment (CLR), type safety, inheritance, and so on.

ASP.NET has been designed to work seamlessly with WYSIWYG HTML editors and
other programming tools, including Microsoft Visual Studio .NET. Not only does this make Web
development easier, but it also provides all the benefits that these tools have to offer, including a
GUI that developers can use to drop server controls onto a Web page and fully integrated
debugging support.

Developers can choose from two features when creating an ASP.NET application, Web
Forms and Web services, or combine these in any way they see fit. Each is supported by the same
infrastructure that allows you to use authentication schemes, cache frequently used data, or
customize your application's configuration, to name only a few possibilities.

Web Forms allows us to build powerful forms-based Web pages. When building these
pages, we can use ASP.NET server controls to create common UI elements, and program them
for common tasks. These controls allow us to rapidly build a Web Form out of reusable built-in
or custom components, simplifying the code of a page.

An XML Web service provides the means to access server functionality remotely. Using
Web services, businesses can expose programmatic interfaces to their data or business logic,
which in turn can be obtained and manipulated by client and server applications. XML Web
services enable the exchange of data in client-server or server-server scenarios, using standards
like HTTP and XML messaging to move data across firewalls. XML Web services are not tied to
a particular component technology or object-calling convention. As a result, programs written in
any language, using any component model, and running on any operating system can access
XML Web services.

Each of these models can take full advantage of all ASP.NET features, as well as the
power of the .NET Framework and .NET Framework common language runtime.

Accessing databases from ASP.NET applications is an often-used technique for


displaying data to Web site visitors. ASP.NET makes it easier than ever to access databases for
this purpose. It also allows us to manage the database from our code.

ASP.NET provides a simple model that enables Web developers to write logic that runs
at the application level. Developers can write this code in the Global.asax text file or in a
compiled class deployed as an assembly. This logic can include application-level events, but
developers can easily extend this model to suit the needs of their Web application.

ASP.NET provides easy-to-use application and session-state facilities that are familiar
to ASP developers and are readily compatible with all other .NET Framework APIs.

ASP.NET offers the IHttpHandler and IHttpModule interfaces. Implementing the IHttpHandler interface gives you a means of interacting with the low-level request and response services of the IIS Web server and provides functionality much like ISAPI extensions, but with a simpler programming model. Implementing the IHttpModule interface allows you to include custom events that participate in every request made to your application.
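A minimal sketch of a custom handler (the class name and response text are placeholders; it would be registered under the httpHandlers section of web.config) looks like this:

using System.Web;

public class PingHandler : IHttpHandler
{
    public bool IsReusable
    {
        get { return true; }   // a single instance may serve multiple requests
    }

    public void ProcessRequest(HttpContext context)
    {
        // Low-level access to the request and response, similar to an ISAPI extension.
        context.Response.ContentType = "text/plain";
        context.Response.Write("pong");
    }
}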

ASP.NET takes advantage of performance enhancements found in the .NET Framework and common language runtime. Additionally, it has been designed to offer
significant performance improvements over ASP and other Web development platforms. All
ASP.NET code is compiled, rather than interpreted, which allows early binding, strong typing,
and just-in-time (JIT) compilation to native code, to name only a few of its benefits. ASP.NET is
also easily factorable, meaning that developers can remove modules (a session module, for
instance) that are not relevant to the application they are developing.

ASP.NET provides extensive caching services (both built-in services and caching APIs).
ASP.NET also ships with performance counters that developers and system administrators can
monitor to test new applications and gather metrics on existing applications.

Writing custom debug statements to your Web page can help immensely in
troubleshooting your application's code. However, it can cause embarrassment if it is not
removed. The problem is that removing the debug statements from your pages when your
application is ready to be ported to a production server can require significant effort.

ASP.NET offers the TraceContext class, which allows us to write custom debug statements to our pages as we develop them. They appear only when tracing has been enabled for a page or for the entire application. Enabling tracing also appends details about a request to the page or, if you so specify, to a custom trace viewer that is stored in the root directory of your application.
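For example, assuming tracing is enabled for the page (Trace="true" in the @Page directive; the page and category names below are only illustrative), debug output can be written through the page's TraceContext:

using System;
using System.Web.UI;

public partial class SearchPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // These messages are shown only while tracing is enabled;
        // they disappear from the rendered page once tracing is turned off.
        Trace.Write("Search", "Page_Load started");
        Trace.Warn("Search", "No keywords supplied yet");
    }
}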

The .NET Framework and ASP.NET provide default authorization and authentication
schemes for Web applications. We can easily remove, add to, or replace these schemes,
depending upon the needs of our application.

DATA ACCESS WITH ADO.NET

As you develop applications using ADO.NET, you will have different requirements for
working with data. You might never need to directly edit an XML file containing data - but it is
very useful to understand the data architecture in ADO.NET.

ADO.NET offers several advantages over previous versions of ADO:

Interoperability

Maintainability

Programmability

Performance

Scalability

INTEROPERABILITY:

ADO.NET applications can take advantage of the flexibility and broad acceptance of
XML. Because XML is the format for transmitting datasets across the network, any component
that can read the XML format can process data. The receiving component need not be an
ADO.NET component.

The transmitting component can simply transmit the dataset to its destination without
regard to how the receiving component is implemented. The destination component might be a
Visual Studio application or any other application implemented with any tool whatsoever.

The only requirement is that the receiving component be able to read XML; XML was designed with exactly this kind of interoperability in mind.
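As a small sketch of this idea (the table, columns, and file name are only examples), a DataSet can be serialized to XML on the transmitting side and re-read by any component that understands XML:

using System.Data;

class XmlInteropSketch
{
    static void Main()
    {
        // Transmitting side: build a dataset and write it out as XML.
        var ds = new DataSet("Restaurants");
        var table = ds.Tables.Add("Restaurant");
        table.Columns.Add("Name", typeof(string));
        table.Columns.Add("City", typeof(string));
        table.Rows.Add("Steak House", "Chennai");
        ds.WriteXml("restaurants.xml", XmlWriteMode.WriteSchema);

        // Receiving side: the destination need not be an ADO.NET component,
        // but an ADO.NET client can reload the same data with one call.
        var received = new DataSet();
        received.ReadXml("restaurants.xml");
    }
}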

MAINTAINABILITY:

In the life of a deployed system, modest changes are possible, but substantial architectural changes are rarely attempted because they are so difficult. As the performance load on a deployed application server grows, system resources can become scarce and response time
or throughput can suffer. Faced with this problem, software architects can choose to divide the
server's business-logic processing and user-interface processing onto separate tiers on separate
machines.

In effect, the application server tier is replaced with two tiers, alleviating the shortage of
system resources. If the original application is implemented in ADO.NET using datasets, this
transformation is made easier.

PROGRAMMABILITY:

ADO.NET data components in Visual Studio encapsulate data access functionality in various ways that help you program more quickly and with fewer mistakes.

PERFORMANCE:

ADO.NET datasets offer performance advantages over ADO disconnected record sets. In ADO.NET, data-type conversion is not necessary.

SCALABILITY:

ADO.NET accommodates scalability by encouraging programmers to conserve limited resources. An ADO.NET application that employs disconnected access to data does not retain database locks or active database connections for long durations.

VISUAL STUDIO .NET
Visual Studio .NET is a complete set of development tools for building ASP Web applications, XML Web services, desktop applications, and mobile applications. In addition to
building high-performing desktop applications, you can use Visual Studio's powerful
component-based development tools and other technologies to simplify team-based design,
development, and deployment of Enterprise solutions.

Visual Basic .NET, Visual C++ .NET, and Visual C# .NET all use the same integrated development environment (IDE), which allows them to share tools and facilitates the creation
of mixed-language solutions. In addition, these languages leverage the functionality of the .NET
Framework and simplify the development of ASP Web applications and XML Web services.

Visual Studio supports the .NET Framework, which provides a common language
runtime and unified programming classes; ASP.NET uses these components to create ASP Web
applications and XML Web services. Visual Studio also includes the MSDN Library, which contains all the documentation for these development tools.

XML WEB SERVICES:

XML Web services are applications that can receive the requested data using XML over
HTTP. XML Web services are not tied to a particular component technology or object-calling convention, and so they can be accessed from any language, component model, or operating system. In
Visual Studio .NET, you can quickly create and include XML Web services using Visual Basic,
Visual C#, JScript, Managed Extensions for C++, or ATL Server.

XML SUPPORT:

Extensible Markup Language (XML) provides a method for describing structured data.
XML is a subset of SGML that is optimized for delivery over the Web. The World Wide Web
Consortium (W3C) defines XML standards so that structured data will be uniform and
independent of applications. Visual Studio .NET fully supports XML, providing the XML
Designer to make it easier to edit XML and create XML schemas.

VISUAL BASIC .NET

Visual Basic .NET, the latest version of Visual Basic, includes many new features. Earlier versions of Visual Basic supported interfaces but not implementation inheritance.

Visual Basic .NET supports implementation inheritance, interfaces, and overloading. In addition, Visual Basic .NET supports multithreading.

COMMON LANGUAGE SPECIFICATION (CLS):

Visual Basic .NET is also compliant with the CLS (Common Language Specification) and supports structured exception handling. The CLS is a set of rules and constructs that are supported by the CLR (Common Language Runtime). The CLR is the runtime environment provided by the .NET Framework; it manages the execution of the code and also makes the development process easier
by providing services.

Visual Basic .NET is a CLS-compliant language. Any objects, classes, or components created in Visual Basic .NET can be used in any other CLS-compliant language. In addition, we can use objects, classes, and components created in other CLS-compliant languages in Visual Basic .NET. The use of the CLS ensures complete interoperability among applications, regardless of the languages used to create them.

IMPLEMENTATION INHERITANCE:

Visual Basic .NET supports implementation inheritance. This means that, while creating applications in Visual Basic .NET, we can derive from another class, which is known as the base class. The derived class inherits all the methods and properties of the base class. In the derived class, we can either use the existing code of the base class or override it. Therefore, with the help of implementation inheritance, code can be reused.

CONSTRUCTORS AND DESTRUCTORS:

Constructors are used to initialize objects, whereas destructors are used to destroy them. In other words, destructors are used to release the resources allocated to the object. In Visual Basic .NET, the Sub Finalize procedure is available; it is used to complete the tasks that must be performed when an object is destroyed, and it is called automatically at that time. In addition, the Sub Finalize procedure can be called only from the class it belongs to or from derived classes.

GARBAGE COLLECTION:

Garbage Collection is another new feature in Visual Basic.NET. The .NET Framework
monitors allocated resources, such as objects and variables. In addition, the .NET Framework automatically releases memory for reuse by destroying objects that are no longer in use. In
Visual Basic.NET, the garbage collector checks for the objects that are not currently in use by
applications. When the garbage collector comes across an object that is marked for garbage
collection, it releases the memory occupied by the object.

OVERLOADING:

Overloading is another feature in Visual Basic .NET. Overloading enables us to define multiple procedures with the same name, where each procedure has a different set of arguments. Besides using overloading for procedures, we can use it for constructors and properties in a class.

MULTITHREADING:

Visual Basic .NET also supports multithreading. An application that supports multithreading can handle multiple tasks simultaneously. We can use multithreading to decrease the time taken by an application to respond to user interaction; to do so, we must ensure that a separate thread in the application handles user interaction.

STRUCTURED EXCEPTION HANDLING:

Visual Basic .NET supports structured exception handling, which enables us to detect and handle errors at runtime. In Visual Basic .NET, we use Try…Catch…Finally statements to create exception handlers. Using Try…Catch…Finally statements, we can create robust and effective exception handlers to improve the robustness of our application.

BACK END DESIGN

Features of SQL-SERVER

The OLAP Services feature available in SQL Server version 7.0 is now called SQL
Server 2000 Analysis Services. The term OLAP Services has been replaced with the term
Analysis Services. Analysis Services also includes a new data mining component. The
Repository component available in SQL Server version 7.0 is now called Microsoft SQL Server
2000 Meta Data Services. References to the component now use the term Meta Data Services.
The term repository is used only in reference to the repository engine within Meta Data Services.

A SQL Server database consists of the following types of objects:

1. TABLE

2. QUERY

3. FORM

4. REPORT

5. MACRO

TABLE

A table is a collection of data about a specific topic.

VIEWS OF TABLE:

We can work with a table in two views.

1. Design View

2. Datasheet View

Design View

To build or modify the structure of a table, we work in the table design view, where we can specify what kind of data the table will hold.

Datasheet View

To add, edit, or analyse the data itself, we work in the table's datasheet view mode.

QUERY:

A query is a question asked of the data. Access gathers the data that answers the question from one or more tables. The data that makes up the answer is either a dynaset (if you can edit it) or a snapshot (which cannot be edited). Each time we run a query, we get the latest information in the dynaset. Access either displays the dynaset or snapshot for us to view, or performs an action on it, such as deleting or updating.

FORMS:

A form is used to view and edit information in the database record by record. A form displays only the information we want to see, in the way we want to see it. Forms use familiar controls such as textboxes and checkboxes, which makes viewing and entering data easy.

Views of Form:

We can work with forms in several views; primarily, there are two views.

They are,

1. Design View

2. Form View

Design View

To build or modify the structure of a form, we work in the form's design view. We can add controls to the form that are bound to fields in a table or query, including textboxes, option buttons, graphs, and pictures.

Form View

The form view displays the whole design of the form.

REPORT:

A report is used to view and print information from the database. A report can group records into many levels and compute totals and averages by checking values from many records at once. The report is also attractive and distinctive because we have control over its size and appearance.

MACRO:

A macro is a set of actions, each of which does something, such as opening a form or printing a report. We write macros to automate common tasks, making the work easy and saving time.

C# (pronounced C Sharp) is a multi-paradigm programming language that encompasses functional, imperative, generic, object-oriented (class-based), and component-oriented
programming disciplines. It was developed by Microsoft as part of the .NET initiative and later
approved as a standard by ECMA (ECMA-334) and ISO (ISO/IEC 23270). C# is one of the 44
programming languages supported by the .NET Framework's Common Language Runtime.

C# is intended to be a simple, modern, general-purpose, object-oriented programming language. Anders Hejlsberg, the designer of Delphi, leads the team which is developing C#. It has an object-oriented syntax based on C++ and is heavily influenced by other
developing C#. It has an object-oriented syntax based on C++ and is heavily influenced by other
programming languages such as Delphi and Java. It was initially named Cool, which stood for
"C like Object Oriented Language". However, in July 2000, when Microsoft made the project
public, the name of the programming language was given as C#. The most recent version of the
language is C# 3.0 which was released in conjunction with the .NET Framework 3.5 in 2007.
The next proposed version, C# 4.0, is in development.

History:-

In 1996, Sun Microsystems released the Java programming language with Microsoft soon
purchasing a license to implement it in their operating system. Java was originally meant to be a
platform independent language, but Microsoft, in their implementation, broke their license
agreement and made a few changes that would essentially inhibit Java's platform-independent
capabilities. Sun filed a lawsuit and Microsoft settled, deciding to create their own version of a
partially compiled, partially interpreted object-oriented programming language with syntax
closely related to that of C++.

During the development of .NET, the class libraries were originally written in a
language/compiler called Simple Managed C (SMC). In January 1999, Anders Hejlsberg formed
a team to build a new language at the time called Cool, which stood for "C like Object Oriented
Language".Microsoft had considered keeping the name "Cool" as the final name of the language,
but chose not to do so for trademark reasons. By the time the .NET project was publicly
announced at the July 2000 Professional Developers Conference, the language had been renamed
C#, and the class libraries and ASP.NET runtime had been ported to C#.

C#'s principal designer and lead architect at Microsoft is Anders Hejlsberg, who was
previously involved with the design of Visual J++, Borland Delphi, and Turbo Pascal. In
interviews and technical papers he has stated that flaws in most major programming languages
(e.g. C++, Java, Delphi, and Smalltalk) drove the fundamentals of the Common Language
Runtime (CLR), which, in turn, drove the design of the C# programming language itself. Some
argue that C# shares roots in other languages.

Features of C#:-

By design, C# is the programming language that most directly reflects the underlying
Common Language Infrastructure (CLI). Most of C#'s intrinsic types correspond to value-types
implemented by the CLI framework. However, the C# language specification does not state the
code generation requirements of the compiler: that is, it does not state that a C# compiler must
target a Common Language Runtime (CLR), or generate Common Intermediate Language (CIL),
or generate any other specific format. Theoretically, a C# compiler could generate machine code like traditional compilers of C++ or FORTRAN; in practice, all existing C# implementations
target CIL.

Some notable C# distinguishing features are:

 There are no global variables or functions. All methods and members must be declared
within classes. It is possible, however, to use static methods/variables within public
classes instead of global variables/functions.
 Local variables cannot shadow variables of the enclosing block, unlike C and C++.
Variable shadowing is often considered confusing by C++ texts.
 C# supports a strict Boolean data type, bool. Statements that take conditions, such as
while and if, require an expression of a Boolean type. While C++ also has a Boolean
type, it can be freely converted to and from integers, and expressions such as if(a) require
only that a is convertible to bool, allowing a to be an int, or a pointer. C# disallows this
"integer meaning true or false" approach on the grounds that forcing programmers to use
expressions that return exactly bool can prevent certain types of programming mistakes
such as if (a = b) (use of = instead of ==).
 In C#, memory address pointers can only be used within blocks specifically marked as
unsafe, and programs with unsafe code need appropriate permissions to run. Most object
access is done through safe object references, which are always either pointing to a valid,
existing object, or have the well-defined null value; a reference to a garbage-collected
object, or to random block of memory, is impossible to obtain. An unsafe pointer can
point to an instance of a value-type, array, string, or a block of memory allocated on a
stack. Code that is not marked as unsafe can still store and manipulate pointers through
the System.IntPtr type, but cannot dereference them.
 Managed memory cannot be explicitly freed, but is automatically garbage collected.
Garbage collection addresses memory leaks. C# also provides direct support for
deterministic finalization with the using statement (supporting the Resource Acquisition
Is Initialization idiom).
 Multiple inheritance is not supported, although a class can implement any number of
interfaces. This was a design decision by the language's lead architect to avoid complication, avoid dependency hell, and simplify architectural requirements throughout the CLI.
 C# is more type safe than C++. The only implicit conversions by default are those which
are considered safe, such as widening of integers and conversion from a derived type to a
base type. This is enforced at compile-time, during JIT, and, in some cases, at runtime.
There are no implicit conversions between Booleans and integers, nor between
enumeration members and integers (except for literal 0, which can be implicitly
converted to any enumerated type). Any user-defined conversion must be explicitly
marked as explicit or implicit, unlike C++ copy constructors (which are implicit by
default) and conversion operators (which are always implicit).
 Enumeration members are placed in their own scope.
 C# provides syntactic sugar for a common pattern of a pair of methods, accessor (getter)
and mutator (setter), encapsulating operations on a single attribute of a class, in the form of
properties (see the sketch after this list).
 Full type reflection and discovery is available.
 C# currently (as of 3 June 2008) has 77 reserved words.
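To make the property and strict-Boolean points above concrete, here is a small hedged sketch (the class and its members are hypothetical, not part of this project):

public class Restaurant
{
    // Property syntax: a getter/setter pair encapsulating a private field.
    private string name;
    public string Name
    {
        get { return name; }
        set { name = value; }
    }

    // Auto-implemented property (available since C# 3.0).
    public bool IsOpen { get; set; }

    public string Describe()
    {
        // Conditions must be of type bool: "if (someInt)" does not compile in C#,
        // which also catches accidental assignments such as "if (x = y)".
        if (IsOpen)
        {
            return Name + " is open";
        }
        return Name + " is closed";
    }
}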

Common Type system (CTS)

C# has a unified type system. This unified type system is called Common Type System
(CTS).

A unified type system implies that all types, including primitives such as integers, are subclasses of the System.Object class. For example, every type inherits a ToString() method. For performance reasons, primitive types (and value types in general) are internally allocated on the stack.

Categories of data types

The CTS separates data types into two categories:

 Value types
 Reference types

Value types are plain aggregations of data. Instances of value types do not have
referential identity or referential comparison semantics - equality and inequality comparisons for
value types compare the actual data values within the instances, unless the corresponding
operators are overloaded. Value types are derived from System.ValueType, always have a
default value, and can always be created and copied. Some other limitations on value types are
that they cannot derive from each other (but can implement interfaces) and cannot have a default
(parameterless) constructor. Examples of value types are some primitive types, such as int (a
signed 32-bit integer), float (a 32-bit IEEE floating-point number), char (a 16-bit Unicode code
point), and System.DateTime (identifies a specific point in time with millisecond precision).

In contrast, reference types have the notion of referential identity: each instance of a reference type is inherently distinct from every other instance, even if the data within both instances is the same. This is reflected in default equality and inequality comparisons for reference types, which test for referential rather than structural equality, unless the corresponding operators are overloaded (as is the case for System.String). In general, it is not always possible to create an instance of a reference type, nor to copy an existing instance, nor to perform a value comparison on two existing instances, though specific reference types can provide such services by exposing a public constructor or implementing a corresponding interface (such as ICloneable or IComparable). Examples of reference types are object (the ultimate base class for all other C# classes), System.String (a string of Unicode characters), and System.Array (a base class for all C# arrays).

Both type categories are extensible with user-defined types.
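A short sketch illustrating the difference in default equality semantics described above (the two point types are hypothetical examples):

using System;

struct PointValue          // a value type: compared by its data by default
{
    public int X, Y;
}

class PointReference       // a reference type: compared by identity by default
{
    public int X, Y;
}

class TypeCategoriesDemo
{
    static void Main()
    {
        var v1 = new PointValue { X = 1, Y = 2 };
        var v2 = new PointValue { X = 1, Y = 2 };
        // Value types: Equals compares the contained data, so this prints True.
        Console.WriteLine(v1.Equals(v2));

        var r1 = new PointReference { X = 1, Y = 2 };
        var r2 = new PointReference { X = 1, Y = 2 };
        // Reference types: default Equals tests referential identity, so this prints False.
        Console.WriteLine(r1.Equals(r2));
    }
}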

Boxing and unboxing

Boxing is the operation of converting a value of a value type into a value of a corresponding reference type; unboxing is the reverse operation, copying the value back out of the object into a value type.
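A one-line illustration of boxing and its reverse:

class BoxingDemo
{
    static void Main()
    {
        int count = 42;            // value type
        object boxed = count;      // boxing: the int is copied into a heap-allocated object
        int unboxed = (int)boxed;  // unboxing: an explicit cast copies the value back out
        System.Console.WriteLine(unboxed);
    }
}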

ACCESS PRIVILEGES

IIS provides several new access levels. The following values can set the type of access
allowed to specific directories:

o Read
o Write
o Script
o Execute
o Log Access
o Directory Browsing.
IIS WEBSITE ADMINISTRATION

Administering websites can be time consuming and costly, especially for people who manage large Internet Service Provider (ISP) installations. To save time and money, ISPs often support only large company web sites at the expense of personal websites. But is there a cost-effective way to support both? The answer is yes, if you can automate administrative tasks and let users administer their own sites from remote computers. This solution reduces the amount of time and money it takes to manually administer a large installation, without reducing the number of web sites supported.

Microsoft Internet Information Server (IIS) version 4.0 offers two technologies to do this:

1. Windows Scripting Host (WSH)

2. IIS Admin Objects built on top of the Active Directory Service Interfaces (ADSI)

With these technologies working together behind the scenes, an administrator can administer sites from the command line of a central computer and can group frequently used commands in batch files.

Then all a user needs to do is run the batch files to add new accounts, change permissions, add a virtual server to a site, and perform many other tasks.

.NET FRAMEWORK

The .NET Framework is many things, but it is worthwhile listing its most important
aspects. In short, the .NET Framework is:

A Platform designed from the start for writing Internet-aware and Internet-enabled
applications that embrace and adopt open standards such as XML, HTTP, and SOAP.

A Platform that provides a number of very rich and powerful application development
technologies, such as Windows Forms, used to build classic GUI applications, and of course
ASP.NET, used to build web applications.

A Platform with an extensive class library that provides rich support for data access (relational and XML), directory services, message queuing, and much more.

A platform that has a base class library that contains hundreds of classes for performing
common tasks such as file manipulation, registry access, security, threading, and searching of
text using regular expressions.

A platform that doesn’t forget its origins, and has great interoperability support for
existing components that you or third parties have written, using COM or standard DLLs.

A Platform with an independent code execution and management environment called the Common Language Runtime (CLR), which ensures code is safe to run and provides an abstraction layer on top of the operating system, meaning that elements of the .NET Framework can run on many operating systems and devices.

ASP.NET

ASP.NET is part of the whole .NET Framework, built on top of the Common Language
Runtime (also known as the CLR) - a rich and flexible architecture, designed not just to cater for
the needs of developers today, but to allow for the long future we have ahead of us. What you
might not realize is that, unlike previous updates of ASP, ASP.NET is very much more than just
an upgrade of existing technology – it is the gateway to a whole new era of web development.

ASP.NET is a feature of the following web server releases:

o Microsoft IIS 5.0 on WINDOWS 2000 Server


o Microsoft IIS 5.1 on WINDOWS XP
ASP.NET has been designed to try to maintain syntax and run-time compatibility with existing ASP pages wherever possible. The motivation behind this is to allow existing ASP pages to be initially migrated to ASP.NET by simply renaming the file to have an extension of .aspx. For the most part this goal has been achieved, although there are typically some basic
code changes that have to be made, since VBScript is no longer supported, and the VB language
itself has changed.

Some of the key goals of ASP.NET were to

o Remove the dependency on script engines, enabling pages to be type safe and
compiled.
o Reduce the amount of code required to develop web applications.
o Make ASP.NET well factored, allowing customers to add in their own custom
functionality, and extend/ replace built-in ASP.NET functionality.
o Make ASP.NET a logical evolution of ASP, where existing ASP investment and
therefore code can be reused with little, if any, change.
o Realize that bugs are a fact of life, so ASP.NET should be as fault tolerant as
possible.
Benefits of ASP.NET

The .NET Framework includes a new data access technology named ADO.NET, an
evolutionary improvement to ADO. Though the new data access technology is evolutionary, the
classes that make up ADO.NET bear little resemblance to the ADO objects with which you
might be familiar. Some fairly significant changes must be made to existing ADO applications to
convert them to ADO.NET. The changes don't have to be made immediately to existing ADO
applications to run under ASP.NET, however.

ADO will function under ASP.NET. However, the work necessary to convert ADO
applications to ADO.NET is worthwhile. For disconnected applications, ADO.NET should offer
performance advantages over ADO disconnected record sets. ADO requires that transmitting and
receiving components be COM objects. ADO.NET transmits data in a standard XML-format file
so that COM marshaling or data type conversions are not required.

ASP.NET has several advantages over ASP.

The following are some of the benefits of ASP.NET:

o Make code cleaner.
o Improve deployment, scalability, and reliability.
o Provide better support for different browsers and devices.
o Enable a new breed of web applications.

VB Script

VBScript, sometimes known as Visual Basic Scripting Edition, is Microsoft's answer to JavaScript. Just as JavaScript's syntax is loosely based on Java, VBScript's syntax is loosely based on Microsoft Visual Basic, a popular programming language for Windows machines.

Like JavaScript, VBScript is a simple scripting language, and we can include VBScript statements within an HTML document. To begin a VBScript block, we use the <SCRIPT LANGUAGE="VBScript"> tag.

VBScript can do many of the same things as JavaScript, and it even looks similar in some cases.

It has two main advantages:

For those who already know Visual Basic, it may be easier to learn than JavaScript.

It is closely integrated with ActiveX, Microsoft's standard for Web-embedded applications.

VBScript's main disadvantage is that only Microsoft Internet Explorer supports it. Both Netscape and Internet Explorer, on the other hand, support JavaScript. JavaScript is also a much more popular language, and we can see it in use all over the Web.

ActiveX

ActiveX is a specification developed by Microsoft that allows ordinary Windows programs to be run within a Web page. ActiveX programs can be written in languages such as Visual Basic, and they are compiled before being placed on the Web server.

ActiveX applications, called controls, are downloaded and executed by the Web browser, like Java applets. Unlike Java applets, controls can be installed permanently when they are downloaded, eliminating the need to download them again. ActiveX's main advantage is that it can do just about anything.

This can also be a disadvantage:

Several enterprising programmers have already used ActiveX to bring exciting new capabilities to Web pages, such as "the Web page that turns off your computer" and "the Web page that formats your disk drive".

Fortunately, ActiveX includes a signature feature that identifies the source of the control and prevents controls from being modified. While this won't prevent a control from damaging your system, we can specify which sources of controls we trust.

ActiveX has two main disadvantages:

It isn't as easy to program as a scripting language or Java.

ActiveX is proprietary; it works only in Microsoft Internet Explorer and only on Windows platforms.

ADO.NET

ADO.NET provides consistent access to data sources such as Microsoft SQL Server, as
well as data sources exposed via OLE DB and XML. Data-sharing consumer applications can
use ADO.NET to connect to these data sources and retrieve, manipulate, and update data.

ADO.NET cleanly factors data access from data manipulation into discrete components that can be used separately or in tandem. ADO.NET includes .NET data providers for connecting to a database, executing commands, and retrieving results. Those results are either processed directly, or placed in an ADO.NET DataSet object in order to be exposed to the user in an ad-hoc manner, combined with data from multiple sources, or remoted between tiers. The ADO.NET DataSet object can also be used independently of a .NET data provider to manage data local to the application or sourced from XML.
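A minimal sketch of this flow with the SQL Server data provider (the connection string, table, and column names are placeholders, not the actual schema of this project):

using System.Data;
using System.Data.SqlClient;

class DataAccessSketch
{
    static void Main()
    {
        string connectionString =
            "Data Source=.;Initial Catalog=RestaurantDb;Integrated Security=True";

        using (var connection = new SqlConnection(connectionString))
        {
            // The data adapter opens the connection, runs the command,
            // fills the disconnected DataSet, and closes the connection again.
            var adapter = new SqlDataAdapter("SELECT name, city FROM Restaurant", connection);
            var ds = new DataSet();
            adapter.Fill(ds, "Restaurant");

            // The DataSet can now be used, combined, or serialized without
            // holding any connection or locks on the database.
            foreach (DataRow row in ds.Tables["Restaurant"].Rows)
            {
                System.Console.WriteLine("{0} ({1})", row["name"], row["city"]);
            }
        }
    }
}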

Why ADO.NET?

As application development has evolved, new applications have become loosely coupled
based on the Web application model. More and more of today's applications use XML to encode
data to be passed over network connections. Web applications use HTTP as the fabric for
communication between tiers, and therefore must explicitly handle maintaining state between
requests. This new model is very different from the connected, tightly coupled style of
programming that characterized the client/server era, where a connection was held open for the
duration of the program's lifetime and no special handling of state was required.

In designing tools and technologies to meet the needs of today's developer, Microsoft
recognized that an entirely new programming model for data access was needed, one that is built
upon the .NET Framework. Building on the .NET Framework ensured that the data access
technology would be uniform—components would share a common type system, design
patterns, and naming conventions.

ADO.NET was designed to meet the needs of this new programming model:
disconnected data architecture, tight integration with XML, common data representation with the
ability to combine data from multiple and varied data sources, and optimized facilities for
interacting with a database, all native to the .NET Framework.

Leverage Current ADO Knowledge

Microsoft's design for ADO.NET addresses many of the requirements of today's application development model. At the same time, the programming model stays as similar as possible to ADO, so current ADO developers do not have to start from scratch in learning a brand new data access technology. ADO.NET is an intrinsic part of the .NET Framework without seeming completely foreign to the ADO programmer.

ADO.NET coexists with ADO. While most new .NET applications will be written using ADO.NET, ADO remains available to the .NET programmer through .NET COM interoperability services. ADO.NET provides first-class support for the disconnected, n-tier programming environment for which many new applications are written. The concept of working with a disconnected set of data has become a focal point in the programming model. The ADO.NET solution for n-tier programming is the DataSet.

XML Support

XML and data access are intimately tied—XML is all about encoding data, and data
access is increasingly becoming all about XML. The .NET Framework does not just support
Web standards—it is built entirely on top of them.

SYSTEM ARCHITECTURE

Use case diagram

Class diagram

Sequence diagram

Collaboration diagram

Activity diagram

ER Diagram:

[ER diagram relating the User, Admin, Place, and Restaurant entities, with attributes such as user id, username, password, name, gender, city, email, mobile, place, famous, category, address, description, url, distance, and district; the User searches restaurants, and the Admin adds place and restaurant details.]
DFD:

Level 1: [DFD showing the fast nearest neighbour search process involving the User and the Admin.]
Level 2: [DFD of the user-side flow: user registration (User table), login validation (Login table), view and edit profile (User table), approximate string / approximate keyword search, place selection (Place table), and display of the nearest restaurants and their details (Restaurant table); the Admin branch continues in Level 3.]

Level 3: [DFD of the admin-side flow: register restaurant details (Restaurant table), add place details (Place table), view user details and delete users (User table).]

SYSTEM DESCRIPTION

MODULES:

1. Approximate string search
2. Spatial approximate-keyword search
3. Nearest Neighbor and Top-k Queries
4. Location-Aware Text Retrieval Queries

MODULE DESCRIPTION

APPROXIMATE STRING SEARCH

We refer to the problem of conjunctive keyword search with relaxed keywords as approximate keyword search. An important subproblem of approximate keyword search is that of approximate string search, defined as follows: given a collection of strings, find those that are similar to a given query string. There are two main families of approaches to answer such queries. (1) Trie-based methods: the string collection is stored as a trie (or a suffix tree). To answer an approximate string query, we traverse the trie and prune subtrees that cannot be similar to the query. Popular pruning techniques use an NFA or a dynamic programming matrix with backtracking. (2) Inverted-index methods: the main idea is to decompose each string in the collection into small overlapping substrings (called grams) and build an inverted index on these grams. More details on fast indexing and search algorithms can be found in the literature.
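As a rough sketch of the inverted-index idea (the gram length, sample strings, and candidate-counting step are illustrative assumptions, not the exact structures of the cited work):

using System;
using System.Collections.Generic;

class GramIndexSketch
{
    // Decompose a string into overlapping substrings (grams) of length q.
    static IEnumerable<string> Grams(string s, int q)
    {
        for (int i = 0; i + q <= s.Length; i++)
            yield return s.Substring(i, q);
    }

    static void Main()
    {
        string[] strings = { "spaghetti", "spagetti", "steak" };
        int q = 2;

        // Inverted index: gram -> ids of strings containing that gram.
        var index = new Dictionary<string, List<int>>();
        for (int id = 0; id < strings.Length; id++)
            foreach (string g in Grams(strings[id], q))
            {
                List<int> postings;
                if (!index.TryGetValue(g, out postings))
                {
                    postings = new List<int>();
                    index[g] = postings;
                }
                postings.Add(id);
            }

        // Candidates for a misspelled query share many grams with it;
        // here we simply count the gram overlaps as a crude similarity signal.
        var overlap = new int[strings.Length];
        foreach (string g in Grams("spageti", q))
        {
            List<int> postings;
            if (index.TryGetValue(g, out postings))
                foreach (int id in postings) overlap[id]++;
        }

        for (int id = 0; id < strings.Length; id++)
            Console.WriteLine("{0}: {1} shared grams", strings[id], overlap[id]);
    }
}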
 
 
SPATIAL APPROXIMATE-KEYWORD SEARCH

Yao et al. proposed a structure called the MHR-tree to answer spatial approximate-keyword queries. They enhance an R-tree with min-wise signatures at each node to compactly represent
the union of the grams contained in objects of that subtree. They then use the concept of set
resemblance between the signatures and the query strings to prune branches in the tree. The main advantage of this approach is that its index does not require a lot of space, since the min-wise
signatures are very small. However, the method could miss query answers due to the
probabilistic nature of the signatures. Sattam Alsubaiee et al. study how to support
approximate keyword search on spatial data. Answering such queries efficiently is critical to
these Web sites to achieve a high query throughput to serve many concurrent users. Although
there are studies on both approximate keyword search and location-based search, the problem of
supporting both types of search simultaneously has received little attention. To answer such
queries, a natural index structure is to augment a tree-based spatial index with approximate string
indexes such as a gram-based inverted index or a trie-based index. The main focus of this work is
a systematic study on how to efficiently combine these two types of indexes, and how to search
the resulting index (called LBAK-tree) to find answers.

NEAREST NEIGHBOR AND TOP-K QUERIES

The processing of k-nearest neighbor queries (kNNs) in spatial databases is a classical subject. Most proposals use index structures to assist in the kNN processing. Perhaps the most influential kNN algorithm is due to Roussopoulos et al. In this solution, an R-tree indexes the points, potential nearest neighbors are maintained in a priority queue, and the tree is traversed according to a number of heuristics. Other branch-and-bound methods modify the index structures to better suit the particular problem addressed. Hjaltason and Samet propose an incremental nearest neighbor algorithm based on an R*-tree.
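A highly simplified C# sketch of the priority-queue-driven idea behind these algorithms follows; the two-level Node type, the bounding-box MinDist, and the linear-scan "heap" are assumptions made for brevity, not the exact structures of the cited papers:

using System;
using System.Collections.Generic;

// A deliberately simplified node: an internal node has children and a bounding
// box; a leaf entry carries a single point. This is a sketch, not a real R-tree.
class Node
{
    public double[] Low, High;             // bounding box of the subtree
    public double[] Point;                 // non-null only for leaf entries
    public List<Node> Children = new List<Node>();

    // Lower bound on the distance from q to anything stored in this subtree.
    public double MinDist(double[] q)
    {
        if (Point != null)
        {
            double dx = q[0] - Point[0], dy = q[1] - Point[1];
            return Math.Sqrt(dx * dx + dy * dy);
        }
        double sum = 0;
        for (int d = 0; d < q.Length; d++)
        {
            double v = Math.Max(Low[d] - q[d], Math.Max(0, q[d] - High[d]));
            sum += v * v;
        }
        return Math.Sqrt(sum);
    }
}

static class BestFirstNN
{
    // Best-first traversal: always expand the entry with the smallest MinDist;
    // the first data point popped is guaranteed to be the nearest neighbor.
    public static double[] Nearest(Node root, double[] q)
    {
        var frontier = new List<Node> { root };
        while (frontier.Count > 0)
        {
            // A real implementation would use a binary heap instead of a linear scan.
            int best = 0;
            for (int i = 1; i < frontier.Count; i++)
                if (frontier[i].MinDist(q) < frontier[best].MinDist(q)) best = i;
            Node node = frontier[best];
            frontier.RemoveAt(best);

            if (node.Point != null) return node.Point;
            frontier.AddRange(node.Children);
        }
        return null;   // empty tree
    }
}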

LOCATION-AWARE TEXT RETRIEVAL QUERIES

Commercial search engines such as Google and Yahoo! have introduced local search
services that appear to focus on the retrieval of local content, e.g., related to stores and
restaurants. However, the algorithms used are not publicized. Much attention has  been given to
the problem of extracting geographic information from web pages. The extracted information
can be used by search engines. McCurley covers the notion of geo-coding and describes
geographic indicators found in pages, such as zip codes and location names. Recent studies that
consider location-aware text retrieval constitute the work most closely related to this study. Zhou et al. tackle the problem of retrieving web documents relevant to a keyword query within a pre-
specified spatial region. They propose three approaches based on a loose combination of an
inverted file and an R*-tree. The best approach according to their experiments is to build an R*-
tree for each distinct keyword on the web pages containing the keyword. As a result, queries with
multiple keywords need to access multiple R*-trees and to intersect the results. Building a
separate R*-tree for each keyword also requires substantial storage. De Felipe et al. define the problem of top-k spatial keyword search. The IR2-Tree is proposed as an efficient indexing structure to store spatial and textual information for a set of objects. Efficient algorithms are also presented to maintain the IR2-Tree, that is, to insert and delete objects. An efficient incremental algorithm is presented to answer top-k spatial keyword queries using the IR2-Tree. This method is based on the tight integration of data structures and algorithms used in spatial database search and information retrieval (IR). In particular, it consists of building an Information Retrieval R-Tree (IR2-Tree), which is a structure based on the R-Tree. At query time, an incremental algorithm is employed that uses the IR2-Tree to efficiently produce the top results of the query.

SPATIAL INVERTED INDEX ALGORITHM

The kNN spatial keyword query process is shown in Algorithm 1. The inputs to the algorithm are the query point q, the boundary object BO, the parameter k, and the keyword Kw. The kNN result returned by the server is retrieved by calling BO.result() on line 2. H is a min-heap which sorts points according to their distances to the query q. First (lines 3-12), the algorithm constructs the boundary cell (BC) of the first object p1 and checks whether q falls inside BC(p1). If not, p1 is not the first NN and the verification process fails. Otherwise, p1 is verified as the first NN and is added to the Visited set. The subsequent for loop (lines 13-22) iterates through all objects in L (the kNNs from the BO) and performs the following operations: 1) if a neighbor of the last verified object (L[i]) has not been visited yet, it is inserted into the min-heap H and the Visited set (lines 14-18), and 2) the next object in the result set (L[i+1]) is compared with the top of H (lines 19-21). If they are identical, L[i+1] is verified as the next NN. Otherwise, verification fails and the algorithm returns false.

Algorithm 1: kNN (q, BO, k, Kw)

1: H ← ∅; visited ← ∅
2: L ← BO.result(); p1 ← L[1]
3: BCP ← computeBC(p1)
4: if ( q ∉ BCP ) then
5:   return false   {the first NN fails}
6: else
7:   if ( Kw ∈ p1 ) then
8:     visited.add(p1)
9:   else
10:    return false
11:  end if
12: end if
13: for i = 1 to k-1 do
14:   for all ( n ∈ L[i].Neighbors ) do
15:     if ( n ∉ visited ) then
16:       H.push(n); visited.add(n)
17:     end if
18:   end for
19:   if ( L[i+1].location ≠ H.pop() ) then
20:     return false   {the (i+1)-th NN fails}
21:   end if
22: end for
23: return true
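For concreteness, the following is a hedged C# transcription of Algorithm 1; the Point and BoundaryObject types, the distance and boundary-cell delegates, and the list-based min-extraction are hypothetical stand-ins for the structures described above, not the system's actual implementation:

using System.Collections.Generic;

// Hypothetical helper types standing in for the structures used in Algorithm 1.
class Point
{
    public string Location;                          // serialized coordinates, for comparison
    public List<Point> Neighbors = new List<Point>();
    public HashSet<string> Keywords = new HashSet<string>();
}

class BoundaryObject
{
    public List<Point> Result = new List<Point>();   // the kNN list returned by the server
}

static class KnnVerifier
{
    // Returns true if the k results in bo are verified as the keyword-constrained kNNs of q.
    public static bool Verify(BoundaryObject bo, int k, string kw,
                              System.Func<Point, double> distToQuery,
                              System.Func<Point, bool> boundaryCellContainsQuery)
    {
        List<Point> L = bo.Result;
        if (L.Count < k) return false;

        var visited = new HashSet<Point>();
        var candidates = new List<Point>();           // plays the role of the min-heap H

        Point p1 = L[0];
        if (!boundaryCellContainsQuery(p1)) return false;   // lines 3-5: q must fall in BC(p1)
        if (!p1.Keywords.Contains(kw)) return false;         // lines 7-10: keyword check
        visited.Add(p1);

        for (int i = 0; i < k - 1; i++)                      // lines 13-22
        {
            foreach (Point n in L[i].Neighbors)
                if (visited.Add(n))
                    candidates.Add(n);                       // lines 14-18: push unvisited neighbors

            // Pop the candidate closest to the query (lines 19-21);
            // a real implementation would use a proper min-heap.
            if (candidates.Count == 0) return false;
            int best = 0;
            for (int j = 1; j < candidates.Count; j++)
                if (distToQuery(candidates[j]) < distToQuery(candidates[best])) best = j;
            Point top = candidates[best];
            candidates.RemoveAt(best);

            if (L[i + 1].Location != top.Location) return false;
        }
        return true;
    }
}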

NORMALIZATION

The basic objective of normalization is to reduce redundancy, which means that information is stored only once. Storing information several times leads to wastage of storage space and an increase in the total size of the data stored.

If a database is not properly designed, it can give rise to modification anomalies. Modification anomalies arise when data is added to, changed in, or deleted from a database table. Similarly, in traditional databases as well as improperly designed relational databases, data redundancy can be a problem. These problems can be eliminated by normalizing a database.

Normalization is the process of breaking down a table into smaller tables so that each table deals with a single theme. There are three different kinds of modification anomalies; to remove them, the first, second, and third normal forms were formulated. Third normal form (3NF) is considered sufficient for most practical purposes. Normalization should be carried out only after a thorough analysis and complete understanding of its implications.

FIRST NORMAL FORM (1NF):

This form is also called a "flat file". Each column should contain data in respect of a single attribute, and no two rows may be identical. To bring a table to First Normal Form, repeating groups of fields should be identified and moved to another table.

SECOND NORMAL FORM (2NF):

A relation is said to be in 2NF if it is in 1NF and all non-key attributes are functionally dependent on the key attributes. A 'functional dependency' is a relationship among attributes: one attribute is said to be functionally dependent on another if the value of the first attribute depends on the value of the second attribute.

THIRD NORMAL FORM (3NF) :

Third Normal Form normalization is needed where the attributes in a relation tuple are not all functionally dependent only on the key attribute. A transitive dependency is one in which one attribute depends on a second, which in turn depends on a third, and so on.

SYSTEM TESTING

The goal of testing is to improve the program’s quality. Quality is assured primarily
through some form of software testing. The history of testing goes back to the beginning of the
computing field. Testing is done at two levels: testing of individual modules and testing of the entire system. During system testing, the system is used experimentally to ensure that the
software will run according to the specifications and in the way the user expects. Testing is very
tedious and time consuming. Each test case is designed with the intent of finding errors in the
way the system will process it.

The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies, and/or a finished product. It is the
process of exercising software with the intent of ensuring that the Software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of test. Each test type addresses a specific testing requirement.

LEVELS OF TESTING

The software underwent the following tests by the system analyst.

WHITE BOX TESTING

By using this technique it was tested that all the individual logical paths were executed at
least once. All the logical decisions were tested on both their true and false sides. All the loops
were tested with data in between the ranges and especially at the boundary values.

White box testing is testing in which the software tester has knowledge of the inner workings, structure, and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black-box level.

BLACK BOX TESTING

By the use of this technique, the missing functions were identified and placed in their
positions. The errors in the interfaces were identified and corrected. This technique was also used
to identify the initialization and termination errors and correct them.

Black box testing is testing the software without any knowledge of the inner workings, structure, or language of the module being tested. Black box tests, like most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is a form of testing in which the software under test is treated as a black box: you cannot "see" into it. The test provides inputs and responds to outputs without considering how the software works.

UNIT TESTING

It is the verification of a single module, usually in an isolated environment. The System Analyst tests each and every module individually by giving a set of known input data and verifying the required output data. The System Analyst tested the software using a top-down model, starting from the top of the model. The units in a system are the modules and routines that are assembled to perform a specific function. The modules should be tested for correctness of the logic applied and should detect errors in coding. This is the verification of the system against its initial objective. It is a verification process when done in a simulated environment and a validation process when done in a live environment.

Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application. It is done after the completion of an individual unit, before integration. This is a
structural testing, that relies on knowledge of its construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or system
configuration. Unit tests ensure that each unique path of a business process performs accurately
to the documented specifications and contains clearly defined inputs and expected results.

Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.

Test strategy and approach


Field testing will be performed manually and functional tests will be written in detail.

Test objectives
 All field entries must work properly.
 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.

Features to be tested
 Verify that the entries are of the correct format
 No duplicate entries should be allowed
 All links should take the user to the correct page.

INTEGRATION TESTING

The purpose of unit testing is to determine that each independent module is correctly
implemented. This gives little chance to determine that the interface between modules is also
correct, and for this reason integration testing must be performed. One specific target of integration testing is the interface: whether parameters match on both sides as to type, permissible ranges, meaning, and utilization. Module testing assures us that the detailed design was correctly implemented; now it is necessary to verify that the architectural design
specifications were met. Chosen portions of the structure tree of the software are put together.
Each sub tree should have some logical reason for being tested. It may be a particularly difficult
or tricky part of the code; or it may be essential to the function of the rest of the product. As the testing progresses, we find ourselves putting together larger and longer parts of the tree, until the
entire product has been integrated.

Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is
correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.

Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects.

The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company level –
interact without error.

Functional test
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identifying business process flows, data fields, predefined processes, and successive processes must be considered for testing.

Before functional testing is complete, additional tests are identified and the effective value of
current tests is determined.

VALIDATION TESTING

The main aim of this testing is to verify that the software system does what it was designed for. The system was tested to ensure that it fulfils the purpose for which it was automated. Alpha testing was carried out to ensure the validity of the system.

OUTPUT TESTING

Asking the users about the format they require tests the outputs generated by the system under consideration. The output format on the screen was found to be correct, as the format was specified during system design. Output testing was done, and it did not result in any change or correction in the system.

USER ACCEPTANCE TESTING

User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional requirements.

The system under consideration is tested for user acceptance by constantly keeping in touch with prospective system users at the time of development and making changes whenever required. The following points are considered.

 Input screen design

 Output screen design

 Online message to guide the user

 Menu driven system

 Format of adhoc queries and reports

PERFORMANCE TESTING
Performance testing is taken up as the last part of implementation. Performance is perceived as response time for user queries, report generation, and process-related activities.

Test Cases:
We are following Ad-Hoc testing.

Test Case Id | Test Case Name | Test Case Description | Expected Result | Actual Result | Result (P/F) | Severity | Priority
001 | Login | Login stage | Verify login name, password, transaction code | Login successful | P | - | -
002 | Registration | Enter the required details and login | Enter name, gender, contact, address, email id, occupation, password | Registration successful | P | - | -
003 | Certificate creation | Certificate creation | Enter the details of transaction ID, name, contact no, certificate validity, amount, bank, and upload biometrics | Certificate creation successful | P | - | -
004 | Download certificate | Browse & download certificate | Verify username, transaction code, biometrics | Download certificate | P | - | -
005 | Upload file and encryption | File encryption and outsourcing | Enter certificate no, upload file, outsourcing server, and encrypted data | Encrypted file | P | - | -
006 | Server file list | Select the file & download | Server file list: show files and download | Download file successful | P | - | -
007 | Save file | Encrypted file | Save the file | Encrypted file saved | P | - | -

CONCLUSIONS

We have seen plenty of applications calling for a search engine that is able to efficiently
support novel forms of spatial queries that are integrated with keyword search. The existing
solutions to such queries either incur prohibitive space consumption or are unable to give real
time answers. In this paper, we have remedied the situation by developing an access method
called the spatial inverted index (SI-index). Not only is the SI-index fairly space economical, but it also has the ability to perform keyword-augmented nearest neighbor search in time that is on the order of dozens of milliseconds. Furthermore, as the SI-index is based on the conventional
technology of inverted index, it is readily incorporable in a commercial search engine that
applies massive parallelism, implying its immediate industrial merits.

REFERENCES

[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In Proc. of International Conference on Data Engineering (ICDE), pages 5–16, 2002.

[2] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust
access method for points and rectangles. In Proc. of ACM Management of Data (SIGMOD),
pages 322–331, 1990.

[3] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and
browsing in databases using banks. In Proc. of International Conference on Data Engineering
(ICDE), pages 431–440, 2002.

[4] X. Cao, L. Chen, G. Cong, C. S. Jensen, Q. Qu, A. Skovsgaard, D. Wu, and M. L. Yiu. Spatial keyword querying. In ER, pages 16–29, 2012.

[5] X. Cao, G. Cong, and C. S. Jensen. Retrieving top-k prestige-based relevant spatial web
objects. PVLDB, 3(1):373–384, 2010.

[6] X. Cao, G. Cong, C. S. Jensen, and B. C. Ooi. Collective spatial keyword querying. In Proc. of ACM Management of Data (SIGMOD), pages 373–384, 2011.

[7] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: an efficient data
structure for static support lookup tables. In Proc. of the Annual ACM-SIAM Symposium on
Discrete Algorithms (SODA), pages 30–39, 2004.

[8] Y.-Y. Chen, T. Suel, and A. Markowetz. Efficient query processing in geographic web search
engines. In Proc. of ACM Management of Data (SIGMOD), pages 277–288, 2006.

[9] E. Chu, A. Baid, X. Chai, A. Doan, and J. Naughton. Combining keyword search and forms
for ad hoc querying of databases. In Proc. of ACM Management of Data (SIGMOD), 2009.

[10] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant spatial web
objects. PVLDB, 2(1):337–348, 2009.

[11] C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and
its analytical performance evaluation. ACM Transactions on Information Systems (TOIS),
2(4):267–288, 1984.

[12] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In Proc. of
International Conference on Data Engineering (ICDE), pages 656–665, 2008.

[13] R. Hariharan, B. Hore, C. Li, and S. Mehrotra. Processing spatial-keyword (SK) queries in
geographic information retrieval (GIR) systems. In Proc. of Scientific and Statistical Database
Management (SSDBM), 2007.

[14] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Transactions
on Database Systems (TODS), 24(2):265–318, 1999.

[15] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. In Proc. of Very Large Data Bases (VLDB), pages 670–681, 2002.
