International Journal of Computer Trends and Technology (IJCTT) – volume4Issue8–August 2013

ISSN: 2231-2803 http://www.ijcttjournal.org Page 2810

A Comparative Study on Confidentiality by Search
Engines while Publishing Search Chunks
Thappita Suma Latha¹, Ch. Sandhya Rani², Betam Suresh³

¹ Thappita Suma Latha, pursuing M.Tech (CSE), Vikas Group of Institutions (formerly known as Mother Theresa Educational Society Group of Institutions), Nunna, Vijayawada, affiliated to JNTU Kakinada, A.P., India
² Ch. Sandhya Rani, Assistant Professor, Department of CSE, Vikas Group of Institutions (formerly known as Mother Theresa Educational Society Group of Institutions), Nunna, Vijayawada, India
³ Betam Suresh, HOD, Vikas Group of Institutions (formerly known as Mother Theresa Educational Society Group of Institutions), Nunna, Vijayawada, affiliated to JNTU Kakinada, A.P., India

Abstract- The Database of Intentions is the aggregate of every search ever entered, every result list ever returned, and every path taken as a result. In aggregate form, this information is a placeholder for the intentions of humankind: a massive database of desires, needs, wants, and likes that can be discovered, subpoenaed, archived, tracked, and exploited to all sorts of ends. Millions of people submit queries to search engines, and every user's searches differ. The collection of users' interactions with a search engine is known as a search chunk. Search engine companies collect these search chunks and want to publish them, because search chunks have become a gold mine for research. But the companies must take care to publish search chunks without disclosing their users' sensitive information. In this paper we analyze techniques for publishing users' search keywords, queries, and clicks from search chunks while limiting the disclosure of sensitive information. The existing approach, privacy-preserving data publishing (PPDP), has vulnerable variants that can lead to attacks on the search chunks. We present the algorithm ZEALOUS and report a large experimental study, using a real application, in which we compare the proposed and existing systems for achieving k-anonymity in search chunk publishing.
Keywords- Search engine, log, algorithm, privacy, k-anonymity.

I-INTRODUCTION
The present-day world is often called the web world. The web is a huge collection of information, and people find the information of their choice by searching it. Because of the web's vastness, it is not feasible to go through the whole web for a specific piece of information, so the search engine plays a vital role: search engines have the capability to navigate the vastness of the web. They not only index pages but also store users' search details in order to refine future searches. When a user interacts with the web, he first interacts with a search engine, submitting his query along with other details. Search engine companies store these queries, user details, user clicks, and so on; this accumulated data is known as a search chunk. Search chunks contain valuable information that search engines use to tailor their services to their users' needs. They enable the discovery of trends and patterns in the search behavior of users, and they can be used in the development and testing of new algorithms to improve search performance and quality. Scientists all over the world have a keen interest in search chunks for their own research, and they press search engine companies to publish them. But the companies do not release these logs, because they contain sensitive information about their users: for example, a user's searches may reveal fashion, lifestyle, personal tastes, or political affiliations.
The first and, so far, last release of a search chunk happened in 2006, when AOL published the search chunks of around 650,000 users; it became one of the biggest debacles in the search industry. The only protection given to user data was the replacement of each user ID with a random number. In this paper, we compare formal methods of limiting disclosure when publishing the frequent keywords, queries, and clicks of a search chunk. Each method limits disclosure in a different way. We first show how search chunks can be published with the existing PPDP approach to achieve k-anonymity, and we show that it is insufficient in the light of attackers who

can actively influence the search chunk. We then turn to differential privacy, a much stronger privacy guarantee; however, we show that it is impossible to achieve good utility under pure differential privacy. We then describe the ZEALOUS algorithm, which guarantees privacy preservation when publishing search chunks. Our paper concludes with an extensive experimental evaluation, in which we compare the utility of various algorithms that guarantee anonymity or privacy in search chunk publishing. Our evaluation includes applications that use search chunks to improve the search experience and search performance, and our results show that ZEALOUS' output is sufficient for these applications while achieving strong formal privacy guarantees.
We believe that these results enable search engine companies to make their search chunks available to researchers without disclosing their users' sensitive information. Beyond publishing search chunks, we believe our findings are of interest for publishing frequent item sets, as ZEALOUS protects privacy against much stronger attackers than those considered in existing work on privacy-preserving publishing of frequent items and item sets.


II-PROBLEM STATEMENT

In this section we discuss the problem of publishing the frequent keywords, clicks, user details, and other items of a search chunk.
A. Search chunk- A search chunk contains sensitive personal and non-personal information. Sensitive personal information includes information known to relate to confidential medical conditions, racial or ethnic origin, political or religious beliefs, or sexuality, tied to personal information. Non-personal information is information recorded about users in a form that no longer reflects or references an individually identifiable user. A user history, or search history, consists of all search entries of a single user. If K is a keyword in a query of a search chunk, we say that the log contains K. A keyword histogram is a set of pairs (K, C_K), where C_K is the number of users whose history contains K. For every item in the histogram we count the number of users, and if the count reaches a predefined threshold T, we call the item frequent. With this terminology, our goal is to publish frequent items without disclosing sensitive information.
B. Disclosure limitation in publishing search chunks- A disclosure is the identification of a particular user's search history in the published search chunk. The concept of k-anonymity has been introduced to avoid such identification: a search chunk is k-anonymous if the search history of every user is identical to the search histories of at least k−1 other users in the published search chunk. Stronger disclosure limitations try to limit what an attacker can learn about a user. Differential privacy guarantees that an attacker learns roughly the same information about a user whether or not the search history of that user is included in the search chunk. Differential privacy has previously been applied to contingency tables, learning problems, synthetic data generation, and more.
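The k-anonymity condition on search histories can be checked directly: group users by identical history and verify that every group has size at least k. A minimal sketch (representing each history as a set of items is our assumption):

```python
from collections import Counter

def is_k_anonymous(histories, k):
    """A search chunk is k-anonymous if every user's search history is
    identical to the histories of at least k-1 other users."""
    groups = Counter(frozenset(h) for h in histories)  # users per identical history
    return all(size >= k for size in groups.values())

chunk = [{"flu", "jobs"}, {"flu", "jobs"}, {"seminar"}, {"seminar"}]
print(is_k_anonymous(chunk, 2))  # True: each history is shared by 2 users
print(is_k_anonymous(chunk, 3))  # False: no history is shared by 3 users
```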
C. Utility measurement in publishing search chunks- Here we compare the utility of the sanitized search chunks that the methods produce. Traditionally, the utility of a privacy-preserving algorithm has been evaluated by comparing statistics of the input and the output to see how much information is lost. Choosing suitable statistics is difficult, because they need to mirror the sufficient statistics of the applications that will use the sanitized search chunk, and for some applications the sufficient statistics are hard to characterize. In this paper we take a similar approach, using two real applications from the information retrieval community: index caching, as a representative application for search performance, and query substitution, as a representative application for search quality. For both applications the sufficient statistics are histograms of keywords, queries, or query pairs.

D. Insufficiency of anonymity- k-anonymity and its variants prevent an attacker from uniquely identifying the user who corresponds to a search history in the sanitized search chunk. While k-anonymity offers great utility, even beyond releasing frequent items, its disclosure guarantee may not be satisfactory: even without uniquely identifying a user, an attacker can infer the keywords or queries used by that user, and k-anonymity does not protect against this severe information disclosure. A further issue arises with current implementations of k-anonymity. Instead of guaranteeing that the frequent keywords of k individuals are identical in a search chunk, they only assure that the frequent keywords associated with k different user IDs are identical. These two guarantees are not the same, because an individual can have multiple accounts or share an account. An attacker can exploit this loophole by creating multiple accounts and submitting fake queries many times.


E. Impossibility of differential privacy- Here we discuss the infeasibility of differential privacy in search chunk publication. In a realistic setting, no differentially private algorithm can produce a sanitized search chunk (a safe search chunk in which users' sensitive data is not revealed) with reasonable utility.

III-SYSTEM IMPLEMENTATION

In this section we describe a technique that overcomes the problems of previous techniques for publishing search chunks. The technique is realized by the ZEALOUS algorithm, which ensures probabilistic differential privacy and follows a simple two-phase framework. In the first phase

the algorithm generates a histogram of the items in the input search chunk and then removes from the histogram the items whose frequency is below a predefined threshold. In the second phase, it adds noise to the histogram counts and eliminates the items whose noisy frequencies are smaller than another threshold. The resulting sanitized histogram is returned as output.

ZEALOUS ALGORITHM- The algorithm for publishing a search chunk is as follows.

INPUT- Search chunk S; positive numbers m, λ, τ, τ′.

1- For each user u, select a set s_u of up to m distinct items from u's search history in S.

2- Based on the selected items, create a histogram consisting of pairs (k, C_k), where k denotes an item and C_k denotes the number of users u that have k in their selected history s_u. We call this histogram the original histogram.

3- Delete from the histogram the pairs (k, C_k) with count C_k smaller than τ.

4- For each pair (k, C_k) in the histogram, sample a random number p_k from the Laplace distribution Lap(λ), and add p_k to the count C_k, resulting in a noisy count C̃_k ← C_k + p_k.

5- Delete from the histogram the pairs (k, C̃_k) with noisy count C̃_k ≤ τ′.

6- Publish the remaining items and their noisy counts.




Fig 1- The ZEALOUS algorithm for publishing a search chunk
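The six steps of Fig. 1 translate directly into code. Below is a minimal Python sketch, assuming each user's history is given as a list of items; which m items are selected in step 1 is our choice (the algorithm leaves it open), and the Laplace sampler uses a standard inverse-transform construction:

```python
import math
import random
from collections import Counter

def laplace(scale):
    """Sample from Lap(scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def zealous(search_chunk, m, lam, tau, tau_prime):
    """search_chunk: dict mapping each user to the list of items in their history."""
    # Step 1: for each user, select up to m distinct items (here: first m seen).
    selected = {u: list(dict.fromkeys(items))[:m] for u, items in search_chunk.items()}
    # Step 2: original histogram of pairs (k, C_k).
    hist = Counter(k for items in selected.values() for k in items)
    # Step 3: delete pairs with count C_k smaller than tau.
    hist = {k: c for k, c in hist.items() if c >= tau}
    # Step 4: add Laplace noise to each count.
    noisy = {k: c + laplace(lam) for k, c in hist.items()}
    # Steps 5-6: publish only items whose noisy count exceeds tau_prime.
    return {k: c for k, c in noisy.items() if c > tau_prime}

# Illustrative chunk: 100 users share two frequent queries, 2 users a rare one.
chunk = {u: ["weather", "jobs"] for u in range(100)}
chunk.update({100 + u: ["rare query"] for u in range(2)})
published = zealous(chunk, m=2, lam=2.0, tau=1, tau_prime=10.0)
# The frequent items (count 100) almost surely survive the noisy threshold;
# "rare query" (count 2) almost surely does not.
```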
ZEALOUS achieves (ε′, δ′)-indistinguishability for a search chunk S and positive numbers m, τ, τ′, and λ if

λ ≥ 2m/ε′, and (1)
τ = 1, and (2)
τ′ ≥ m(1 − log(2δ′/m)). (3)


The following table compares (ε′, δ′)-indistinguishability with (ε, δ) probabilistic differential privacy for around 200,000 users and m = 7:

Privacy guarantee     | τ′ = 50           | τ′ = 100
λ = 2 (ε, ε′ = 10)    | δ  = 1.3×10^-18   | δ  = 4.7×10^-41
                      | δ′ = 1.4×10^-21   | δ′ = 5.2×10^-43
λ = 5 (ε, ε′ = 5)     | δ  = 1.3×10^-18   | δ  = 1.3×10^-18
                      | δ′ = 1.3×10^-18   | δ′ = 1.3×10^-18


We now discuss how to set the parameters λ and τ′ to ensure that ZEALOUS achieves (ε, δ) probabilistic differential privacy. For a search chunk S and positive numbers m, τ, τ′, λ, ZEALOUS achieves (ε, δ) probabilistic differential privacy if



λ ≥ 2m/ε, and (4)
τ′ − τ ≥ max(−λ ln(2 − 2e^(−1/λ)), −λ ln(2δ/(U·m))), (5)

where U denotes the number of users in the search chunk.
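The parameter setting above can be computed mechanically from ε, δ, m, and the number of users. The sketch below follows our reconstruction of conditions (4) and (5) from the garble-prone source, so treat the exact constants as assumptions:

```python
import math

def zealous_parameters(eps, delta, m, num_users, tau=1):
    """Return (lam, tau_prime) satisfying conditions (4) and (5):
    lam >= 2m/eps and tau_prime - tau >= max(-lam*ln(2 - 2e^(-1/lam)),
    -lam*ln(2*delta/(num_users*m)))."""
    lam = 2.0 * m / eps                                    # condition (4), tight
    gap = max(-lam * math.log(2.0 - 2.0 * math.exp(-1.0 / lam)),
              -lam * math.log(2.0 * delta / (num_users * m)))
    return lam, tau + gap                                  # condition (5), tight

lam, tau_prime = zealous_parameters(eps=5.0, delta=1e-3, m=7, num_users=200_000)
print(lam, tau_prime)  # lam = 2.8; tau_prime is well above tau
```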

The utility guarantee of ZEALOUS, in terms of accuracy, is that it is accurate for very frequent items and provides perfect accuracy for infrequent items. Our (ε, δ) probabilistic differentially private algorithm ZEALOUS retains frequent items with probability at least 1/2 while filtering out all infrequent items. On the other hand, any ε-differentially private algorithm that retains frequent items with non-zero probability has, for large item domains, a larger inaccuracy than an algorithm that always outputs an empty set.





IV-EXPERIMENTAL EVALUATION

In this section we discuss our implementation of privacy-preserving search chunk publishing. We then compare the privacy-preserving search chunk generated by ZEALOUS with the search chunk generated by PPDP k-anonymity, taking the original search chunk as the point of comparison. We compare not only the utility of the two algorithms but also their disclosure limitation guarantees. We experimentally compare the algorithms by publishing the search chunk of a portal that offers job search, study material details, and seminar details. An administrator adds all the details regarding jobs, study materials, and seminars, and users search for them according to their needs. The collection of the users' search details and personal details becomes the search chunk. We publish this search chunk with the ZEALOUS algorithm, obtaining high utility and good privacy guarantees.
There are two ways to measure the performance of the algorithm. First, we evaluate how well the output of the algorithm preserves selected statistics of the original search chunk. Second, we check the utility of ZEALOUS with our applications: index caching, as a representative application for search performance, and query substitution, as a representative application for search quality.

A. Utility Evaluation-


Fig 2- Difference between the counts of k-anonymity and ZEALOUS, with the original histogram as reference

The graph above shows the utility evaluation of k-anonymity and ZEALOUS, taking the original search chunk as the point of comparison. As expected, the average difference decreases with increasing ε, since the noise added to each count decreases. Similarly, decreasing k increases accuracy, because more queries pass the threshold. We also computed other metrics, such as the root-mean-square value of the differences and the total variation difference; they all reveal similar qualitative trends. Despite the fact that ZEALOUS disregards many search chunk records, it preserves the overall distribution well.

B. Index caching- In the index caching problem we store in memory a set of posting lists that maximizes the hit probability over all keywords. We propose a method for deciding which lists to keep in memory. The algorithm first assigns each keyword a score, equal to its frequency in the search chunk divided by the number of documents that contain the keyword. Keywords are then chosen with a greedy bin-packing strategy: we sequentially add the posting lists of the keywords with the highest scores until the memory is filled. In our experiments we fixed the memory size at 200 MB and the size of each document posting at 4 bytes. Our proposed index stores the document posting list of each keyword sorted by relevance, which allows retrieving documents in order of relevance; we truncate each in-memory list to at most 10,000 documents. For an incoming query, the search engine retrieves the posting list of each keyword in the query either from memory or from disk. If the intersection of the posting lists happens to be empty, less relevant documents are retrieved from disk for those keywords for which only the truncated posting list is kept in memory. Publishing the search chunk with our ZEALOUS algorithm achieves better utility than publishing it with PPDP k-anonymity for a

range of parameters. We have seen that increasing the privacy or anonymity parameter makes utility suffer only marginally. This can be explained by the fact that only a few very frequent keywords are needed to achieve a high hit probability: keywords with a big positive impact on the hit probability are less likely to be filtered out by ZEALOUS than keywords with a small positive impact, which explains the marginal decrease in utility for increased privacy.
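The scoring and greedy bin-packing step described above can be sketched as follows. The corpus statistics and memory budget below are illustrative, not the paper's 200 MB setup:

```python
def choose_cached_keywords(freq, doc_count, posting_bytes, memory_budget):
    """Score each keyword as (frequency in the search chunk) / (number of
    documents containing it), then greedily add the posting lists of the
    highest-scoring keywords that still fit in the memory budget."""
    scores = {k: freq[k] / doc_count[k] for k in freq}
    cached, used = [], 0
    for k in sorted(scores, key=scores.get, reverse=True):
        cost = doc_count[k] * posting_bytes   # bytes needed for k's posting list
        if used + cost <= memory_budget:
            cached.append(k)
            used += cost
    return cached

freq = {"python": 900, "jobs": 600, "seminar": 100}       # query frequencies
doc_count = {"python": 300, "jobs": 100, "seminar": 400}  # documents per keyword
print(choose_cached_keywords(freq, doc_count, posting_bytes=4,
                             memory_budget=1600))  # → ['jobs', 'python']
```

Here "jobs" wins first (score 6.0) and "python" (score 3.0) still fits; "seminar" scores lowest and its 1600-byte list no longer fits.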

C. Query substitution- In this section we discuss how a query substitution algorithm examines query pairs to learn how users re-phrase queries. Our algorithm generates related queries for a given query in two phases. In the first phase, the original query is partitioned into subsets of keywords, which we call phrases, based on their mutual information. In the second phase, query substitutions are determined for each phrase based on the distribution of queries. We run this algorithm to generate ranked substitutions on the sanitized search chunks and then compare these rankings with the rankings produced from the original search chunk, which serve as ground truth. To measure the quality of the query substitutions, we compute the precision, recall, MAP (mean average precision), and NDCG (normalized discounted cumulative gain) of the top suggestions for each query.
Consider a query q, the list of top-ranked substitutions q′_0, …, q′_{l−1} computed from the sanitized search chunk, and the ground-truth substitutions q_0, …, q_{l−1} computed from the original search chunk. The precision and recall of a query from the sanitized search chunk are

Precision(q) = |{q_0, …, q_{l−1}} ∩ {q′_0, …, q′_{l−1}}| / |{q′_0, …, q′_{l−1}}|

Recall(q) = |{q_0, …, q_{l−1}} ∩ {q′_0, …, q′_{l−1}}| / |{q_0, …, q_{l−1}}|

and

MAP(q) = Σ_{i=0}^{l−1} (i + 1) / (rank of q_i in [q′_0, …, q′_{l−1}] + 1).

V- CONCLUSION

Publishing search chunks is very useful for researchers and scientists, but search chunks contain sensitive information, and previously proposed techniques were not sufficient for publishing them. In this paper we introduced a technique for publishing search chunks based on the ZEALOUS algorithm and showed that with it we can publish search chunks efficiently and safely, with high utility and low disclosure probability, without revealing sensitive information. We compared the proposed technique with previous techniques, and the comparison results show that publishing search chunks with the proposed technique is much more efficient and safe than with anonymity techniques.



AUTHORS PROFILE




Thappita Sumalatha, pursuing M.Tech (CSE) from Vikas Group of Institutions (formerly known as Mother Theresa Educational Society Group of Institutions), Nunna, Vijayawada, affiliated to JNTU-Kakinada, A.P., India.




Ch. Sandhya Rani, working as an Assistant Professor in the CSE department at Vikas Group of Institutions, Nunna, Vijayawada, affiliated to JNTU-Kakinada, A.P., India.

Betam Suresh, working as the HOD of the Department of Computer Science Engineering at Vikas Group of Institutions (formerly Mother Teresa Educational Society Group of Institutions), Nunna, Vijayawada, affiliated to JNTU-Kakinada, A.P., India.