
2009 IEEE International Advance Computing Conference (IACC 2009)

Patiala, India, 6-7 March 2009

A Novel and Efficient Approach For Near Duplicate Page Detection in Web Crawling

V.A. Narayana
Associate Professor & Head, CSE Department
CMR College of Engineering & Technology, JNTU
Hyderabad, India
narayanaphd@gmail.com

P. Premchand
Professor, Department of Computer Science & Engineering
University College of Engineering, Osmania University
Hyderabad-500007, AP, India
p.premchand@uceou.edu

Dr. A. Govardhan
Professor & Head of the Department
Department of Computer Science and Engineering
JNTU College of Engineering, Kukatpally
Hyderabad, India

Abstract - The drastic development of the World Wide Web in recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to the web search engines, making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overheads for the search engines, critically affecting their performance and quality. The detection of duplicate and near duplicate web pages has long been recognized in the web crawling research community. It is an important requirement for search engines to provide users with the relevant results for their queries in the first page, without duplicate and redundant results. In this paper, we have presented a novel and efficient approach for the detection of near duplicate web pages in web crawling. Detection of near duplicate web pages is carried out ahead of storing the crawled web pages in repositories. At first, the keywords are extracted from the crawled pages and the similarity score between two pages is calculated based on the extracted keywords. The documents having similarity scores greater than a threshold value are considered as near duplicates. The detection has resulted in reduced memory for repositories and improved search engine quality.

Keywords - Web Mining; Web Content Mining; Web Crawling; Web pages; Stemming; Common words; Near duplicate pages; Near duplicate detection

I. INTRODUCTION

The employment of automated tools to locate the information resources of interest, and for tracking and analyzing the same, has become inevitable these days owing to the drastic development of the Internet. This has made mandatory the development of server-side and client-side intelligent systems that can effectively mine for knowledge [10]. The mining and analysis of the data of the World Wide Web is known as Web mining. Web mining owes its origin to concepts from areas such as Data Mining, Internet technology and the World Wide Web, and lately, the Semantic Web [6], [30]. Web mining includes the sub areas web content mining [10], web structure mining [17] and web usage mining [25], and can be defined as the procedure of determining hidden yet potentially beneficial knowledge from the data accessible in the web. The process of mining knowledge from web pages and other web objects is known as Web content mining. Web structure mining is the process of mining knowledge about the link structure connecting web pages and other web objects. The mining of usage patterns created by the users accessing the web pages is called Web usage mining [27].

The World Wide Web owes its development to Search engine technology. Search engines are the chief gateways for access of information in the web. Businesses have turned beneficial and productive with the ability to locate contents of particular interest amidst a huge heap [34]. Web crawling, a process that populates an indexed repository of web pages, is utilized by the search engines in order to respond to the queries [32]. Web crawlers are the programs that navigate the web graph and retrieve pages to construct a confined repository of the segment of the web that they visit. Earlier, these programs were known by diverse names such as wanderers, robots, spiders, fish, and worms, words in accordance with the web imagery [31].

Generic and Focused crawling are the two main types of crawling. Generic crawlers [1] differ from focused crawlers [28] in that the former crawl documents and links of diverse topics, whereas the latter limits the number of pages crawled with the aid of some prior specialized knowledge. Repositories of web pages are built by web crawlers so as to present input for systems that index, mine, and otherwise analyze pages (for instance, the search engines) [9]. The subsistence of near duplicate data is an issue that accompanies the drastic development of the Internet and the growing need to incorporate heterogeneous data [36]. Even though near duplicate data are not bit wise identical, they bear a striking similarity [36]. Web search engines face huge problems due to duplicate and near duplicate web pages. These pages either increase the index storage space or slow down or increase the serving costs, thereby irritating the users. Thus algorithms for detecting such pages are inevitable [24]. Web crawling issues such as freshness and efficient resource usage have been addressed previously [5], [14], [11]. Lately, the elimination of duplicate and near duplicate web documents has become a vital issue and has attracted significant research [20].

Identification of the near duplicates can be advantageous to many applications. Focused crawling, enhanced quality and diversity of the query results, and identification of spam can be facilitated by determining the near duplicate web pages [ , 19, 24]. Numerous web mining applications depend on the accurate and proficient identification of near duplicates. Document clustering, detection of replicated web collections [12], detecting plagiarism [23], community mining in a social network site [35], collaborative filtering [7] and discovering large dense graphs [21] are a notable few among those applications. Reduction in storage costs and enhancement in the quality of search indexes, besides considerable bandwidth conservation, can be achieved by eliminating the near duplicate pages [1]. Checksumming techniques can determine the documents that are precise duplicates (because of mirroring or plagiarism) of each other [13]. The recognition of near duplicates is a tedious problem.

Research on duplicate detection was initially done on databases, digital libraries, and electronic publishing. Lately duplicate detection has been extensively studied for the sake of numerous web search tasks such as web crawling, document ranking, and document archiving. A huge number of duplicate detection techniques ranging from manually coded rules to cutting edge machine learning techniques have been put forth [36, 24, 18, 2, 3, 15, 38]. Recently a few authors have proposed near duplicate detection techniques [37, 38, 8, 22]. A variety of issues, from providing high detection rates to minimizing the computational and storage resources, have been addressed by them. These techniques vary in their accuracy as well. Some of these techniques are computationally too expensive to be implemented completely on huge collections. Even though some of these algorithms prove to be efficient, they are fragile and so are susceptible to minute changes of the text.

The primary intent of our research is to develop a novel and efficient approach for detection of near duplicates in web documents. Initially the crawled web pages are preprocessed using document parsing, which removes the HTML tags and java scripts present in the web documents. This is followed by the removal of common words or stop words from the crawled pages. Then the stemming algorithm is applied to filter the affixes (prefixes and suffixes) of the crawled documents in order to get the keywords. Finally, the similarity score between two documents is calculated on the basis of the extracted keywords. The documents with similarity scores greater than a predefined threshold value are considered as near duplicates. We have conducted an extensive experimental study using several real datasets, and have demonstrated that the proposed algorithms outperform previous ones.

The rest of the paper is organized as follows. Section 2 presents a brief review of some approaches available in the literature for duplicate and near duplicate detection. In Section 3, the novel approach for the detection of near duplicate documents is presented. The conclusions are summed up in Section 4.

II. RELATED WORK

Our work has been inspired by a number of previous works on duplicate and near duplicate document and web page detection.

Sergey Brin et al. [2] have proposed a system for registering documents and then detecting copies, either complete copies or partial copies. They described algorithms for detection, and metrics required for evaluating detection mechanisms, covering accuracy, efficiency and security. They also described COPS, a prototype implementation of the service, and presented experimental results that suggest the service can indeed detect violations of interest.

Andrei Z. Broder et al. [3] have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using their mechanism, they have built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web pages, and identifying violations of intellectual property rights.

Jack G. Conrad et al. [15] have determined the extent and the types of duplication existing in large textual collections. Their research is divided into three parts. Initially they started with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. Then they examined the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, they have investigated a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. Their method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.

Donald Metzler et al. [29] have explored mechanisms for measuring intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. They proposed a range of approaches to reuse detection at the sentence level, and a range of approaches for combining sentence-level evidence into document-level evidence. They considered both sentence-to-sentence and document-to-document comparison, and have incorporated the algorithms into RECAP, a prototype information flow analysis tool.

Hui Yang et al. [37] have explored the use of simple text clustering and retrieval algorithms for identifying near-duplicate public comments. They have focused on automating the process of near-duplicate detection, especially form letter detection. They gave a clear near-duplicate definition and explored simple and efficient methods of using feature-based document retrieval and similarity-based clustering to discover near-duplicates. The methods were evaluated in experiments with a subset of a large public comment database collected for an EPA rule.

Monika Henzinger [24] has compared two algorithms, namely the shingling algorithm [3] and the random projection based approach [13], on a very large scale set of 1.6B distinct web pages. The results showed that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. She has presented a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.

Hui Yang et al. [38] have presented DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. The results have demonstrated that statistical similarity measures and instance-level constrained clustering can be quite effective for efficiently identifying near-duplicates.

In the course of developing a near-duplicate detection system for a multi-billion page repository, Gurmeet Singh Manku et al. [22] have made two research contributions. First, they demonstrated that Charikar's [13] fingerprinting technique is appropriate for this goal. Second, they presented an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. This technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints).

Ziv Bar-Yossef et al. [8] have considered the problem of DUST: Different URLs with Similar Text. They proposed a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents.

Chuan Xiao et al. [36] have proposed exact similarity join algorithms with application to near duplicate detection and a positional filtering principle, which exploits the ordering of tokens in a record and leads to upper bound estimates of similarity scores. They demonstrated the superior performance of their algorithms over the existing prefix filtering-based algorithms on several real datasets under a wide range of parameter settings.
III. NOVEL APPROACH FOR NEAR DUPLICATE WEBPAGE DETECTION

A novel approach for the detection of near duplicate web pages is presented in this section. In web crawling, the crawled web pages are stored in a repository for further processes such as search engine formation, page validation, structural analysis and visualization, update notification, mirroring, personal web assistants or agents, and more. Duplicate and near duplicate web page detection is an important step in web crawling. In order to enable search engines to provide search results free of redundancy to users and to provide distinct and useful results on the first page, duplicate and near duplicate detection is essential. Numerous challenges are encountered by the systems that aid in the detection of near duplicate pages. First is the concern of scale, since the search engines index hundreds of millions of web pages, thereby amounting to a multi-terabyte database. Next is the issue of making the crawl engine crawl billions of web pages every day. Thus marking a page as a near duplicate should be done at a quicker pace. Furthermore, the system should utilize a minimal number of machines [22].

The near duplicate detection is performed on the keywords extracted from the web documents. First, the crawled web documents are parsed to extract the distinct keywords. Parsing includes removal of HTML tags, java scripts and stop words/common words, and stemming of the remaining words. The extracted keywords and their counts are stored in a table to ease the process of near duplicates detection. The keywords are stored in the table in a way that the search space is reduced for the detection. The similarity score of the current web document against a document in the repository is calculated from the keywords of the pages. The documents with similarity score greater than a predefined threshold are considered as near duplicates.
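The workflow just described amounts to an admission check at crawl time: a page enters the repository only if no stored page is judged a near duplicate of it. The sketch below is illustrative only; the function names, the toy keyword extraction and the set-overlap score are our own simplifications standing in for the parsing, stemming and similarity measure defined in the following subsections.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "for", "with", "from"}

def keyword_profile(text, top_n=4):
    """Toy keyword extraction: lowercase, drop stop words, count the rest.
    Stands in for the parsing, stop listing and stemming steps described later."""
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]
    counts = Counter(words)
    prime = frozenset(w for w, _ in counts.most_common(top_n))
    return prime, counts

def admit_page(text, repository, threshold=0.75):
    """Store a crawled page only if no already-stored page is a near duplicate of it."""
    prime, counts = keyword_profile(text)
    for stored_prime, stored_counts in repository:
        if prime != stored_prime:
            continue  # prime keywords differ, so this pair is never scored (search-space pruning)
        union = counts.keys() | stored_counts.keys()
        overlap = len(counts.keys() & stored_counts.keys()) / max(len(union), 1)
        if overlap > threshold:   # stand-in score; Section F defines the actual SSM measure
            return False          # near duplicate: do not add to the repository
    repository.append((prime, counts))
    return True

repo = []
print(admit_page("web crawlers download web pages from the web", repo))        # True: stored
print(admit_page("web crawlers download web pages from the web today", repo))  # False: near duplicate
```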
A. Near Duplicate Web Documents

Even though near duplicate documents are not bitwise identical, they bear striking similarities. Near duplicates are not considered "exact duplicates" but are files with minute differences. Typographical errors; versioned, mirrored, or plagiarized documents; multiple representations of the same physical object; spam emails generated from the same template; and many such phenomena may result in near duplicate data. A considerable percentage of web pages have been identified as near-duplicates according to various studies [3, 19, 24]. These studies propose that near duplicates constitute almost 1.7% to 7% of the web pages traversed by crawlers. The steps involved in our approach are presented in the following subsections.

B. Web Crawling

The analysis of the structure and informatics of the web is facilitated by a data collection technique known as Web Crawling. The collection of as many beneficial web pages as possible, along with their interconnection links, in a speedy yet proficient manner is the prime intent of crawling. Automatic traversal of web sites, downloading documents and tracing links to other pages are some of the features of a web crawler program. Numerous search engines utilize web crawlers for gathering web pages of interest, besides indexing them. Web crawling becomes a tedious process due to the following features of the web: the large volume and the huge rate of change, with a voluminous number of pages being added or removed each day.

Seed URLs are a set of URLs that a crawler begins working with. These URLs are queued. A URL is obtained in some order from the queue by the crawler. Then the crawler downloads the page. This is followed by extracting the URLs from the downloaded page and enqueuing them. The crawling continues unless the crawler settles to stop [33]. A crawling loop consists of obtaining a URL from the queue, downloading the corresponding file with the aid of HTTP, extracting the URLs from it and adding the unvisited URLs to the queue [31].
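A minimal version of this seed-URL loop, using only the Python standard library, might look as follows. The breadth-first queue, the page limit and the politeness delay are our own illustrative choices rather than details given in the paper.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50, delay=1.0):
    """Breadth-first crawl: dequeue a URL, download it, enqueue its unseen links."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or malformed pages
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # politeness delay between requests
    return pages
```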
C. Web Document Parsing

Information extracted from the crawled documents aids in determining the future path of a crawler. Parsing may either be as simple as hyperlink/URL extraction or as complex as the analysis of HTML tags and cleaning of the HTML content [31].

It is inevitable for a parser that has been designed to traverse the entire web to encounter numerous errors. The parser obtains information from a web page by not considering a few common words such as a, an, the and more, HTML tags, Java scripting and a range of other bad characters [16].
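A small text extractor in this spirit is sketched below. The use of Python's html.parser and the decision to drop script and style content entirely are our assumptions for illustration, not details taken from the paper.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strips tags and ignores script/style blocks, keeping only visible text."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Example: tags and script content are discarded, visible text is kept.
print(visible_text("<html><head><script>var x=1;</script></head>"
                   "<body><p>Hello <b>web</b></p></body></html>"))  # -> "Hello web"
```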
1) Stop Words Removal: It is necessary and beneficial to remove the commonly utilized stop words such as "it", "can", "an", "and", "by", "for", "from", "of", "the", "to", "with" and more, either while parsing a document to obtain information about its content or while scoring fresh URLs that the page recommends. This procedure is termed stop listing [31]. Stop listing aids in the reduction of the size of the indexing file, besides enhancing efficiency and value.
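Stop listing can be a single filtering pass over the token stream. The short stop-word set below is only a sample built from the words quoted above, not the authors' full list.

```python
STOP_WORDS = {
    "a", "an", "the", "it", "can", "and", "by", "for",
    "from", "of", "to", "with",
}

def remove_stop_words(tokens):
    """Drop common words that carry little content before keyword extraction."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Example: only the content-bearing words remain after filtering.
print(remove_stop_words("the crawler downloads a page from the web".split()))
# -> ['crawler', 'downloads', 'page', 'web']
```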
D. Stemming Algorithm

Variant word forms in Information Retrieval are reduced to a common root by Stemming. The postulation lying behind it is that two words possessing the same root represent identical concepts. Thus terms possessing identical meaning yet appearing morphologically dissimilar are identified in an IR system by matching query and document terms with the aid of Stemming [4]. Stemming facilitates the reduction of all words possessing an identical root to a single one. This is achieved by stripping each word of its derivational and inflectional suffixes [26]. For instance, "connect," "connected" and "connection" are all condensed to "connect".
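The paper cites Lovins [26] for stemming; the crude suffix stripper below is only a minimal stand-in that reproduces the "connect" example, not the stemmer the authors used. In practice an established stemmer (Lovins, Porter, or a library implementation) would be substituted here.

```python
SUFFIXES = ("ational", "tional", "ations", "ingly", "ation", "ions",
            "ing", "ion", "ed", "ly", "es", "s")  # ordered longest-first

def crude_stem(word):
    """Strip the longest matching suffix, keeping at least a 3-letter root.
    A toy stand-in for a real stemming algorithm such as Lovins or Porter."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "connect", "connected" and "connection" all reduce to the same root.
print([crude_stem(w) for w in ("connect", "connected", "connection")])
# -> ['connect', 'connect', 'connect']
```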
We possess the distinct keywords and their counts for each crawled web page as a result of stemming. These keywords are then represented in a form that eases the process of near duplicates detection. This representation reduces the search space for the near duplicate detection. Initially the keywords and their numbers of occurrences in a web page are sorted in descending order based on their counts. Afterwards, the n keywords with the highest counts are stored in one table and the remaining keywords are indexed and stored in another table. In our approach the value of n is set to 4. The similarity score between two documents is calculated if and only if the prime keywords of the two documents are similar. Thus the search space is reduced for near duplicates detection.
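One way to realize this two-table layout is sketched below; the tuple-based prime table and the dictionary used for the secondary index are our own choices for illustration. Two pages are compared only when their prime tables hold the same keywords, which is how the search space is pruned.

```python
from collections import Counter

def build_keyword_tables(keywords, n=4):
    """Split a page's keywords into a prime table (the n highest counts)
    and an indexed table holding the remaining keywords."""
    counts = Counter(keywords)
    ordered = counts.most_common()            # sorted by descending count
    prime_table = ordered[:n]                 # [(keyword, count), ...]
    rest_table = {kw: (idx, cnt)              # keyword -> (index, count) for later scoring
                  for idx, (kw, cnt) in enumerate(ordered[n:], start=1)}
    return prime_table, rest_table

prime, rest = build_keyword_tables(
    ["crawl", "crawl", "page", "page", "page", "index", "web", "web", "dust"])
print(prime)  # the four most frequent keywords with their counts
print(rest)   # remaining keywords with their positions and counts
```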
F. Similarity Score Calculation

If the prime keywords of the new web page do not match the prime keywords of any page in the table, then the new web page is added into the repository. If all the keywords of both pages are the same, then the new page is considered a duplicate and thus is not included in the repository. If the prime keywords of the new page are the same as those of a page in the repository, then the similarity score between the two documents is calculated. The similarity score of two web documents is calculated as follows.

Let T1 and T2 be the tables containing the extracted keywords and their corresponding counts:

T1:   Keyword:  K1   K2   K4   K5   ...   K11
      Count:    C1   C2   C4   C5   ...   C11

T2:   Keyword:  K1   K3   K2   K4   ...   Kn
      Count:    C1   C3   C2   C4   ...   Cn

The keywords in the tables are considered individually for the similarity score calculation. If a keyword Ki is present in both tables, the formula used to calculate the similarity score of the keyword is

a = A[Ki]T1 ,  b = A[Ki]T2

SDC = log(count(a) / count(b)) * Abs(1 + (a - b))

Here 'a' and 'b' represent the index of the keyword in the two tables respectively.

If T1 \ T2 != φ, we use the following formula to calculate the similarity score; the number of keywords present in T1 but not in T2 is taken as NT1:

SDT1 = log(count(a)) * (1 + |T2| / |T1|)

If T2 \ T1 != φ, we use the below mentioned formula to calculate the similarity score; the number of keywords present in T2 but not in T1 is taken as NT2:

SDT2 = log(count(b)) * (1 + |T1| / |T2|)

The similarity score measure (SSM) of a page against another page is calculated by using the following equation:

SSM = ( Σ SDC + Σ SDT1 + Σ SDT2 ) / N

where the first sum runs over the NC keywords common to both tables, the second over the NT1 keywords present only in T1, and the third over the NT2 keywords present only in T2; n = |T1 ∪ T2| and N = (|T1| + |T2|) / 2. The web documents with similarity score greater than a predefined threshold are near duplicates of documents already present in the repository. These near duplicates are not added into the repository for further processes such as search engine indexing.
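The formulas above, as recovered from the text, translate directly into code as follows. Because parts of the printed equations are hard to read, the exact form of SDT1 and SDT2 (in particular the ratio terms) and the placement of the absolute value in SDC are our reconstruction and should be checked against the authors' definitive version; keyword positions are taken as 1-based indices in each table.

```python
import math

def similarity_score(t1, t2):
    """t1, t2: lists of (keyword, count) sorted by descending count.
    Returns SSM following the SDC / SDT1 / SDT2 decomposition as printed above."""
    pos1 = {kw: (i + 1, c) for i, (kw, c) in enumerate(t1)}  # keyword -> (index, count)
    pos2 = {kw: (i + 1, c) for i, (kw, c) in enumerate(t2)}

    common = pos1.keys() & pos2.keys()
    only1 = pos1.keys() - pos2.keys()
    only2 = pos2.keys() - pos1.keys()

    # SDC: keyword present in both tables.
    s_dc = sum(math.log(pos1[k][1] / pos2[k][1]) * abs(1 + (pos1[k][0] - pos2[k][0]))
               for k in common)
    # SDT1: keyword present only in T1; SDT2: keyword present only in T2.
    s_dt1 = sum(math.log(pos1[k][1]) * (1 + len(t2) / len(t1)) for k in only1)
    s_dt2 = sum(math.log(pos2[k][1]) * (1 + len(t1) / len(t2)) for k in only2)

    n_avg = (len(t1) + len(t2)) / 2.0   # N = (|T1| + |T2|) / 2
    return (s_dc + s_dt1 + s_dt2) / n_avg

t1 = [("crawl", 5), ("web", 3), ("page", 2), ("index", 1)]
t2 = [("crawl", 4), ("page", 3), ("web", 2), ("dust", 1)]
print(similarity_score(t1, t2))  # compare against a chosen threshold to flag near duplicates
```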
IV. CONCLUSION

Though the web is a huge information store, various features such as the presence of a huge volume of unstructured or semi-structured data, their dynamic nature, and the existence of duplicate and near duplicate documents pose serious difficulties for Information Retrieval. The voluminous amounts of web documents swarming the web have posed a huge challenge to the web search engines, making them render results of less relevance to the users. The detection of duplicate and near duplicate web documents has gained more attention in recent years amidst the web mining researchers. In this paper, we have presented a novel and efficient approach for the detection of near duplicate web documents in web crawling. The proposed approach detects the duplicate and near duplicate web pages efficiently based on the keywords extracted from the web pages. Furthermore, reduced memory space for web repositories and improved search engine quality have been accomplished through the proposed duplicate detection approach.

REFERENCES

[1] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S., (2001) "Searching the web", ACM Transactions on Internet Technology, vol. 1, no. 1, pp. 2-43.
[2] Brin, S., Davis, J., and Garcia-Molina, H., (1995) "Copy detection mechanisms for digital documents", In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), ACM Press, pp. 398-409, May.
[3] Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G., (1997) "Syntactic clustering of the web", Computer Networks, vol. 29, no. 8-13, pp. 1157-1166.
[4] Bacchin, M., Ferro, N., and Melucci, M., (2002) "Experiments to evaluate a statistical stemming algorithm", Proceedings of the International Cross-Language Evaluation Forum Workshop, University of Padua at CLEF 2002.
[5] Broder, A. Z., Najork, M., and Wiener, J. L., (2003) "Efficient URL caching for World Wide Web crawling", Proceedings of the 12th International Conference on World Wide Web, pp. 679-689.
[6] Berendt, B., Hotho, A., Mladenic, D., Spiliopoulou, M., and Stumme, G., (2004) "A Roadmap for Web Mining: From Web to Semantic Web", Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin; Heidelberg; New York, Vol. 3209, pp. 1-22, ISBN: 3-540-23258-3.
[7] Bayardo, R. J., Ma, Y., and Srikant, R., (2007) "Scaling up all pairs similarity search", In Proceedings of the 16th International Conference on World Wide Web, pp. 131-140.
[8] Bar-Yossef, Z., Keidar, I., and Schonfeld, U., (2007) "Do not crawl in the DUST: different URLs with similar text", Proceedings of the 16th International Conference on World Wide Web, pp. 111-120.
[9] Balamurugan, S., Newlin Rajkumar, and Preethi, J., (2008) "Design and Implementation of a New Model Web Crawler with Enhanced Reliability", Proceedings of World Academy of Science, Engineering and Technology, volume 32, August, ISSN 2070-3740.
[10] Cooley, R., Mobasher, B., and Srivastava, J., (1997) "Web mining: information and pattern discovery on the World Wide Web", Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 558-567, 3-8 November, DOI: 10.1109/TAI.1997.632303.
[11] Cho, J., Garcia-Molina, H., and Page, L., (1998) "Efficient crawling through URL ordering", Computer Networks and ISDN Systems, vol. 30, no. 1-7, pp. 161-172.
[12] Cho, J., Shivakumar, N., and Garcia-Molina, H., (2000) "Finding replicated web collections", ACM SIGMOD Record, Vol. 29, no. 2, pp. 355-366.
[13] Charikar, M., (2002) "Similarity estimation techniques from rounding algorithms", In Proc. 34th Annual Symposium on Theory of Computing (STOC 2002), pp. 380-388.
[14] Chakrabarti, S., (2002) "Mining the Web: Discovering Knowledge from Hypertext Data", Morgan-Kaufmann Publishers, pp. 352, ISBN 1-55860-754-4.
[15] Conrad, J. G., Guo, X. S., and Schriber, C. P., (2003) "Online duplicate document detection: signature reliability in a dynamic retrieval environment", Proceedings of the Twelfth International Conference on Information and Knowledge Management, New Orleans, LA, USA, pp. 443-452.
[16] Castillo, C., (2005) "Effective web crawling", SIGIR Forum, ACM Press, Volume 39, Number 1, pp. 55-56.
[17] David, G., Jon, K., and Prabhakar, R., (1998) "Inferring web communities from link topology", Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: links, objects, time and space - structure in hypermedia systems, Pittsburgh, Pennsylvania, United States, pp. 225-234.
[18] Deng, F., and Rafiei, D., (2006) "Approximately detecting duplicates for streaming data using stable bloom filters", Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 25-36.
[19] Fetterly, D., Manasse, M., and Najork, M., (2003) "On the evolution of clusters of near-duplicate web pages", Proceedings of the First Conference on Latin American Web Congress, pp. 37.
[20] Garcia-Molina, H., Ullman, J. D., and Widom, J., (2000) "Database System Implementation", Prentice Hall.
[21] Gibson, D., Kumar, R., and Tomkins, A., (2005) "Discovering large dense subgraphs in massive graphs", Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway.
[22] Gurmeet S. Manku, Arvind Jain, and Anish D. Sarma, (2007) "Detecting near-duplicates for web crawling", Proceedings of the 16th International Conference on World Wide Web, pp. 141-150.
[23] Hoad, T. C., and Zobel, J., (2003) "Methods for identifying versioned and plagiarized documents", JASIST, vol. 54, no. 3, pp. 203-215.
[24] Henzinger, M., (2006) "Finding near-duplicate web pages: a large-scale evaluation of algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 284-291.
[25] Kosala, R., and Blockeel, H., (2000) "Web Mining Research: A Survey", SIGKDD Explorations, vol. 2, issue 1.
[26] Lovins, J. B., (1968) "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics, vol. 11, pp. 22-31.
[27] Lim, E. P., and Sun, A., (2006) "Web Mining - The Ontology Approach", International Advanced Digital Library Conference (IADLC'2005), Nagoya University, Nagoya, Japan.
[28] Menczer, F., Pant, G., Srinivasan, P., and Ruiz, M. E., (2001) "Evaluating topic-driven web crawlers", In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 241-249.
[29] Metzler, D., Bernstein, Y., and Bruce Croft, W., (2005) "Similarity Measures for Tracking Information Flow", Proceedings of the Fourteenth International Conference on Information and Knowledge Management, CIKM '05, October 31 - November 5, Bremen, Germany.
[30] Pazzani, M., Nguyen, L., and Mantik, S., (1995) "Learning from hotlists and coldlists: Towards a WWW information filtering and seeking agent", Proceedings, Seventh International Conference on Tools with Artificial Intelligence, pp. 492-495.
[31] Pant, G., Srinivasan, P., and Menczer, F., (2004) "Crawling the Web", in Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, Springer-Verlag, pp. 153-178, November.
[32] Pandey, S., and Olston, C., (2005) "User-centric Web crawling", Proceedings of the 14th International Conference on World Wide Web, pp. 401-411.
[33] Shoaib, M., and Arshad, S., "Design & Implementation of web information gathering system", NITS e-Proceedings, Internet Technology and Applications, King Saud University, pp. 142-1.
[34] Singh, A., Srivatsa, M., Liu, L., and Miller, T., (2003) "Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web", Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval, Lecture Notes in Computer Science, Vol. 2924, pp. 126-142.
[35] Spertus, E., Sahami, M., and Buyukkokten, O., (2005) "Evaluating similarity measures: a large-scale study in the Orkut social network", In KDD 2005, pp. 678-684.
[36] Xiao, C., Wang, W., Lin, X., and Xu Yu, J., (2008) "Efficient Similarity Joins for Near Duplicate Detection", Proceedings of the 17th International Conference on World Wide Web, pp. 131-140.
[37] Yang, H., and Callan, J., (2005) "Near-duplicate detection for eRulemaking", Proceedings of the 2005 National Conference on Digital Government Research, pp. 78-86.
[38] Yang, H., Callan, J., and Shulman, S., (2006) "Next steps in near-duplicate detection for eRulemaking", Proceedings of the 2006 International Conference on Digital Government Research, Vol. 151, pp. 239-248.
