Professional Documents
Culture Documents
net/publication/237013286
CITATION READS
1 5,849
5 authors, including:
Bidisha Roy
University of Mumbai
21 PUBLICATIONS 119 CITATIONS
SEE PROFILE
All content following this page was uploaded by Bidisha Roy on 03 June 2014.
ABSTRACT who have the queried name. The task of disambiguating and
Web People Search (WePS) is a desktop application which finding the web pages related to the specific person of interest
aims to find relevant pages from the web related to a person’s is left to the user. The approach is based on extracting named
name. The project work is targeted to design an advanced entities from the web pages and then querying the web to
version of the deep extraction tool using Clustering collecting co-occurrence statistics, which are used as
Algorithm. In this research work, the focus is mainly on additional similarity measures.
querying for personal information of scientists and One particular case of this people-document association
researchers. The user has to set the proper target name for task is referred to as personal name resolution. The task is as
search, which when completed, the user will receive complete follows: given a set of documents all of which refer to a
PDF files based on the search. Each group of information particular person name but not necessarily a single individual
items (cluster) will be defined by its key and the user make (usually called referent), identify which documents are
the choice. The result page will be produced from the chosen associated with each referent by that name. Different methods
clusters. For making the search operationally accurate, we will have been used to represent documents that mention a
assume the usage of research and conference doc files as they candidate, including snippets, text around the person name,
carry a standard format of name, e-mail ID, publication, entire documents, extracted phrases, etc.
images, and links to the full images.
1.1 Existing System and its effect
General Terms Searching for information on the Web is not an easy task.
Searching for personal information is sometimes even more
API (Web service platform), PDF files, Java Swing, Net complicated. Below are several common problems we face
beans. when trying to get personal details from the web:
Keywords Majority of the Information is distributed between
different sites.
Clustering, Web People Search, WePS, Named Entity, Web It is not updated.
Querying, Crawling, Indexing. Multi-Referent ambiguity – two or more people
1. INTRODUCTION with the same name.
World Wide Web has more and more online Web databases Multi-morphic ambiguity which is because one
which can be searched through their Web query interfaces. name may be referred to in different forms.
The number of Web databases has reached 25millions In the most popular search engine Google, one can
according to a recent survey. All the Web databases make up set the target name and based on the extremely
the deep Web (hidden Web or invisible Web). Often the limited facilities to narrow down the search, still the
retrieved information (query results) is enwrapped in Web user has 100% feasibility of receiving irrelevant
pages in the form of data records. These special Web pages information in the output search hits. Not only this,
are generated dynamically and are hard to index by traditional the user has to
crawler based search engines, such as Google and Yahoo. manually see, open, and then download their respective file
Each data record on the deep Web pages corresponds to an which is extremely time consuming. The major reason behind
object. In order to ease the consumption by human users, this is that there is no uniform format for personal
most Web databases display data records and data items information.
regularly on Web browsers. Maximum of the past work is based on exploiting the link
Searching for people on the Web is one of the most common structure of the pages on the web, with hypothesis that web
query types to the web search engines today. Internet users pages belonging to the same person are more likely to be
access billions of web pages online using search engines. linked together.
However, when a person name is queried, the returned result
often contains web pages related to several distinct namesakes
28
International Conference & Workshop on Recent Trends in Technology, (TCET) 2012
Proceedings published in International Journal of Computer Applications® (IJCA)
1.2 Motivation
29
International Conference & Workshop on Recent Trends in Technology, (TCET) 2012
Proceedings published in International Journal of Computer Applications® (IJCA)
2. SYSTEM OVERVIEW
3. Clustering: The filtered web pages are then
clustered and saved.
The clustering part consists of the following
stages:
TF/IDF Similarity: TF/IDF similarity on
NEs only is computed.
Querying the Web: For each pair of web
pages di and dj several co-occurrence
Pre-processing queries are formed and issued to a Web
search engine.
Applying Clustering: A clustering
algorithm, such as nearest neighbor
algorithm are used and takes into account
the TF/IDF similarity.
30
International Conference & Workshop on Recent Trends in Technology, (TCET) 2012
Proceedings published in International Journal of Computer Applications® (IJCA)
3. FLOW CHART
31
International Conference & Workshop on Recent Trends in Technology, (TCET) 2012
Proceedings published in International Journal of Computer Applications® (IJCA)
4. ALGORITHM AND WORKING types of filtering processes (a) Remove common English
words (b) Remove Location names (c) Remove single named
A. MINI-BATCH K- MEANS entities (d) Remove only last names.
The k-means optimization problem is to find the set C of
C. INDEXING
The web pages are indexed based on the ranks given by the
cluster centers c Rm, with |C| = k, to minimize over a set X
search engine.
of examples x Rm the following objective function:
5. ACKNOWLEDGMENT
We sincerely thank the faculty of Computer Engineering
Dept. at St. Francis Institute of Technology for giving us the
chance as well as the support to write this paper. Without their
Here, f(C, x) returns the nearest cluster center c ∈ C to x using
active involvement and the right guidance this would not have
Euclidean distance. It is well known that although this been possible.
problem is NP-hard in general, gradient descent methods
converge to a local optimum when seeded with an initial set 6. REFERENCES
of k examples drawn uniformly at random from X. [1] Rabia Nuray-Turan, Zhaoqi Chen, Dmitri V.
Kalashnikov, Sharad Mehrotra, “Exploiting Web
querying for Web People Search in WePS2” : 18th
Given: k, mini-batch size b, iterations t, data set X WWW Conference, April 2009.
Initialize each c C with an x picked randomly from X
[2] Javier Artiles, Julio Gonzalo, Satoshi Sekine, “WePS 2
v ←0 Evaluation Campaign: Overview of the Web People
for i = 1 to t do Search Clustering Task”, Appeared in ACM SIGIR
M ← b examples picked randomly from X Conference, July 2008.
for x M do
d[x] ← f(C, x) // Cache the center nearest to x [3] Wahiba Ben Abdessalem Karaa, “Named Entity
end for Recognition Using web Document Corpus”, International
for x M do Journal of Managing Information Technology (IJMIT)
Vol.3, No.1, February 2011.
c ← d[x] // Get cached center for this x
v[c] ← v[c] + 1 // Update per-center counts [4] Wei Lai, Xiaodi Huang, Ronald Wibowo, and Jiro
η ← 1/v[c] // Get per-center learning rate Tanaka, “An On-Line Web Visualization System with
c ← (1 − η)c + ηx // Take gradient step Filtering and Clustering Graph Layout”, IEEE Intelligent
end for Informatics Bulletin (2005), Volume: 5.
end for
[5] Sculley,”Web-Scale K-Means Clustering”, published in
WWW '10 Proceedings of the 19th international
B. FILTERING conference on World Wide Web
To make the system more efficient we filter the web pages to
eliminate the unwanted or redundant web pages. We use four
32