Abstract

Web crawlers collect Web content on the Internet and index it to be retrieved when demanded by a user query. To provide useful Web search services, we need a quick crawler that can collect and index, in an efficient way, the massive amount of content that continually accumulates. We introduce a new Web crawler that collects Web content suitable for viewing on mobile terminals such as PDAs or cell phones. Moreover, we describe the "Mobile Search Service" that provides content suitable for mobile terminals. As of the beginning of 2006, the service offers tens of millions of mobile content entries. In this paper we present the system architecture of our crawler and its performance in actual service.

numbers of Web pages. Two keys to this issue are a search engine that can access the huge indexes for retrieval, and a crawler that can collect and index huge numbers of Web pages.

While search engines need no special feature for indexing mobile-terminal-specific pages, crawlers must have some special functions for efficiently collecting these pages; conventional crawlers fail to meet our performance requirements. We significantly modified a conventional crawler to create a special crawler for mobile terminal use. In this paper, we present the concept of "mobile goo" in Section 2, the detailed functions and architecture of the proposed crawler in Section 3, its performance in actual service in Section 4, and the practical issues of the collecting operation in Section 5.
[Figure: service overview, showing the USER INTERFACE, FILTER, CRAWLER, and SEARCH ENGINE SERVER, with flows of web pages, mobile contents, the search index, and search results]

Besides these requirements, the following features are also required in order to guarantee the service level for commercial use and to minimize the costs of server machines.
3. Architecture of CRAWLER
subsequent collection. SEARCH ENGINE meets this requirement through the use of its real-time indexing feature: the search index is updated while crawling is in progress. This means that CRAWLER itself does not manage the collected pages. The procedure is simple, as follows.

1. Identify mobile content in the collected Web pages.

2. Insert or update mobile content on SEARCH ENGINE, while dropping all other pages.
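The two steps above can be read as a small per-page routine. The following is a minimal sketch of that routine; the SearchEngineClient interface and the mobile-content test are hypothetical placeholders, since neither the actual SEARCH ENGINE interface nor the identification criteria are reproduced here.

```python
# Minimal sketch of the per-page procedure above. SearchEngineClient and
# looks_like_mobile_content() are hypothetical stand-ins, not the actual
# interfaces or criteria used by the system.

class SearchEngineClient:
    """Stand-in for SEARCH ENGINE's real-time indexing interface."""
    def insert_or_update(self, url: str, html: str) -> None:
        # In the real system the search index is updated while crawling
        # is still in progress; CRAWLER keeps no copy of the page.
        pass

def looks_like_mobile_content(html: str) -> bool:
    # Assumed heuristic: look for markup typical of mobile pages.
    markers = ("<wml", "xhtml basic", "chtml")
    head = html[:1024].lower()
    return any(m in head for m in markers)

def process_page(url: str, html: str, engine: SearchEngineClient) -> None:
    if looks_like_mobile_content(html):      # step 1: identify mobile content
        engine.insert_or_update(url, html)   # step 2: index it on SEARCH ENGINE
    # non-mobile pages are simply dropped
```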
Figure 3. Management of Collection History
3.2.2 Managing the Collection History
Collecting specific pages repeatedly within a short interval is not meaningful for a search service and may even cause poor efficiency. To guard against this problem, the URLs and crawled dates of all collections should be stored in the history db, which CRAWLER consults to prevent duplicate collections. Since we must collect massive numbers of pages, the history db is also massive, so a large-volume database is needed.

With the crawlers used in ordinary search services, the role of the history db can be filled by the SEARCH ENGINE, because the URLs and crawled dates can be stored in the search index when Web pages are indexed. That allows CRAWLER to refer to the URLs and crawled dates and so realize efficient crawling. In the case of mobile search services, however, SEARCH ENGINE cannot play this role: since it stores only mobile content, CRAWLER cannot refer to the collection history of non-mobile content, which is necessary to prevent its duplicative collection.

For this reason, we concluded that CRAWLER should have its own history db for management of the collection history (Figure 3). Each crawling server implements
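As an illustration of this per-crawler history db, here is a minimal sketch in which the db maps each URL to its last fetch date and is consulted before every fetch. The storage engine (SQLite here) and the revisit interval are assumptions for the sketch; the actual database product and interval are not specified in this excerpt.

```python
# Sketch of a history db that records URL and fetch date and is checked
# before each collection to avoid fetching the same page again too soon.

import sqlite3
from datetime import datetime, timedelta

RECRAWL_INTERVAL = timedelta(days=7)   # assumed revisit interval, not from the paper

class HistoryDB:
    def __init__(self, path="history.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS history (url TEXT PRIMARY KEY, fetched_at TEXT)")

    def should_fetch(self, url: str) -> bool:
        # Skip URLs collected too recently, whether or not they were mobile content.
        row = self.conn.execute(
            "SELECT fetched_at FROM history WHERE url = ?", (url,)).fetchone()
        if row is None:
            return True
        last = datetime.fromisoformat(row[0])
        return datetime.now() - last >= RECRAWL_INTERVAL

    def record_fetch(self, url: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO history (url, fetched_at) VALUES (?, ?)",
            (url, datetime.now().isoformat()))
        self.conn.commit()
```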
The time required for downloading a Web page is determined by external factors such as the network bandwidth between the crawler and the Web server and the performance of the Web server itself. Such factors cannot be controlled by crawlers. Accordingly, in order to achieve high-speed collection of Web pages, multiple pages should be collected at the same time. We call this approach increasing the multiplicity of collection. Figure 4 shows the system architecture of a crawler that offers higher multiplicity.

comm-board: Establishes a connection to a Web server via the HTTP protocol and fetches Web pages. In this paper we place the following restriction on the connection manner in order to clarify the multiplicity: each comm-board can establish only one HTTP session, so multiple HTTP sessions require the same number of comm-boards.
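The following is a minimal sketch of this one-session-per-comm-board model, treating each comm-board as a worker that fetches one URL at a time; the class and function names, the thread-based execution, and the round-robin partitioning are illustrative assumptions, not the actual implementation.

```python
# Sketch of collection with multiplicity N: N comm-boards, each holding
# at most one HTTP session at any moment.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

class CommBoard:
    """One comm-board holds at most one HTTP session at a time."""
    def fetch(self, url: str) -> bytes:
        with urlopen(url, timeout=30) as resp:   # single HTTP session on this board
            return resp.read()

def collect(urls, multiplicity=4):
    # Partition the URLs round-robin over the comm-boards; each board then
    # works through its own queue sequentially, so the number of concurrent
    # HTTP sessions never exceeds the number of comm-boards.
    boards = [CommBoard() for _ in range(multiplicity)]
    queues = [urls[i::multiplicity] for i in range(multiplicity)]

    def run_board(board, queue):
        return [board.fetch(u) for u in queue]

    with ThreadPoolExecutor(max_workers=multiplicity) as pool:
        batches = list(pool.map(run_board, boards, queues))
    return [page for batch in batches for page in batch]
```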
[3] SEARCH ENGINE was unchanged but we increased the number of
[4] history db is implemented using the same database program as
Figure 5. Web servers with Virtual host / Mirror server

Figure 6. Semi-static distribution method (URL, site-info, Master URL Distributor, site-info db)

site-info db is the database of site-info records, and allows any record to be retrieved by domain name or IP address.
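A compact sketch of such a site-info db is given below: one record per site, retrievable either by domain name or by IP address. The record fields shown (domain, IP address, assigned comm-board) follow this section's description; the actual storage engine is not specified here.

```python
# Sketch of the site-info db: records can be looked up by domain or by IP.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SiteInfo:
    domain: str
    ip_address: str
    comm_board: Optional[int] = None   # left empty until a comm-board is assigned

class SiteInfoDB:
    def __init__(self):
        self._by_domain = {}
        self._by_ip = {}

    def put(self, info: SiteInfo) -> None:
        self._by_domain[info.domain] = info
        self._by_ip[info.ip_address] = info

    def get(self, key: str) -> Optional[SiteInfo]:
        # A record can be retrieved by either domain name or IP address.
        return self._by_domain.get(key) or self._by_ip.get(key)
```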
3. If site-info was not retrieved, get the IP address and complete the site-info record. The record is then sent to the Master URL Distributor, leaving the field for the comm-board identifier empty.

4. When the Master URL Distributor gets unknown site-info [5], the URL is assigned to a comm-board following the rule below and added to the site-info db.
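A sketch of steps 3 and 4 follows, reusing the SiteInfo record from the earlier sketch. The lookup, IP resolution, and distributor interfaces are hypothetical, and the actual comm-board assignment rule ("the rule below") is not reproduced in this excerpt, so it appears only as a clearly labeled placeholder.

```python
# Sketch of steps 3 and 4 above. The assignment rule used here is a
# placeholder; the real rule is described later in the paper.

import socket
from dataclasses import dataclass
from typing import Optional

@dataclass
class SiteInfo:
    domain: str
    ip_address: str
    comm_board: Optional[int] = None

class MasterURLDistributor:
    def __init__(self, site_info_db, num_boards: int):
        self.db = site_info_db
        self.num_boards = num_boards

    def handle_unknown_site(self, info: SiteInfo) -> int:
        # Step 4: assign the site to a comm-board and record it in site-info db.
        # Placeholder rule only; not the rule used by the actual system.
        info.comm_board = hash(info.ip_address) % self.num_boards
        self.db.put(info)
        return info.comm_board

def lookup_or_register(url: str, db, distributor: MasterURLDistributor) -> SiteInfo:
    domain = url.split("/")[2]          # assumes "http://host/path" style URLs
    info = db.get(domain)
    if info is None:
        # Step 3: site-info was not retrieved, so resolve the IP address,
        # complete the record, and send it to the Master URL Distributor
        # with the comm-board field still empty.
        ip = socket.gethostbyname(domain)
        info = SiteInfo(domain=domain, ip_address=ip)
        distributor.handle_unknown_site(info)
    return info
```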
3.5.1 Expanding HTML into Link Structure

The HTML analyzer was developed based on our existing HTML link analyzer. The system expands the HTML of each Web page into a list structure (Figure 7). The expanded lists enable flexible content analysis.
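As an illustration of expanding an HTML page into a flat list for analysis, here is a small sketch in the spirit of Figure 7; the actual list format used by the HTML analyzer is not given in this excerpt, so the representation below is an assumption.

```python
# Sketch: flatten HTML into a list of (kind, value) items, keeping tags,
# link targets, and text so that links and content can be analyzed flexibly.

from html.parser import HTMLParser

class ListExpander(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        self.items.append(("tag", tag))
        for name, value in attrs:
            if name == "href":                  # keep link targets for link analysis
                self.items.append(("link", value))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("text", text))

def expand(html: str):
    parser = ListExpander()
    parser.feed(html)
    return parser.items
```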
[Figure: crawling performance in actual service, plotting pages/hour, num of get, num of index, and efficiency (%) against elapsed time (hours)]