Abstract

Web crawlers collect Web content on the Internet and index it to be retrieved when demanded by a user query. To provide useful Web search services, we need a quick crawler that can collect and index, in an efficient way, the massive amount of content that continually accumulates. We introduce a new Web crawler that collects Web content suitable for viewing on mobile terminals such as PDAs or cell phones. Moreover, we describe the "Mobile Search Service" that provides content suitable for mobile terminals. As of the beginning of 2006, the service offers tens of millions of mobile content entries. In this paper we present the system architecture of our crawler and its performance in actual service.

numbers of Web pages. Two keys to this issue are a search engine that can access the huge indexes for retrieval, and a crawler that can collect and index huge numbers of Web pages.

While search engines need no special feature for indexing mobile-terminal-specific pages, crawlers must have some special functions for efficiently collecting these pages; conventional crawlers fail to meet our performance requirements. We significantly modified a conventional crawler to create a special crawler for mobile terminal use. In this paper, we present the concept of "mobile goo" in Section 2, the detailed functions and architecture of the proposed crawler in Section 3, its performance in actual service in Section 4, and the practical issues of the collecting operation in Section 5.
[Figure: service overview, showing the USER INTERFACE, FILTER, CRAWLER, and SEARCH ENGINE SERVER, with flows of web pages, mobile contents, the search index, and search results]

Besides these requirements, the following features are also required in order to guarantee the service level for commercial use and to minimize the costs of server machines.
3. Architecture of CRAWLER
subsequent collection. SEARCH ENGINE meets this requirement through the use of its real-time indexing feature: the search index is updated while crawling is in progress. This means that CRAWLER itself does not manage the collected pages. The procedure is simple, as follows.

1. Identify mobile content in the collected Web pages.

2. Insert or update mobile content on SEARCH ENGINE, while dropping all other pages.
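The two steps above can be read as a small per-page routine. The following is a minimal sketch of that routine; the SearchEngineClient interface and the mobile-content test are hypothetical placeholders, since neither the actual SEARCH ENGINE interface nor the identification criteria are reproduced here.

```python
# Minimal sketch of the per-page procedure above. SearchEngineClient and
# looks_like_mobile_content() are hypothetical stand-ins, not the actual
# interfaces or criteria used by the system.

class SearchEngineClient:
    """Stand-in for SEARCH ENGINE's real-time indexing interface."""
    def insert_or_update(self, url: str, html: str) -> None:
        # In the real system the search index is updated while crawling
        # is still in progress; CRAWLER keeps no copy of the page.
        pass

def looks_like_mobile_content(html: str) -> bool:
    # Assumed heuristic: look for markup typical of mobile pages.
    markers = ("<wml", "xhtml basic", "chtml")
    head = html[:1024].lower()
    return any(m in head for m in markers)

def process_page(url: str, html: str, engine: SearchEngineClient) -> None:
    if looks_like_mobile_content(html):      # step 1: identify mobile content
        engine.insert_or_update(url, html)   # step 2: index it on SEARCH ENGINE
    # non-mobile pages are simply dropped
```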
Figure 3. Management of Collection History
3.2.2 Managing the Collection History
Collecting specific pages repeatedly within a short interval is not meaningful for a search service and may even cause poor efficiency. To guard against this problem, the URLs and crawled dates of all collections should be stored in the history db, which CRAWLER consults to prevent duplicate collections. Since we must collect massive numbers of pages, the history db is also massive, so a large-volume database is needed.

With the crawlers used in ordinary search services, the role of the history db can be filled by the SEARCH ENGINE, because the URLs and crawled dates can be stored in the search index when Web pages are indexed. That allows CRAWLER to refer to the URLs and crawled dates and so realize efficient crawling. In the case of mobile search services, however, SEARCH ENGINE cannot play this role: since it stores only mobile content, CRAWLER cannot refer to the collection history of non-mobile content, which is necessary to prevent its duplicative collection.

For this reason, we concluded that CRAWLER should have its own history db for management of the collection history (Figure 3). Each crawling server implements
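As an illustration of this per-crawler history db, here is a minimal sketch in which the db maps each URL to its last fetch date and is consulted before every fetch. The storage engine (SQLite here) and the revisit interval are assumptions for the sketch; the actual database product and interval are not specified in this excerpt.

```python
# Sketch of a history db that records URL and fetch date and is checked
# before each collection to avoid fetching the same page again too soon.

import sqlite3
from datetime import datetime, timedelta

RECRAWL_INTERVAL = timedelta(days=7)   # assumed revisit interval, not from the paper

class HistoryDB:
    def __init__(self, path="history.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS history (url TEXT PRIMARY KEY, fetched_at TEXT)")

    def should_fetch(self, url: str) -> bool:
        # Skip URLs collected too recently, whether or not they were mobile content.
        row = self.conn.execute(
            "SELECT fetched_at FROM history WHERE url = ?", (url,)).fetchone()
        if row is None:
            return True
        last = datetime.fromisoformat(row[0])
        return datetime.now() - last >= RECRAWL_INTERVAL

    def record_fetch(self, url: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO history (url, fetched_at) VALUES (?, ?)",
            (url, datetime.now().isoformat()))
        self.conn.commit()
```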
The time required for downloading a Web page is determined by external factors such as the network bandwidth between the crawler and the Web server and the performance of the Web server itself. Such factors cannot be controlled by crawlers. Accordingly, in order to achieve high-speed collection of Web pages, multiple pages should be collected at the same time. We call this approach increasing the multiplicity of collection. Figure 4 shows the system architecture of a crawler that offers higher multiplicity.

comm-board: Establishes a connection to a Web server via the HTTP protocol and fetches Web pages. In this paper we place the following restriction on the connection manner in order to clarify the multiplicity: each comm-board can establish only one HTTP session, so multiple HTTP sessions require the same number of comm-boards.
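The following is a minimal sketch of this one-session-per-comm-board model, treating each comm-board as a worker that fetches one URL at a time; the class and function names, the thread-based execution, and the round-robin partitioning are illustrative assumptions, not the actual implementation.

```python
# Sketch of collection with multiplicity N: N comm-boards, each holding
# at most one HTTP session at any moment.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

class CommBoard:
    """One comm-board holds at most one HTTP session at a time."""
    def fetch(self, url: str) -> bytes:
        with urlopen(url, timeout=30) as resp:   # single HTTP session on this board
            return resp.read()

def collect(urls, multiplicity=4):
    # Partition the URLs round-robin over the comm-boards; each board then
    # works through its own queue sequentially, so the number of concurrent
    # HTTP sessions never exceeds the number of comm-boards.
    boards = [CommBoard() for _ in range(multiplicity)]
    queues = [urls[i::multiplicity] for i in range(multiplicity)]

    def run_board(board, queue):
        return [board.fetch(u) for u in queue]

    with ThreadPoolExecutor(max_workers=multiplicity) as pool:
        batches = list(pool.map(run_board, boards, queues))
    return [page for batch in batches for page in batch]
```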
[3] SEARCH ENGINE was unchanged but we increased the number of
[4] history db is implemented using the same database program as
Figure 5. Web servers with Virtual host / Mirror server

Figure 6. Semi-static distribution method (URL, site-info, Master URL Distributor, site-info db)

site-info db is the database of site-info records, and allows any record to be retrieved by domain name or IP address.
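A compact sketch of such a site-info db is given below: one record per site, retrievable either by domain name or by IP address. The record fields shown (domain, IP address, assigned comm-board) follow this section's description; the actual storage engine is not specified here.

```python
# Sketch of the site-info db: records can be looked up by domain or by IP.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SiteInfo:
    domain: str
    ip_address: str
    comm_board: Optional[int] = None   # left empty until a comm-board is assigned

class SiteInfoDB:
    def __init__(self):
        self._by_domain = {}
        self._by_ip = {}

    def put(self, info: SiteInfo) -> None:
        self._by_domain[info.domain] = info
        self._by_ip[info.ip_address] = info

    def get(self, key: str) -> Optional[SiteInfo]:
        # A record can be retrieved by either domain name or IP address.
        return self._by_domain.get(key) or self._by_ip.get(key)
```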
3. If site-info was not retrieved, get the IP address and complete the site-info record. The record is then sent to the Master URL Distributor, leaving the field for the comm-board identifier empty.

4. When the Master URL Distributor gets unknown site-info [5], the URL is assigned to a comm-board following the rule below and added to the site-info db.
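A sketch of steps 3 and 4 follows, reusing the SiteInfo record from the earlier sketch. The lookup, IP resolution, and distributor interfaces are hypothetical, and the actual comm-board assignment rule ("the rule below") is not reproduced in this excerpt, so it appears only as a clearly labeled placeholder.

```python
# Sketch of steps 3 and 4 above. The assignment rule used here is a
# placeholder; the real rule is described later in the paper.

import socket
from dataclasses import dataclass
from typing import Optional

@dataclass
class SiteInfo:
    domain: str
    ip_address: str
    comm_board: Optional[int] = None

class MasterURLDistributor:
    def __init__(self, site_info_db, num_boards: int):
        self.db = site_info_db
        self.num_boards = num_boards

    def handle_unknown_site(self, info: SiteInfo) -> int:
        # Step 4: assign the site to a comm-board and record it in site-info db.
        # Placeholder rule only; not the rule used by the actual system.
        info.comm_board = hash(info.ip_address) % self.num_boards
        self.db.put(info)
        return info.comm_board

def lookup_or_register(url: str, db, distributor: MasterURLDistributor) -> SiteInfo:
    domain = url.split("/")[2]          # assumes "http://host/path" style URLs
    info = db.get(domain)
    if info is None:
        # Step 3: site-info was not retrieved, so resolve the IP address,
        # complete the record, and send it to the Master URL Distributor
        # with the comm-board field still empty.
        ip = socket.gethostbyname(domain)
        info = SiteInfo(domain=domain, ip_address=ip)
        distributor.handle_unknown_site(info)
    return info
```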
3.5.1 Expanding HTML into Link Structure

The HTML analyzer was developed based on our existing HTML link analyzer. The system expands the HTML of each Web page into a list structure (Figure 7). The expanded lists enable flexible content analysis.
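As an illustration of expanding an HTML page into a flat list for analysis, here is a small sketch in the spirit of Figure 7; the actual list format used by the HTML analyzer is not given in this excerpt, so the representation below is an assumption.

```python
# Sketch: flatten HTML into a list of (kind, value) items, keeping tags,
# link targets, and text so that links and content can be analyzed flexibly.

from html.parser import HTMLParser

class ListExpander(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        self.items.append(("tag", tag))
        for name, value in attrs:
            if name == "href":                  # keep link targets for link analysis
                self.items.append(("link", value))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("text", text))

def expand(html: str):
    parser = ListExpander()
    parser.feed(html)
    return parser.items
```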
[Figure: crawling performance in actual service, plotting pages/hour, num of get, num of index, and efficiency (%) against elapsed time (hours)]