Journal of Telecommunications, ISSN 2042-8839, Volume 18, Issue 1, January 2013
Published by: Journal of Telecommunications on Jan 31, 2013
Copyright: Attribution Non-commercial
© 2012 JOT, www.journaloftelecommunications.co.uk
Comparison of existing open-source tools for Web crawling and indexing of free Music
André Ricardo and Carlos Serrão
— This paper presents a portrait of existing open-source web crawler tools that also include an indexing component. The goal is to understand which tool is best suited to crawl and index a large collection of MP3 music files freely available on the Internet. In this study each piece of software is briefly described, with an overview, identification of some of its users, and its main advantages and disadvantages. In order to better understand the most significant differences between the tools, a summary of features is presented: the programming language in which they are written, the platform used for deployment, the type of index used, database integration, front-end capabilities, the existence of a plugin system, and MP3 and Adobe Flash (SWF files) parsing support. Finally, the tools were classified according to the prospected collection size, being divided into tools to mirror small collections, medium collections, and large collections with software capable of handling large amounts of data. In conclusion, an assessment is made of which tools are best suited to handle large collections in a distributed way.
Index Terms
— Content Analysis and Indexing, Information Storage and Retrieval, Information Filtering, Retrieval Process, Selection Process, Open Source, Creative Commons, Music, MP3.
1 INTRODUCTION
The objective of this paper is to identify and study the tools that can be used to create an index similar to the ones used by existing commercial music recommendation systems, but with the purpose of indexing all freely available music on the Internet. The paper is primarily focused on the discovery and indexation of free music over the Internet, as a way to create a huge distributed database with the capability of offering meta-information and recommendation systems.
In the first section of this paper, an overview is provided of all the existing open-source tools that can be used for the purpose of indexing and crawling on the Internet (Table 1). Also in this section, data is presented about the most important characteristics of such tools, such as the programming language in which they were developed, the type of index created, database integration, front-end, plugin structure, and MP3 and Flash parsing support.
Concluding the analysis, for each tool the most relevant advantages and drawbacks are stated, followed by an overview of how adequate the tool is to solve the problem addressed by this work.
Finally, some conclusions and future work are presented, taking into account the major objective of this work: the ability to develop an open and free music recommendation system.
2 TOOLS
This section presents a summary of the different characteristics of each of the tools that were considered, condensed into a set of tables to facilitate the comparison process. First, each tool under analysis is introduced with a short description, stating the most notable users operating each piece of software, followed by an overview of its advantages and drawbacks. Considering all the software tools under analysis, Table 1 states the programming language used for their development (language), the platforms on which they run (platform), whether some type of indexing is done by the web crawling tools (index), and finally possible connections to databases (database).
2.1 ASPSeek
The ASPseek tool (http://www.aspseek.org/) consists of an indexing robot, a search daemon, and a CGI search front-end. It is an outdated tool, and its applicability in this scenario is not a reliable option. The major advantage of this tool is that it supports external parsers. However, as referred to before, the tool is outdated and cannot scale for global web crawling, since it is based on a relational database.
2.2 Bixo
Bixo (http://openbixo.org) is a web-mining toolkit that runs as a series of cascading pipes on top of Hadoop (mostly used by companies/services such as Bebo, EMI Music, ShareThis and Bixo Labs). Bixo might be very interesting to projects looking for a web-mining framework that can be integrated with existing information systems, for example to inject data into a data-warehouse system. Based on the Cascading API running on a Hadoop cluster, Bixo is suitable to crawl large collections. A project that needs to handle large collections and to feed data into existing systems should take a close look at Bixo.
Bixo's major advantages are its orientation to data mining and its capability to support large sets of data, as it was tested with the Public Terabyte Dataset [19][18]. The major drawback of Bixo is its limited built-in support to create an index.

André Ricardo is with ISCTE Instituto Universitário de Lisboa (ISCTE-IUL), Av. das Forças Armadas, 1649-026 Lisboa, Portugal.
Carlos Serrão is with the ISCTE Instituto Universitário de Lisboa (ISCTE-IUL), IUL School of Technology and Architecture, Department of Information Science and Technology (ISTA/DCTI), Av. das Forças Armadas, 1649-026 Lisboa, Portugal.
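The cascading-pipes model can be illustrated with a minimal sketch, using plain Python generators as stand-ins for Cascading pipes on a Hadoop cluster (the record fields and function names here are illustrative, not Bixo's actual API):

```python
# Conceptual sketch of a pipe assembly: each stage consumes a stream of
# records and emits a transformed stream, like pipes in a Cascading flow.
def fetch(urls):
    # Stand-in for a fetch pipe: pretend each URL yields a page record.
    for url in urls:
        ctype = "audio/mpeg" if url.endswith(".mp3") else "text/html"
        yield {"url": url, "content_type": ctype}

def filter_mp3(records):
    # Filter pipe: keep only records that look like MP3 files.
    return (r for r in records if r["content_type"] == "audio/mpeg")

def sink(records):
    # Sink pipe: collect results, e.g. for loading into a data warehouse.
    return [r["url"] for r in records]

urls = ["http://example.com/index.html", "http://example.com/song.mp3"]
print(sink(filter_mp3(fetch(urls))))  # ['http://example.com/song.mp3']
```

In Bixo the equivalent stages run distributed over Hadoop; the sketch only conveys the shape of the dataflow.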
2.3 Crawler4J
Crawler4j (http://code.google.com/p/crawler4j/) is a Java tool which provides a programmable interface for crawling. It is a piece of source code to incorporate in a project, but there are more suitable tools to index content. Its main advantage is that it is easy to integrate into Java projects that need a crawling component. On the other hand, it offers no support for "robots.txt" nor for pages without UTF-8 encoding, and it is necessary to create the entire complementary framework for indexing.
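For contrast, the "robots.txt" check that a polite crawler must perform has standard library support in several languages; a minimal Python sketch (the user-agent name, site and rules are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a crawler would fetch from a site root.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks every candidate URL before fetching it.
print(rp.can_fetch("MusicBot", "http://example.com/music/track.mp3"))  # True
print(rp.can_fetch("MusicBot", "http://example.com/private/a.mp3"))    # False
```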
2.4 DataparkSearch
DataparkSearch (http://www.dataparksearch.org/) is a web crawler and search engine (used, for instance, by News Lookup). DataparkSearch is a tool that benefits from MP3 and Flash parsers but, unfortunately, due to lack of development, it still uses outdated technology like CGI and does not have a modular architecture, making it difficult to extend. The index is not in a format that could be used by other frameworks.
The major advantage of this tool is its support for MP3 and Flash parsing. On the other hand, it still uses outdated technology and its development seems to have stopped.
2.5 Ebot
Ebot (http://www.redaelli.org/matteo-blog/projects/ebot/) is a web crawler written on top of Erlang. There is no proof of concept that Ebot would scale well to index the desired collection. Because Erlang and CouchDB were used to solve the crawl and search problem, people keen on these technologies might find this tool attractive. Ebot is distributed and scalable [8]; however, there is only one active developer in the project and there is no proven working system deployed.
2.6 GNU Wget
GNU Wget (http://www.gnu.org/software/wget/) is a non-interactive command-line tool to retrieve files over the most widely used Internet protocols. Wget is a really useful command-line tool to download a simple HTML website, but it does not offer indexing support; it is limited to the mirroring and downloading process. Its main advantage is that with simple commands it is easy to mirror an entire website or to explore the whole site structure. However, there is the need to create the entire indexing infrastructure, and it is primarily built for pages mainly working with HTML, with no Flash or Ajax support.
2.7 GRUB
GRUB (http://grub.org/) is a web crawler with distributed crawling. GRUB's distributed solution requires a proof of concept that it is suitable for a large-scale index. It also requires proof that distributed crawling is a better solution than centralized crawling.
GRUB tries a new approach to searching by distributing the crawling process. However, the documentation is incomplete and it was banned from Wikipedia for bad crawling behavior. According to the Nutch FAQ, distributed crawling may not be a good deal: while it saves bandwidth, in the long run this saving is not significant, since it requires more bandwidth to upload query result pages; "making the crawler use less bandwidth does not reduce overall bandwidth requirements. The dominant expense of operating a large search engine is not crawling, but searching". The project development looks to have halted, since there has been no news since 2009.
2.8 Heritrix
Heritrix (http://crawler.archive.org/) is an extensible, web-scale, archival-quality web crawler project (it is used in the Internet Archive and on "Arquivo da Web Portuguesa"). Heritrix is the piece of software used and written by The Internet Archive to make copies of the Internet. The disadvantage of Heritrix is the lack of indexing capabilities; the content is stored in ARC files [2]. It is a really good solution for archiving websites and making copies for future reference.
Heritrix is use-case proven by the Internet Archive and well adjusted to making copies of websites. However, it needs to process ARC files, and its architecture is more monolithic, not designed for adding parsers and extensibility.
2.9 ht://Dig
ht://Dig (http://www.htdig.org/) is a search engine and web crawler. ht://Dig is a search system oriented towards adding search to a website, for example a website already built in HTML that wants searching functionality. Until 2004, the date of the last release, it was one of the most popular web crawlers and search engines, enjoying a large user base with notable sites such as the GNU Project and the Mozilla Foundation; with no updates over time, it slowly lost most of its user base to newer solutions. Its development ceased in 2004.
2.10 HTTrack
HTTrack (http://www.httrack.com/) is a website mirror tool. HTTrack is designed to create mirrors of existing sites, not for indexing. It is a good tool for users unfamiliar with web crawling who enjoy a good GUI. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash [11]. However, HTTrack does not have integration with indexing systems.
2.11 Hyper Estraier
Hyper Estraier (http://fallabs.com/hyperestraier/index.html) is a full-text search engine system (used by the GNU Project). Hyper Estraier has characteristics like high-performance search and P2P support, making it an interesting solution to add search to an existing website. The GNU Project uses Hyper Estraier to search its large number of documents, making it a good solution for collections of approximately 8 thousand documents in size.
This tool is useful to add search functionality to a site, and it offers P2P support. However, it has only one core developer.
2.12 mnoGoSearch
mnoGoSearch (http://www.mnogosearch.org/) is a web search engine (one of the users of this tool is MySQL). mnoGoSearch is a solution for a small enterprise appliance to add search capability to an existing site or intranet. The project is a bit outdated and, due to the dependency on a specific vendor, other solutions should be considered.
One of its major advantages is that MySQL uses it. On the other hand, there is little information about scalability and extensibility, and it is extremely dependent on the vendor Lavtech for future development.
2.13 Nutch
Nutch (http://nutch.apache.org/) is a web search engine with a crawler, link-graph database, parsers and plugin system (it is used on sites such as Creative Commons and Wikia Search). Nutch is one of the most developed and active projects in the web crawling field. The need to scale and distribute Nutch led Doug Cutting, the project creator, to start developing Hadoop, a framework for reliable, scalable and distributed computing.
This means that not only is the project developing itself, but it also works with Hadoop, Lucene, Tika and Solr. The project is seeking to integrate other pieces of software, such as HBase, too [5]. Another strong point for Nutch is the existing deployed systems with published case studies [14] and [16].
The biggest drawback of Nutch is the configuration and tuning process, combined with the need to understand how the crawler works to get the desired results. For large-scale web crawling, Nutch is a stable and complete framework.
The major advantages of Nutch can be summarized in the following:
- Nutch has a highly modular architecture, allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering [12].
- Nutch works on top of the Hadoop framework, so it features cluster capabilities, distributed computation (using MapReduce) and a distributed filesystem (HDFS) if needed.
- Built with scalability and cost effectiveness in mind [6].
- Support to parse and index a diverse range of documents using Tika, a toolkit to detect and extract metadata.
- Integrated Creative Commons plugin.
- The ability to use other languages, such as Python, to script Nutch.
- There is an adaptation of Nutch called NutchWAX (Nutch Web Archive eXtensions), allowing Nutch to open the ARC files used by Heritrix.
- Top-level Apache project, with a high level of expertise and visibility around the project.
However, Nutch has some complexity, and the integrated MP3 parser, based on the deprecated "Java ID3 Tag Library", did not work when tested in Nutch.
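To give an idea of what MP3 metadata parsing for indexing involves, here is a minimal sketch that reads an ID3v1 tag, the 128-byte trailer many MP3 files carry. This is not Nutch's parser, and the sample bytes are synthetic:

```python
def read_id3v1(data: bytes):
    """Extract ID3v1 metadata from the last 128 bytes of an MP3 file.

    Returns None when no ID3v1 tag is present. Field layout per the
    ID3v1 spec: "TAG", then title(30), artist(30), album(30), year(4).
    """
    tag = data[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return None

    def field(start, length):
        # Fields are fixed-width, padded with NULs or spaces.
        return tag[start:start + length].rstrip(b"\x00 ").decode("latin-1")

    return {
        "title":  field(3, 30),
        "artist": field(33, 30),
        "album":  field(63, 30),
        "year":   field(93, 4),
    }

# Synthetic 128-byte tag for illustration, not a real file.
fake = (b"\x00" * 10 + b"TAG"
        + b"Free Song".ljust(30, b"\x00")
        + b"Some Artist".ljust(30, b"\x00")
        + b"An Album".ljust(30, b"\x00")
        + b"2013" + b"\x00" * 31)
print(read_id3v1(fake)["artist"])  # Some Artist
```

A production indexer would also need ID3v2 (a variable-length header at the start of the file), which is what dedicated tag libraries handle.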
2.14 Open Search Server
Open Search Server (http://www.open-search-server.com/) is a search engine with support for business clients. Open Search Server is a good solution for small appliances. Unfortunately, it is not well documented in terms of how extensible it is.
This tool is quite easy to implement and set running. However, it is dependent on the commercial component for development, has a small community and scarce documentation, has some problems handling special characters, and there is little information on extending the software.
2.15 OpenWebSpider
OpenWebSpider (http://www.openwebspider.org/) is a web spider for the .NET platform. This is an interesting project, based on the .NET framework and C# programming, for those intending to build a small to medium-sized data collection. It supports MP3 indexing and offers crawling and database integration.
However, it has only one developer; the source is disclosed, but since no one else is working on the project and there is no source code repository, it does not behave as a real open-source project. The Mono framework might constitute a problem for those concerned with patent issues, there is no proof of concept, and using a relational database might not scale well.
2.16 Pavuk
Pavuk (http://pavuk.sourceforge.net/) is a web crawler. Pavuk is a complement to tools like Wget, yet it does not offer indexing functionality. Its main advantage is that it complements solutions like Wget and HTTrack with filters based on regular expressions and similar functions. However, development has stopped since 2007 and it has no indexing features.
2.17 Sphider
Sphider (http://www.sphider.eu/) is a PHP search engine. Sphider is a complete solution, with crawler and web search, that can run on a server with just PHP and MySQL. It might be a good solution, with few requirements, to add integrated search functionality to existing web appliances.
It is easy to set up and integrate into an existing solution. However, the index is a relational database and might not scale well to millions of documents.
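The scalability concern comes from storing the index as relational rows; dedicated search engines (e.g. the Lucene-based tools above) instead build an inverted index mapping each term to the documents containing it. A minimal sketch of that structure, with illustrative documents rather than Sphider's actual schema:

```python
from collections import defaultdict

# Toy document collection: id -> text.
docs = {
    1: "free creative commons music",
    2: "free mp3 download",
    3: "commercial music store",
}

# Inverted index: term -> set of document ids containing the term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A conjunctive query is just a set intersection over posting lists.
result = index["free"] & index["music"]
print(sorted(result))  # [1]
```

Lookups touch only the posting lists of the query terms, which is what lets this structure scale to millions of documents where per-row relational scans struggle.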