JOURNAL OF TELECOMMUNICATIONS, VOLUME 18, ISSUE 1, JANUARY 2013
© 2012 JOT www.journaloftelecommunications.co.uk
Comparison of existing open-source tools for Web crawling and indexing of free Music
André Ricardo and Carlos Serrão
Abstract: This paper presents a survey of existing open-source web crawler tools that also have an indexing component. The goal is to understand which tool is best suited to crawl and index a large collection of MP3 music files freely available on the Internet. Each piece of software is briefly described, with an overview, an identification of some of its users, and its main advantages and disadvantages. To better understand the most significant differences between the tools, a summary of features is presented: the programming language in which each tool is written, the platform used for deployment, the type of index used, database integration, front-end capabilities, the existence of a plugin system, and MP3 and Adobe Flash (SWF file) parsing support. Finally, the tools are classified according to the expected collection size, being divided into tools to mirror small collections, tools for medium collections, and tools for large collections, with software capable of handling large amounts of data. In conclusion, an assessment is made of which tools are best suited to handle large collections in a distributed way.
Index Terms: Content Analysis and Indexing, Information Storage and Retrieval, Information Filtering, Retrieval Process, Selection Process, Open Source, Creative Commons, Music, MP3.
The objective of this paper is to identify and study the tools that can be used to create an index similar to the ones used by existing commercial music recommendation systems, but with the purpose of indexing all freely available music on the Internet. The paper is primarily focused on the discovery and indexing of free music over the Internet as a way to create a huge distributed database capable of supporting meta-information and recommendation systems.
The first section of this paper provides an overview of the existing tools that can be used for crawling and indexing on the Internet, considering only software projects that are open source (Table 1). This section also presents data about the most important characteristics of these tools, such as the programming language in which they were developed, the type of index created, database integration, front-end, plugin structure, and MP3 and Flash parsing support.
Concluding the analysis, the most relevant advantages and drawbacks of each tool are stated, followed by an overview of how adequate the tool is for solving the problem addressed by this work.
Finally, some conclusions and future work are presented, taking into account the major objective of this work: the ability to develop an open and free music recommendation system.
This section presents a summary of the different characteristics of each of the tools that were considered, summarized in a set of tables to facilitate the comparison of the tools. Each tool under analysis is first introduced with a short description that states the most notable users operating each piece of software, followed by an overview, advantages and drawbacks.
For all the software tools under analysis, Table 1 states the programming language used for their development (language), the platforms on which they run, whether the web crawling tool performs some type of indexing (index), and, finally, possible connections to databases (database).
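The comparison dimensions named for Table 1 can be pictured as one record per tool. The sketch below is only an illustration: the `ToolProfile` type, its field names, and the `with_database` helper are hypothetical, and the two rows are filled in solely with what the tool descriptions in this section state, leaving every other cell as `None` rather than guessing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToolProfile:
    """One row of a Table 1 style comparison; None means 'not stated here'."""
    name: str
    language: Optional[str] = None   # programming language of the tool
    platform: Optional[str] = None   # platform on which it runs
    index: Optional[str] = None      # type of index built by the crawler
    database: Optional[str] = None   # database integration, if any


# Only facts given in the tool descriptions are recorded (assumption: partial rows).
TOOLS: List[ToolProfile] = [
    ToolProfile(name="ASPseek", database="relational"),
    ToolProfile(name="Bixo", platform="Hadoop"),
]


def with_database(tools: List[ToolProfile]) -> List[str]:
    """Return the names of the tools for which a database back end is recorded."""
    return [t.name for t in tools if t.database is not None]
```

Keeping unknown cells explicit makes it harder to confuse "not supported" with "not documented" when the tools are later filtered by feature.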
The ASPseek tool (http://www.aspseek.org/) consists of an indexing robot, a search daemon, and a CGI search front-end. It is an outdated tool, so it is not a reliable option in this scenario. Its major advantage is that it supports external parsers. However, besides being outdated, it cannot scale to global web crawling, since it is based on a relational database.
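To make the idea of external parser support concrete for music files, the following minimal sketch shows a crawler-side registry that dispatches on MIME type, with one parser that reads an ID3v1 tag from MP3 data. It is a generic illustration and not ASPseek's actual mechanism or configuration: the names `PARSERS`, `register`, `parse_id3v1` and `extract` are hypothetical, and only the ID3v1 layout (the last 128 bytes of the file, beginning with `TAG`) is taken from the tag format itself.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: MIME type -> function turning raw bytes into index fields.
Parser = Callable[[bytes], Dict[str, str]]
PARSERS: Dict[str, Parser] = {}


def register(mime_type: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser for one MIME type."""
    def wrap(func: Parser) -> Parser:
        PARSERS[mime_type] = func
        return func
    return wrap


def _text(raw: bytes) -> str:
    """Decode a fixed-width ID3v1 field, dropping NUL or space padding."""
    return raw.split(b"\x00", 1)[0].decode("latin-1").strip()


@register("audio/mpeg")
def parse_id3v1(data: bytes) -> Dict[str, str]:
    """Read the ID3v1 tag stored in the last 128 bytes of an MP3 file."""
    tag = data[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return {}  # no ID3v1 tag present
    return {
        "title": _text(tag[3:33]),
        "artist": _text(tag[33:63]),
        "album": _text(tag[63:93]),
        "year": _text(tag[93:97]),
    }


def extract(mime_type: str, data: bytes) -> Optional[Dict[str, str]]:
    """Dispatch to the registered parser; None means the type is unsupported."""
    parser = PARSERS.get(mime_type)
    return parser(data) if parser else None
```

A crawler built this way would call `extract("audio/mpeg", body)` on each downloaded MP3 and feed the returned fields to its indexer; a type without a registered parser, such as SWF in this sketch, simply yields `None`.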
Bixo (http://openbixo.org) is a web-mining toolkit that runs as a series of cascading pipes on top of Hadoop (mostly used by companies/services such as Bebo, EMI Music, Share This and Bixo Labs). Bixo is a tool that might be very interesting to projects looking for a web-mining
André Ricardo is with ISCTE Instituto Universitário de Lisboa (ISCTE-IUL), Av. das Forças Armadas, 1649-026 Lisboa, Portugal.
Carlos Serrão is with the ISCTE Instituto Universitário de Lisboa (ISCTE-IUL), IUL School of Technology and Architecture, Department of Information Science and Technology (ISTA/DCTI), Av. das Forças Armadas, 1649-026 Lisboa, Portugal.