Report on the ‘Metasearching, Better Searching?’ Conference, 22nd July 2004

Document Title: Report on ‘Metasearching, Better Searching?’ Conference
File Name: 137238.doc
File Size: 119KB
Pages: 9
Revision: 0.1
Last Modification: 09/08/2004
Author: Adrian Stevenson

Report on the ‘Metasearching, Better Searching?’ Conference, Oxford, 22nd July 2004

Contents
1. Introduction
2. Conference Topics
2.1 An Overview of Metasearching
    Cross-Searching
    Harvesting
    Hybrid
    Scraping Content
    Metasearching for all sectors
    Web Services
    Metasearch Requirements
    Metadata Issues
    Trust Issues
    Knowledge Bases
    NISO MetaSearch Initiative
2.2 The Integration of Course Management Systems, Library Systems, OpenURL Resolvers, and Content Repositories
2.3 The Ex Libris Approach
2.4 The Knowledge4Health Portal
2.5 Using Structured Metadata to Streamline and Refine Searching for News and Company Information from Different Collections and Repositories
2.6 Information Clustering and Natural Language Retrieval
    Vivisimo
    Verity K2
3. References


1. Introduction
The ‘Metasearching, better searching?’ conference was held at the Saïd Business School in Oxford on 22nd July 2004. The conference synopsis was:

“When searching was invented, many people thought the problems of information management were over. We know today that the problem only moved elsewhere. For example: content, especially online content, doesn’t sit conveniently in a single repository; it is typically distributed across many collections. Secondly, searching is for a purpose, and integrated computing allows the searcher seamlessly to make use of their search results in a different environment. Metasearching is a collective term for tools of this kind that aid searching and make it more powerful. This one-day conference looks at recent developments in metasearching. Open to non-members, the meeting will draw on best-case examples of theory and practice, and will be of interest to several sectors, including publishing, libraries, commercial organisations and education – in fact anyone who could benefit from integrating information retrieval more closely with their business.”

The agenda covered the following topics:
• An overview of Metasearching
• The integration of course management systems, library systems, OpenURL resolvers, and content repositories
• The Ex Libris approach
• The Knowledge4Health portal
• Using structured metadata to streamline and refine searching for news and company information from different collections and repositories
• Information clustering and natural-language retrieval

This report briefly summarises the issues discussed on the day.

2. Conference Topics
2.1 An Overview of Metasearching
Andy Powell, UKOLN

Web users such as researchers or tutors frequently require information from a variety of different sources. To find it, they usually have to search many different information service interfaces, each with a different look and feel, different metadata schemas and different subject classifications. The results are almost always supplied in HTML, which makes them difficult to merge. Users are searching not only services and portals such as the RDN, zetoc and COPAC but also image resources, e-prints, learning objects, and external and internal resources. If a user wants a local copy of the full range of search results, they often have to merge the results themselves, for example by creating a text file.

An indication of the scale of the problem can be seen in the figures for the JISC Information Environment in 2001, when there were 206 collections plus content from projects such as 5/99 and X4L.

Users require an effective means to search across all these varying resources. Metasearching aims to solve the problems of searching across disparate resources. This can be achieved via cross-searching and harvesting.

Cross-Searching
A portal sends a real-time query to a number of content providers and a result set is returned to the user. This commonly uses the Z39.50 protocol, and more recently may be achieved via SRW (Search/Retrieve Web Service) [1], which takes the core of Z39.50 and re-implements it as a Web Service.

Harvesting
Metadata is harvested into a service or portal using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [2]. The metadata is ‘pulled in’ and stored in a local database, so the user does not need to wait for the results of a real-time search across a network. Harvesting occurs periodically, such as once a day or once an hour, so the local copy may not be completely current.

Hybrid
A third option is to use a combination of cross-searching and harvesting. For example, the RDN uses harvesting, but the central database is also available for cross-searching.

Scraping Content
Many services do not support Z39.50, SRW or OAI-PMH. In this situation, the service’s Web interface has to be used and software has to be written to ‘scrape’ content out of the HTML/CGI search results. This is often difficult, laborious and unreliable.

Metasearching for all sectors
The need for metasearching is shared across many sectors, e.g. students, lecturers, researchers, and shoppers looking for a second-hand car or a new house. In the domain of e-learning, the IMS Digital Repositories Interoperability Specification (DRI) [3] addresses the issues of metasearching for learning materials.

Web Services
Andy Powell defined Web Services for the purposes of his presentation as strictly machine interfaces between services on the Web. Web Services use SOAP [4] to encapsulate messages in XML, which can be carried over HTTP and is therefore web friendly. With a SOAP-based search interface, a service can be integrated into a metasearch service (NB the service’s terms and conditions of use must be taken into account). The service API (Application Programming Interface) defines the kinds of queries that can be sent and the results that can be retrieved. APIs differ across services.
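To make the harvesting model described above more concrete, the following is a minimal Python sketch of the two steps a harvester performs: building an OAI-PMH ListRecords request and pulling Dublin Core titles out of the response for local storage. The repository address and the sample response are invented for illustration, and the sketch ignores paging (resumption tokens) and error handling; it is not taken from the presentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the unqualified Dublin Core elements inside an oai_dc record.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical response, trimmed to a single record for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Metasearching in practice</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Only ask for records changed since the last harvest.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def extract_titles(oai_xml):
    """Return every dc:title found in an OAI-PMH response document."""
    root = ET.fromstring(oai_xml)
    return [element.text for element in root.iter(DC_NS + "title")]


if __name__ == "__main__":
    # The base URL is a placeholder, not a real repository.
    print(list_records_url("http://repository.example.org/oai", from_date="2004-07-22"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL on a schedule and write the extracted metadata to its local database, which is why searches are fast but may lag behind the source.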

Metasearch Requirements
For a metasearch to work, agreement must be reached on:
• The protocol used (SOAP, OAI-PMH, Z39.50)
• The query syntax
• The metadata format sent to the user (SUTRS, MARC, IEEE-LOM)
• Quality assurance (e.g. how to handle names, duplicates, choosing mandatory elements)
• Intellectual property rights and usage rights
• Middleware issues such as authentication
• How the user knows exactly what they’re searching – this is being addressed by the DCMI Collection Description Working Group [5]

Metadata Issues

Format
As users are likely to be searching cross-domain, it makes sense to use a cross-domain metadata schema. Dublin Core is a good contender for this and has become increasingly popular. Indeed, OAI-PMH requires repositories to support unqualified Dublin Core as a minimum. However, domains will continue to develop and use their own metadata schemas, such as the IEEE-LOM for learning objects. This means that mapping is required to enable cross-searching, but some of the semantic richness of the original metadata may be lost.

Common Meaning
There needs to be agreement amongst content providers about the meaning of terms, subject classifications and what a resource type actually consists of (e.g. ‘article’, ‘research paper’, ‘learning object’). There will inevitably be difficulties in reaching agreement about the meaning of metadata elements, as they are often used differently in different contexts.

Metadata Registries
Metadata practice is documented in ‘application profiles’ such as the e-GIF or the UK LOM Core. There is a need for these application profiles to be disclosed via registries such as the Information Environment Service Registry Project (IESR) [6].

Trust Issues
There are trust issues involved in using a portal, which are generally issues of authorisation. This area has been researched to some extent by the EDINA GetRef Service [7] and through the Shibboleth authentication software.

Knowledge Bases
Currently, research services that cross-search a number of services have to create and maintain their own ‘knowledge bases’. As the number of content providers increases, the maintenance of these knowledge bases will become more difficult.
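The mapping and knowledge-base points above can be illustrated together with a small Python sketch. It assumes a hypothetical local knowledge base holding, for each target service, the protocol it speaks and a field mapping into Dublin Core. The target names, the simplified LOM-style field paths and the mappings are all invented for illustration; the sketch only shows how unmapped fields are lost and how every new target adds another entry to maintain.

```python
# Hypothetical local knowledge base: one hand-maintained entry per target.
KNOWLEDGE_BASE = {
    "learning-objects": {
        "protocol": "SRW",
        # Simplified LOM-style field paths mapped to Dublin Core elements.
        "field_map": {
            "general.title": "title",
            "general.description": "description",
            "lifecycle.contribute.entity": "creator",
        },
    },
    "eprints": {
        "protocol": "OAI-PMH",
        # This target already supplies Dublin Core, so the mapping is trivial.
        "field_map": {
            "title": "title",
            "creator": "creator",
            "description": "description",
        },
    },
}


def to_dublin_core(target, record):
    """Map a native record to Dublin Core and report the fields that were lost."""
    field_map = KNOWLEDGE_BASE[target]["field_map"]
    mapped = {}
    dropped = []
    for field, value in record.items():
        if field in field_map:
            mapped.setdefault(field_map[field], []).append(value)
        else:
            # No Dublin Core equivalent in this mapping: the detail is lost.
            dropped.append(field)
    return mapped, sorted(dropped)


if __name__ == "__main__":
    native = {
        "general.title": "Introduction to Optics",
        "lifecycle.contribute.entity": "A. Author",
        "educational.typicalAgeRange": "16-18",
    }
    print(to_dublin_core("learning-objects", native))
```

In this toy case the age-range field has no counterpart in the mapping, which is the kind of semantic richness a cross-domain schema gives up.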


One answer is to do away with these local knowledge bases and have a single knowledge base in the form of a ‘service registry’. This would describe both the content of the collections and the technical details of their interfaces. Metadata formats for these descriptions need to be agreed. The DCMI Collection Description Working Group [5] is looking at how to describe the content. The technical information can be captured in Web Services Description Language (WSDL) [8], and in ZeeRex [9] (Z39.50 Explain, Explained and Re-Engineered in XML) for Z39.50-enabled services. The way that collection descriptions are made available to portals also needs to be agreed, e.g. using Z39.50, SRW or UDDI.

NISO MetaSearch Initiative
The NISO MetaSearch Initiative [10] is trying to bring the area of metasearching together. It is looking to enable:
• metasearch service providers to offer more effective and responsive services
• content providers to deliver enhanced content and protect their intellectual property
• libraries to deliver services that distinguish them from Google and other free web services.

2.2 The Integration of Course Management Systems, Library Systems, OpenURL Resolvers, and Content Repositories
John Davidson, Sentient Learning UK

Sentient’s involvement with metasearching arose from a particular problem they were asked to solve by a university library: academics were not communicating course reading lists to either the library or local book shops. In looking into this, they became aware of the sheer scale of the growth in information available in books, web-based resources, learning material repositories such as Merlot [11] and other places. They also became aware of the lack of integration between VLEs, portals, content management systems and library management systems.

Sentient is essentially a ‘reading list’ system that attempts to solve these problems by providing references not only to books but also to on-line journals, learning objects and other on-line resources in one place. The system can be integrated into a wide range of VLEs to give students direct, seamless access to the resources.

2.3 The Ex Libris Approach
James Culling, Ex Libris

James gave an outline of the metasearching possibilities from a commercial perspective. Ex Libris provides an institutional library portal system called ‘MetaLib’ [12] that enables users to access their institution’s e-resources. The company is best known for its ‘SFX’ product, which provides context-sensitive linking to ‘appropriate’ copies of resources via OpenURL link resolvers.

James pointed out that librarians have been dealing with the problem of searching multiple information resources for many years. This has traditionally been addressed by bibliographic instruction, but this model is problematic in the domain of web-based resources.

The Ex Libris portal brings together resource discovery, metasearching and retrieval using metadata. It is a live query system that doesn’t use any local indexing. It combines structured searching using SRW methods with unstructured searching using screen scraping. James noted that merging the results is a complex and major challenge, as the results returned are ordered in different ways and some method of ranking the merged results is required.

James also made reference to the NISO MetaSearch Initiative, which is looking to get input from vendors, content providers and the library community. Ex Libris sees the major challenges for the future as bringing in resources not yet integrated and achieving widespread adoption of interoperability standards.

The discussion that followed noted that there are hardly any SRW-enabled services currently available, and that there has been some resistance from vendors to creating XML gateways to their systems, as they prefer users to use their web interfaces.
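The result-merging challenge James described can be illustrated with a minimal Python sketch. It assumes each target returns its own ranked list and that relevance scores are not comparable across targets, so it falls back on a simple round-robin interleave with duplicate removal. This is one common baseline, not a description of how MetaLib merges results; the record structure and the `id` field are hypothetical.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked lists from several targets, dropping duplicates.

    Each list is assumed to be ordered best-first by its own target.
    """
    merged = []
    seen = set()
    # Take the 1st hit from every target, then the 2nd from every target, etc.
    for tier in zip_longest(*result_lists):
        for item in tier:
            if item is None:
                # This target has run out of results.
                continue
            if item["id"] not in seen:
                seen.add(item["id"])
                merged.append(item)
    return merged


if __name__ == "__main__":
    target_a = [{"id": "a1"}, {"id": "shared"}]
    target_b = [{"id": "shared"}, {"id": "b2"}, {"id": "b3"}]
    print([r["id"] for r in round_robin_merge([target_a, target_b])])
```

Round-robin treats every target as equally authoritative; a production system would usually need a ranking step on top, which is the harder problem noted in the presentation.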

2.4 The Knowledge4Health Portal
Hilary Ollerenshaw, North Bristol NHS Trust

The Knowledge4Health portal [13] provides access to quality-filtered healthcare resources, both internal and external, via PCs based on the wards. The key objectives were described as:
• A single point of access to Trust patient information
• Access to customised health care information
• Encouraging the sharing of knowledge resources

The portal brings together 11 sets of resources that currently include NHS Direct Online and zetoc. Athens is used for authentication. The full text of some of the resources is made available through ‘Dialog Datastar’ [14].

2.5 Using Structured Metadata to Streamline and Refine Searching for News and Company Information from Different Collections and Repositories
Chris Knowles, Magus Research

Chris was involved with a company that was addressing a requirement from investment banks to cross-search the databases of a number of subscription-based banking news services. This was a complex task that required the development of a different ‘site agent’ for each of the news services. The ranking needs were based on the authority of the publication, i.e. results from The Financial Times are ranked higher than those from The Times, and The Times results are ranked higher than those from, e.g., The Manchester Evening News.

Chris has worked for a number of companies that have sold solutions to knowledge-intensive sectors such as banking and the legal sector. He noted that it was very difficult to apply generic business logic across domains: the legal sector wanted to search external systems using the same methodologies they were using to search their internal repositories, a very different approach to that of the banking sector.

Chris went on to give a demonstration of the Magus Research product, the Vrisko News Tracker system.
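The authority-based ranking described above can be sketched in a few lines of Python. The ordering of the three publications comes from the example given in the talk; the lookup table, the record structure and the handling of unlisted sources are assumptions made for illustration and do not describe the Magus Research implementation.

```python
# Lower number = more authoritative. Ordering taken from the talk's example.
AUTHORITY = {
    "The Financial Times": 0,
    "The Times": 1,
    "The Manchester Evening News": 2,
}


def rank_by_authority(results):
    """Order results by publication authority.

    Sources missing from the table are assumed to rank last. Python's sort
    is stable, so results from the same source keep their original order.
    """
    return sorted(results, key=lambda r: AUTHORITY.get(r["source"], len(AUTHORITY)))


if __name__ == "__main__":
    hits = [
        {"source": "The Times", "headline": "t1"},
        {"source": "Unlisted Wire", "headline": "u"},
        {"source": "The Manchester Evening News", "headline": "m"},
        {"source": "The Financial Times", "headline": "f"},
        {"source": "The Times", "headline": "t2"},
    ]
    print([h["headline"] for h in rank_by_authority(hits)])
```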

2.6 Information Clustering and Natural Language Retrieval
Martin Kelly, Institute of Physics Publishing

Traditional searching methods have some inherent problems, some of which were identified by the audience at this presentation. The major problems identified were: not having a very specific search (i.e. wanting to search a subject area in a more general way), not knowing which terms to use in a search, and being overwhelmed by the number of results returned.

The Institute decided to investigate complementary approaches. The challenge was to be able to explore large datasets whilst reducing overload and providing context to search results. They have been looking at the potential of searching using ‘clustering’, which is the classification of data using structured taxonomies. They presented case studies of their observations of using two software products for this, Vivisimo [15] and Verity K2 [16].

Vivisimo
This software looks for patterns in a retrieved dataset and dynamically builds taxonomies in real time based on these patterns. The user is presented with a browse ‘tree’ and can then drill down into the taxonomy to get a more focused set of results for their research area.

Conclusions on the use of Vivisimo:
• It is a problem to create a taxonomy for several thousand records in this way, as it is very demanding of computer processing power. Therefore, the cluster size had to be limited. After some research into this, the Institute decided on limiting the clustering to the first 250 results.
• The original results were ordered by the relevance ranking technology within Vivisimo, but this may not match the research requirements of the user.
• Vivisimo is useful for unstructured information with very little or no metadata.
• It is also easier to implement than Verity K2.
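The idea of building clusters on the fly from only the first few hundred results can be illustrated with a toy Python sketch. It groups result titles under terms they share and looks only at the first `limit` results, using the 250-result cut-off mentioned above as the default. The term-sharing heuristic and the stopword list are stand-ins chosen for brevity; they are not Vivisimo’s algorithm, which was not described in the talk.

```python
import re
from collections import defaultdict

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "and", "for", "in", "of", "on", "the", "with"}


def cluster_top_results(titles, limit=250, max_clusters=5):
    """Group the first `limit` titles under terms shared by two or more of them."""
    head = titles[:limit]  # cap the work, as the Institute did
    term_to_docs = defaultdict(list)
    for index, title in enumerate(head):
        terms = set(re.findall(r"[a-z]+", title.lower())) - STOPWORDS
        for term in terms:
            term_to_docs[term].append(index)
    # Keep terms that actually group results, largest clusters first.
    labels = sorted(
        (term for term, docs in term_to_docs.items() if len(docs) >= 2),
        key=lambda term: (-len(term_to_docs[term]), term),
    )
    return {term: [head[i] for i in term_to_docs[term]] for term in labels[:max_clusters]}


if __name__ == "__main__":
    sample = [
        "Quantum dots in lasers",
        "Laser cooling of atoms",
        "Quantum computing with atoms",
        "Plasma physics",
    ]
    # With limit=3 the fourth title is never examined.
    print(cluster_top_results(sample, limit=3))
```

Because only the head of the result list is examined, the clusters reflect whatever the relevance ranking placed first, which is the limitation noted in the conclusions above.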

Verity K2
This is a powerful top-end searching technology that includes clustering tools. The Institute has implemented this for the New National Journal of Physics. However, performance was an issue, and therefore it was decided to ‘can’ clusters overnight using the INSPEC classification tree [17].

Conclusions on the use of Verity K2:
• The cost could be prohibitive, as this software does not come cheap (the presenter declined to give specific figures)
• Performance was an issue, and so canning was considered to be the only option
• Verity K2 is a good option for records with high quality metadata, but would not be suitable for records without metadata
• It is not an out-of-the-box solution and the implementation issues turned out to be complex and time-consuming, though the software did have a substantial amount of flexibility built into it
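The ‘canning’ approach mentioned above, where clusters are precomputed overnight rather than built at query time, might look roughly like the following Python sketch. It assumes every record already carries one or more classification codes; the codes shown are placeholders rather than real INSPEC codes, and nothing here reflects the Verity K2 API.

```python
from collections import defaultdict


def build_canned_clusters(records):
    """Overnight batch step: map each classification code to its record ids."""
    clusters = defaultdict(set)
    for record in records:
        for code in record["codes"]:
            clusters[code].add(record["id"])
    return dict(clusters)


def clusters_for_results(result_ids, canned):
    """Query-time step: a cheap lookup against the precomputed clusters."""
    grouped = {}
    for code in sorted(canned):
        # Preserve the order in which the search engine returned the hits.
        hits = [rid for rid in result_ids if rid in canned[code]]
        if hits:
            grouped[code] = hits
    return grouped


if __name__ == "__main__":
    catalogue = [
        {"id": 1, "codes": ["CODE-A"]},
        {"id": 2, "codes": ["CODE-A", "CODE-B"]},
        {"id": 3, "codes": ["CODE-B"]},
    ]
    canned = build_canned_clusters(catalogue)
    print(clusters_for_results([3, 2], canned))
```

The trade-off is the one reported in the talk: query-time cost is low, but the approach depends on records having good classification metadata in the first place.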


3. References
[1] Search/Retrieve Web Service <http://www.loc.gov/z3950/agency/zing/srw/>
[2] The Open Archives Initiative Protocol for Metadata Harvesting <http://www.openarchives.org/OAI/openarchivesprotocol.html>
[3] IMS Digital Repositories Specification <http://www.imsglobal.org/digitalrepositories/>
[4] SOAP Version 1.2 Part 1: Messaging Framework <http://www.w3.org/TR/soap12-part1/>
[5] DCMI Collection Description Working Group <http://dublincore.org/groups/collections/>
[6] Information Environment Service Registry Project <http://www.mimas.ac.uk/iesr/>
[7] EDINA GetRef Service <http://edina.ac.uk/getref/>
[8] Web Services Description Language (WSDL) 1.1 <http://www.w3.org/TR/wsdl>
[9] ZeeRex: The Explainable “Explain” Service <http://explain.z3950.org/>
[10] NISO MetaSearch initiative <http://www.niso.org/committees/MetaSearch-info.html>
[11] Merlot: Multimedia Educational Resource for Learning and Online Teaching <http://www.merlot.org/>
[12] MetaLib: The Library Portal <http://www.exlibrisgroup.com/metalib.htm>
[13] Knowledge4Health Portal <http://www.k4h.northbristol.nhs.uk/>
[14] Dialog Datastar <http://www.dialog.com/products/productline/datastar.shtml>
[15] Vivisimo clustering engine <http://vivisimo.com/>
[16] Verity K2 Enterprise <http://www.verity.com/products/k2_enterprise/>
[17] Outline of INSPEC Classification 1999 <http://www.iee.org/publish/support/inspec/document/class/classif.cfm>
