
Report on the ‘Metasearching, Better Searching?’ Conference, 22nd July 2004

Document Title: Report on ‘Metasearching, Better Searching?’ Conference
File Name: 20497116.doc
File Size: 119KB
Pages: 9
Document Revision No.: 0.1
Last Modification: 09/08/2004
Author: Adrian Stevenson
Report on the ‘Metasearching, Better Searching?’ Conference, Oxford, 22nd July 2004

Contents

1. INTRODUCTION

2. CONFERENCE TOPICS
2.1 An Overview of Metasearching
    Cross-Searching
    Harvesting
    Hybrid
    Scraping Content
    Metasearching for all sectors
    Web Services
    Metasearch Requirements
    Metadata Issues
    Trust Issues
    Knowledge Bases
    NISO MetaSearch Initiative
2.2 The Integration of Course Management Systems, Library Systems, OpenURL Resolvers, and Content Repositories
2.3 The Ex Libris Approach
2.4 The Knowledge4Health Portal
2.5 Using Structured Metadata to Streamline and Refine Searching for News and Company Information from Different Collections and Repositories
2.6 Information Clustering and Natural Language Retrieval
    Vivisimo
    Verity K2

3. REFERENCES


1. Introduction
The ‘Metasearching, Better Searching?’ conference was held at the Saïd Business
School in Oxford on 22nd July 2004. The conference synopsis was:

“When searching was invented, many people thought the problems of
information management were over. We know today that the problem only
moved elsewhere. For example: content, especially online content, doesn’t
sit conveniently in a single repository; it is typically distributed across
many collections. Secondly, searching is for a purpose, and integrated
computing allows the searcher seamlessly to make use of their search
results in a different environment.

Metasearching is a collective term for tools of this kind that aid searching
and make it more powerful.

This one-day conference looks at recent developments in metasearching.
Open to non-members, the meeting will draw on best-case examples of
theory and practice, and will be of interest to several sectors, including
publishing, libraries, commercial organisations and education – in fact
anyone who could benefit from integrating information retrieval more
closely with their business.”

The agenda covered the following topics:

• An overview of Metasearching
• The integration of course management systems, library systems, OpenURL
resolvers, and content repositories
• The Ex Libris approach
• The Knowledge4Health portal
• Using structured metadata to streamline and refine searching for news and
company information from different collections and repositories
• Information clustering and natural-language retrieval

This report briefly summarises the issues discussed on the day.

2. Conference Topics

2.1 An Overview of Metasearching

Andy Powell, UKOLN

Web users such as researchers or tutors frequently require information from a
variety of different sources. To do this the user is usually required to search many
different information service interfaces, each with a different look and feel,
different metadata schemas and subject classifications. The results are almost
always supplied in HTML, which makes them difficult to merge. Users are
searching not only services and portals such as the RDN, zetoc and COPAC but
also image resources, e-prints, learning objects, external and internal resources. If
a user wants to obtain a local copy of the range of search results, they often have
to merge the results themselves, for example by creating a text file.

An indication of the scale of the problem can be seen in the figures for the JISC
Information Environment in 2001, when there were 206 collections plus content
from projects such as 5/99 and X4L.

Users require an effective means to search across all these varying resources.
Metasearching aims to solve the problems of searching across disparate
resources. This can be achieved via cross-searching and harvesting.

Cross-Searching

A portal sends a real-time query to a number of content providers and a results
set is returned to the user. This commonly uses the Z39.50 protocol, and more
recently may be achieved via SRW (Search/Retrieve Web Service) [1], which
takes the core of Z39.50 and re-implements it as a Web Service.
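
As a rough illustration of the cross-searching pattern, the sketch below fans one query out to several providers in parallel and pools whatever comes back. The two provider functions are hypothetical stand-ins for real Z39.50 or SRW clients, which are not shown; a real portal would also need timeouts and error handling for slow or unavailable providers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical providers: each stands in for a real Z39.50 or SRW client
# and returns a list of (title, source) tuples for a query.
def search_catalogue(query):
    return [(f"{query}: a catalogue record", "catalogue")]

def search_eprints(query):
    return [(f"{query}: an e-print", "eprints")]

def cross_search(query, providers):
    """Send the same query to every provider at once and pool the results."""
    results = []
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        # map() yields results in provider order, so the output is deterministic.
        for hits in pool.map(lambda provider: provider(query), providers):
            results.extend(hits)
    return results

hits = cross_search("metasearch", [search_catalogue, search_eprints])
```

Because the user waits for the slowest provider, the response time of a cross-search is bounded by the least responsive service in the list.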

Harvesting

This uses a mechanism by which metadata is harvested into a service or portal
using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [2].
The metadata is ‘pulled in’ and stored in a local database. The user
therefore does not need to wait for the results of a real-time search across a
network. Harvesting occurs periodically, for example once a day or once an hour,
so the local copy may not be completely current.
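
A harvesting step might look something like the following sketch, which builds an OAI-PMH ListRecords request and reads identifiers and titles out of a response. The repository address and the sample response are invented for illustration, and the response is heavily trimmed; a real harvester would also fetch the URL over HTTP and follow resumption tokens, neither of which is shown.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository address; real base URLs differ per provider.
BASE_URL = "http://repository.example.org/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, from_date=None):
    """Build a ListRecords request for unqualified Dublin Core records."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # only records changed since the last harvest
    return base_url + "?" + urlencode(params)

def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs

# A hand-written, heavily trimmed response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# The "local database" here is just a dictionary keyed by identifier.
local_store = dict(parse_records(SAMPLE_RESPONSE))
```

Passing a `from_date` on each run keeps the periodic harvest incremental, which is why the local copy lags the source by at most one harvesting interval.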

Hybrid

A third option is to use a combination of cross-searching and harvesting. For
example, the RDN uses harvesting, but the central database is also available for
cross-searching.

Scraping Content

Many services do not support Z39.50, SRW or OAI-PMH. In this situation, the
service’s Web interface has to be used and software has to be written to ‘scrape’
content out of the HTML/CGI search results. This is often difficult, laborious and
unreliable.
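
To give a feel for why scraping is fragile, here is a minimal sketch that pulls result titles out of a results page. The page markup and the `class="result"` convention are assumptions about one hypothetical site; every real service would need its own rules, and any redesign of the page can silently break them.

```python
from html.parser import HTMLParser

class ResultScraper(HTMLParser):
    """Collect the text of every <a class="result"> link on a results page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_result = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a" and ("class", "result") in attrs:
            self._in_result = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_result:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_result = False

# Invented results page for one hypothetical service.
PAGE = """<html><body>
<a class="result" href="/item/1">Metasearch in libraries</a>
<a href="/help">Help</a>
<a class="result" href="/item/2">Harvesting &amp; cross-searching</a>
</body></html>"""

def scrape_titles(html):
    scraper = ResultScraper()
    scraper.feed(html)
    scraper.close()
    return [title.strip() for title in scraper.titles]
```

If the site renamed the class or moved titles into a different element, `scrape_titles` would return an empty list rather than raise an error, which is one reason scraped sources are hard to trust.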

Metasearching for all sectors

The need for metasearching is shared across many sectors, e.g. students,
lecturers, researchers, shoppers looking for a second-hand car or new house.

In the domain of e-learning the IMS Digital Repositories Interoperability
Specification (DRI) [3] addresses the issues of metasearching for learning
materials.

Web Services

Andy Powell defined Web Services for the purposes of his presentation as strictly
machine interfaces between services on the Web.

Web Services use SOAP [4] to encapsulate transport information in XML. SOAP can
operate over HTTP and is therefore web friendly. With a SOAP-based search
interface, a service can be integrated into a metasearch service. (NB: the
service’s terms and conditions of use must be taken into account.)

The service API (Application Programming Interface) defines the kinds of queries
that can be sent and the results retrieved. APIs differ across services.
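
The following sketch shows roughly what wrapping a search request in a SOAP envelope involves. The envelope namespace is the SOAP 1.2 one; the service namespace and the `searchRequest`, `query` and `maximumRecords` element names are hypothetical and would in practice be dictated by each service's own API.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SEARCH_NS = "http://example.org/search"  # hypothetical service namespace

def build_search_envelope(query, max_records=10):
    """Wrap a hypothetical search request in a SOAP 1.2 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{SEARCH_NS}}}searchRequest")
    ET.SubElement(request, f"{{{SEARCH_NS}}}query").text = query
    ET.SubElement(request, f"{{{SEARCH_NS}}}maximumRecords").text = str(max_records)
    # The resulting XML string would be sent as the body of an HTTP POST.
    return ET.tostring(envelope, encoding="unicode")

envelope_xml = build_search_envelope("dc.title = metasearch")
```

Because the element names inside the body vary from service to service, a metasearch portal needs a separate request builder, or a shared agreed profile, for each API it talks to.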


Metasearch Requirements

For a metasearch to work, agreement must be made on:

• The protocol used (SOAP, OAI-PMH, Z39.50)
• The query syntax
• The metadata format sent to the user (SUTRS, MARC, IEEE-LOM)
• Quality assurance (e.g. how to handle names, duplicates, choosing
mandatory elements)
• Intellectual property rights and usage rights
• Middleware issues such as authentication
• How the user knows exactly what they’re searching – this is being
addressed by the DCMI Collection Description Working Group [5]

Metadata Issues

Format

As users are likely to be searching cross-domain, it makes sense to use a
cross-domain metadata schema. Dublin Core is a good contender for this and has
become increasingly popular. Indeed, OAI-PMH requires repositories to support
unqualified Dublin Core.

However, domains will continue to develop and use their own metadata schemas,
such as the IEEE-LOM for learning objects. This means that mapping is required to
enable cross searching, but some of the semantic richness of the original resource
may be lost.
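
The loss of semantic richness can be illustrated with a small crosswalk sketch. The source record below is invented and its field names are only loosely modelled on IEEE LOM categories; they are not the real schema, and the mapping table is an assumption made for the example.

```python
# Hypothetical, heavily simplified learning-object record. The field names
# are loosely modelled on IEEE LOM categories and are not the real schema.
lom_record = {
    "general.title": "Introduction to Z39.50",
    "general.description": "A short tutorial for library staff.",
    "educational.typicalAgeRange": "18-",
    "educational.interactivityType": "expositive",
}

# Crosswalk: source field -> Dublin Core element. Fields without a
# reasonable Dublin Core equivalent are simply absent from the table.
LOM_TO_DC = {
    "general.title": "title",
    "general.description": "description",
}

def to_dublin_core(record, crosswalk):
    """Map a record to Dublin Core and report which fields were dropped."""
    mapped = {crosswalk[k]: v for k, v in record.items() if k in crosswalk}
    dropped = sorted(k for k in record if k not in crosswalk)
    return mapped, dropped

dc_record, lost_fields = to_dublin_core(lom_record, LOM_TO_DC)
```

Here the two educational fields have no target element, so they disappear from the mapped record; a user cross-searching via Dublin Core could no longer filter on them.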

Common Meaning

There needs to be agreement amongst content providers about the meaning of
terms, subject classifications and what a resource type actually consists of (e.g.
‘article’, ‘research paper’, ‘learning object’). There will inevitably be difficulties in
reaching agreement about the meaning of metadata elements, as they are often
used differently in different contexts.

Metadata Registries

Metadata practice is documented in ‘application profiles’ such as the eGIF or the
UK LOM Core. There is a need for these application profiles to be disclosed via
registries such as the Information Environment Service Registry Project (IESR) [6].

Trust Issues

There are trust issues involved in using a portal, which are generally issues of
authorisation. This area has been researched to some extent by the EDINA GetRef
Service [7] and Shibboleth Authentication software.

Knowledge Bases

Currently, research services that cross-search a number of services have to
create and maintain their own ‘knowledge bases’. As the number of content
providers increases, the maintenance of these knowledge bases will become more
difficult.


One answer is to do away with these local knowledge bases and have a single
knowledge base in the form of a ‘service registry’. This would describe the content
of the collections and the technical interface details. Metadata formats need to be
agreed for how to describe these. The Dublin Core Collection Description Working
Group is looking at how to describe this content. The technical information can be
captured in Web Services Description Language (WSDL) [8], and ZeeRex [9]
(Z39.50 Explain, Explained and Re-Engineered in XML) for Z39.50 enabled
services.

There is a need to agree the way that collection descriptions are made available
to portals, e.g. using Z39.50, SRW, UDDI.

NISO MetaSearch Initiative

The NISO MetaSearch Initiative [10] is trying to bring the area of metasearching
together. It is looking to enable:

• metasearch service providers to offer more effective and responsive
services
• content providers to deliver enhanced content and protect their intellectual
property
• libraries to deliver services that are clearly distinguished from Google and
other free web services.

2.2 The Integration of Course Management Systems, Library
Systems, OpenURL Resolvers, and Content Repositories

John Davidson, Sentient Learning UK

Sentient’s involvement with metasearching arose from solving a particular
problem they were asked to work on by a university library. The problem was that
academics were not communicating course reading lists to either the library or
local book shops. As a result of looking into this they became aware of the sheer
scale of the growth in information available in books, web based resources,
learning material repositories such as Merlot [11] and other places. They also
became aware of the lack of integration between VLEs, portals, content
management systems and the library management systems.

Sentient is essentially a ‘reading list’ system that attempts to solve these
problems by providing references not only to books but also to on-line journals,
learning objects and other on-line resources in one place. The system can be
integrated into a wide range of VLEs to give students direct, seamless access to
the resources.

2.3 The Ex Libris Approach

James Culling, Ex Libris

James gave an outline of the metasearching possibilities from a commercial
perspective. Ex Libris provides an institutional library portal system called
‘MetaLib’ [12] that enables users to access their institution’s e-resources. The
company is best known for its ‘SFX’ product, which provides context-sensitive
linking to ‘appropriate’ copies of resources via OpenURL link resolvers.
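
The idea behind OpenURL linking can be sketched as follows: citation metadata is appended to the address of the institution's own resolver, which then decides which copy is ‘appropriate’ for that user. The resolver address and the ISSN below are invented, and the key names are an assumption loosely following early OpenURL conventions rather than a statement of what SFX requires.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; each institution runs its own.
RESOLVER = "http://resolver.example.ac.uk/sfx"

def make_openurl(resolver, **citation):
    """Append citation metadata to the resolver address as a query string."""
    return resolver + "?" + urlencode(citation)

# Invented citation details, used only to show the shape of the link.
link = make_openurl(RESOLVER, genre="article", issn="1234-5678",
                    volume="12", spage="45", date="2004")
```

The same citation combined with a different `resolver` value gives a link that resolves against another institution's holdings, which is what makes the linking context sensitive.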

James pointed out that librarians have been dealing with the problem of searching
multiple information resources for many years. This has traditionally been
addressed by bibliographic instruction, but this model is problematic in the
domain of web based resources.

The Ex Libris portal brings together resource discovery, metasearching and
retrieval using metadata. It is a live query system that doesn’t use any local
indexing. A combination of structured searching using SRW methods and
unstructured searching using screen scraping is employed.

James noted that merging the results is a complex and major challenge as results
returned are ordered in different ways and some method of ranking the results is
required. James also made reference to the NISO MetaSearch Initiative which is
looking to get input from vendors, content providers and the library community.
Ex Libris see the major challenges for the future as being bringing in resources not
yet integrated and the widespread adoption of interoperability standards.
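
One simple way to combine result lists that arrive in different orders is to interleave them by rank and drop duplicates, as in the sketch below. This is an illustrative approach only, not a description of how MetaLib merges results, and the title-based duplicate check is deliberately crude.

```python
from itertools import zip_longest

def merge_round_robin(*result_lists):
    """Interleave ranked lists from several sources, skipping duplicates.

    Titles are compared case-insensitively after trimming whitespace, which
    is a crude stand-in for real duplicate detection.
    """
    merged, seen = [], set()
    for rank_group in zip_longest(*result_lists):
        for title in rank_group:
            if title is None:
                continue  # this source has run out of results
            key = title.strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(title)
    return merged

source_a = ["Metasearch basics", "OpenURL explained", "Z39.50 in practice"]
source_b = ["openurl explained", "Harvesting with OAI-PMH"]
merged = merge_round_robin(source_a, source_b)
```

Round-robin merging treats every source's ranking as equally trustworthy; where sources use incomparable relevance scores, some such neutral scheme, or an explicit re-ranking step, is needed.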

The following discussion noted that there are hardly any SRW-enabled services
currently available and that there has been some resistance from vendors to
creating XML gateways to their systems, as they prefer users to use their web
interface.

2.4 The Knowledge4Health Portal

Hilary Ollerenshaw, North Bristol NHS Trust

The Knowledge4Health portal [13] provides access to quality-filtered healthcare
resources, provided internally and externally, via PCs based on the wards. The key
objectives were described as:

• A single point of access to Trust patient information
• Access to customised health care information
• Encourage the sharing of knowledge resources

The portal brings together 11 sets of resources that currently include NHS Direct
Online and zetoc. Athens is used for authentication. The full text of some of the
resources is made available through ‘Dialog Datastar’ [14].

2.5 Using Structured Metadata to Streamline and Refine
Searching for News and Company Information from Different
Collections and Repositories

Chris Knowles, Magus Research

Chris was involved with a company that was addressing a requirement from
investment banks to cross search the databases of a number of subscription
based banking news services. This was a complex task that required the
development of different ‘site agents’ for each of the news services. The ranking
needs were based on the authority of the publication, i.e. results from The
Financial Times are ranked higher than those from The Times, which in turn are
ranked higher than those from, e.g., The Manchester Evening News.
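
Authority-based ranking of this kind can be sketched very simply. The ordering of the three publications follows the example above, but the numeric scores and the sample headlines are invented, and nothing here is taken from the Magus Research product itself.

```python
# Hypothetical authority scores; higher means more authoritative. The
# ordering follows the example in the text, the numbers are invented.
AUTHORITY = {
    "The Financial Times": 3,
    "The Times": 2,
    "The Manchester Evening News": 1,
}

def rank_by_authority(stories):
    """Order (publication, headline) pairs by publication authority.

    sorted() is stable, so stories from the same publication keep the
    order in which their site agent returned them. Unknown publications
    get a score of 0 and sink to the bottom.
    """
    return sorted(stories, key=lambda s: AUTHORITY.get(s[0], 0), reverse=True)

# Invented headlines for illustration.
stories = [
    ("The Times", "Bank merger talks"),
    ("The Manchester Evening News", "Local branch closes"),
    ("The Financial Times", "Rates held"),
    ("The Times", "Profits up"),
]
ranked = rank_by_authority(stories)
```

A scheme like this is tied to one sector's view of which sources matter, which fits the observation that business logic is hard to reuse across domains.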

Chris has worked for a number of companies that have sold solutions to
knowledge intensive sectors such as banking and the legal sector. He noted that
it was very difficult to apply generic business logic across domains – the law
sector wanted to search external systems using the same methodologies they
were using to search their internal repositories, a very different approach to the
banking sector. Chris went on to give a demonstration of the Magus Research
product, the Vrisko News Tracker system.


2.6 Information Clustering and Natural Language Retrieval

Martin Kelly, Institute of Physics Publishing

Traditional searching methods have some inherent problems, some of which were
identified by the audience at this presentation. The major problems identified
were: not having a very specific search (i.e. wanting to search a subject area
in a more general way), not knowing which terms to use in a search, and being
overwhelmed by the number of results returned. The Institute decided to
investigate complementary approaches. The challenge was to be able to explore
large datasets whilst reducing overload and providing context to search results.

They have been looking at the potential of searching using ‘clustering’, which is
the classification of data using structured taxonomies. They presented case
studies of their observations of using two software products for this, Vivisimo [15]
and Verity K2 [16].

Vivisimo

This software looks for patterns in a retrieved dataset and dynamically builds
taxonomies in real time based on these patterns. The user is presented with a
browse ‘tree’ and can then drill down into the taxonomy to get a more focused set
of results for their research area.

Conclusions on the use of Vivisimo:

• It is a problem to create a taxonomy for several thousand records in this
way, as it is very demanding of computer processing power. Therefore, the
cluster size had to be limited. After some research into this, the Institute
decided on limiting the clustering to the first 250 results. The original
results were ordered by the relevance ranking technology within Vivisimo,
but this may not match the research requirements of the user.
• Vivisimo is useful for unstructured information with very little or no
metadata. It is also easier to implement than Verity K2.
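
The effect of capping the cluster size can be illustrated with a deliberately naive sketch that groups only the first 250 results by the words their titles share. This is not Vivisimo's algorithm, which the presentation did not describe; the stop-word list, the sample titles and the ‘shared word’ rule are all assumptions made for the example.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "for", "on"}
MAX_CLUSTERED = 250  # only the top-ranked results are clustered

def cluster_by_term(titles, limit=MAX_CLUSTERED):
    """Group the first `limit` titles under each significant word they contain.

    A naive stand-in for a clustering engine: a term becomes a cluster
    label only if at least two titles share it.
    """
    clusters = defaultdict(list)
    for title in titles[:limit]:
        words = {w.strip(".,:;").lower() for w in title.split()}
        for word in sorted(words - STOP_WORDS):
            clusters[word].append(title)
    return {term: hits for term, hits in clusters.items() if len(hits) > 1}

# Invented titles, assumed to be already ordered by relevance.
results = [
    "Quantum dots in semiconductors",
    "Optical properties of quantum wells",
    "Semiconductors for solar cells",
]
tree = cluster_by_term(results)
```

Anything ranked below the limit never reaches the browse tree, so the quality of the clusters depends on how well the initial relevance ranking matches what the user is looking for.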

Verity K2

This is a powerful top-end searching technology that includes clustering tools. The
Institute has implemented this for the New National Journal of Physics. However,
performance was an issue, and therefore it was decided to ‘can’ clusters overnight
using the INSPEC classification tree [17].

Conclusions on the use of Verity K2:

• The cost could be prohibitive, as this software does not come cheap (the
presenter declined to give specific figures)
• Performance was an issue, and so canning was considered to be the only
option
• Verity K2 is a good option for records with high quality metadata, but
would not be suitable for records without metadata
• It is not an out-of-the-box solution and the implementation issues turned
out to be complex and time-consuming, though the software did have a
substantial amount of flexibility built into it


3. References
[1] Search/Retrieve Web Service
< http://www.loc.gov/z3950/agency/zing/srw/ >
[2] The Open Archives Initiative Protocol for Metadata Harvesting
< http://www.openarchives.org/OAI/openarchivesprotocol.html >
[3] IMS Digital Repositories Specification
< http://www.imsglobal.org/digitalrepositories/ >
[4] SOAP Version 1.2 Part 1: Messaging Framework
< http://www.w3.org/TR/soap12-part1/ >
[5] DCMI Collection Description Working Group
< http://dublincore.org/groups/collections/ >
[6] Information Environment Service Registry Project
< http://www.mimas.ac.uk/iesr/ >
[7] EDINA GetRef Service
< http://edina.ac.uk/getref/ >
[8] Web Services Description Language (WSDL) 1.1
< http://www.w3.org/TR/wsdl >
[9] ZeeRex: The Explainable “Explain” Service
< http://explain.z3950.org/ >
[10] NISO MetaSearch initiative
< http://www.niso.org/committees/MetaSearch-info.html >
[11] Merlot: Multimedia Educational Resource for Learning and Online Teaching
< http://www.merlot.org/ >
[12] MetaLib: The Library Portal
< http://www.exlibrisgroup.com/metalib.htm >
[13] Knowledge4Health Portal
< http://www.k4h.northbristol.nhs.uk/ >
[14] Dialog Datastar
< http://www.dialog.com/products/productline/datastar.shtml >
[15] Vivisimo clustering engine
< http://vivisimo.com/ >
[16] Verity K2 Enterprise
< http://www.verity.com/products/k2_enterprise/ >
[17] Outline of INSPEC Classification 1999
< http://www.iee.org/publish/support/inspec/document/class/classif.cfm >
