You are on page 1of 22

DESIDOC Bulletin of Information T e c h m y , Vol. 21. No. 6, November 2001, pp.

3-24
Q 2001, DESIDOC

Building Digital Libraries: An Overview


Jagdish Arora

Abstract
Tools, techniques and protocols necessary for building up digital libraries have
now fully evolved with computing power that allows parallel processing,
multitasking, parallel consultation and parallel knowledge navigation and
sohare tools that facilitate artificial intelligence and interadivity. Coincided with
availability of sofhvare, hardware and networking technology, the advent of world
wide web 0, its ever increasing usage and highly evolved browsers have
paved the way for creation of digital libraries. With rapid developments in
technologies necessary for developing digital libraries, the world of digital
information resources has expanded rapidly and exponentially. Increasing
number of publishers are using the internet as a global way to offer their
publications to the international community of scientists and technologists. With
the technology available at an effordable cost, the libraries are initiating small
digitisation projects as an individual library or as a group of libraries. Building-up
digital collection and the infrastructure required to access them is a challenge
that every library has to deal with. The article delves into technological evolution,
cultural revolution and contents enrichment that led to revolution in growth and
development of digital libraries. The artide has two distinct parts, while the first
part deals with various aspects of building, accessing and organising digital
resources and collection, the second part elaborates on the process, technology,
formats, compression techniques and tools used in digital imaging. The article
describes optical character recognition (OCR) and advocates for hybrid solution
for preservation of digital information.

1.. INTRODUCTION articles, users had to depend heavily on


Computerisatioo of the library during past physical collection available either in their
few decades has focused heavily on creation institutional library or on interlibrary loan from
of surrogate records of printed documents other libraries for references retrieved from
available in a library or for providing the secondary services. Attempts were made
computerised services through secondary in past to make the full text of research
databases held locally on CD-ROM or articles available through online search
magnetic tapes. The integrated library services, although technology available till
packages have served well in providing late 1980s and early 1990s supported only
access to documents at bibliographic level. simple text (ASCII) without graphics. Tools,
Similarly, secondary services like MEDLINE, techniques and protocols necessary for
INSPEC, COMPENDEXplus and CAS have building-up digital libraries evolved with
proved themselves as effective tools for availability of computing power that allows
bibliographic control of research information. parallel processing, multitasking, parallel
However, since these databases provided consultation and parallel knowledge
only bibliographic information on research navigation and software tools that facilitate
-.- - -.-
DESlDOC Bulletin of Inf Technol,2001.21(6) 3
artificial intelligence and interactivity. Building-up digits\ collection and infrastructure
Coincided with the availability of software, required to access them is a challenge that
hardware and networking technology, the every library has to deal with. Today's digital
advent of world wide web (WWW), its ever libraries are built around Internet and web
increasing usage and highly. evolved technologies with electronic journals as their
browsers have paved the way for creation of building blocks. The increasing popularity of
digital libraries. With rapid developments in lnternet and developments in web
technologies necessary for developing digital technologies are catalyst to the concept of
libraries, the world of digital information digital library. Fig.. 7 is a pictorial
resources has expanded quickly and representation of digital library infrastructure
exponentially. A large numbers of STM and services that can be generated from
(science, technology and medical) electronic them.
journals are appearing on the web. Digital
information resources include not only rapidly 2. DIGITAL LI6RARY:VHISTORICAL
growing collection of electronic full text PERSPECTIVES
resources, but also images, video, sound, and Although the term digital library has gained
even objects of virtual reality.
popularity in recent years but they have
The most significant shift in building digital evolved along the technological ladder for
coflections is greater interoperability among past more than thirty years. In early 19?0s,
information systems across the country and digital libraries were built around mini and
internationally. Wtth the technology available mainframe computers providing remote
at an effordable cost, the libraries are access and online search and retrieval
initiating small digitisation projects as services to users using computer and
individual library or as a group of libraries. communication technology available at that

Library Sewices
- 0 P A C to WebPAC
sCDR0.M to Web Databases
-Manual to Digital Reference

Infrastructure

infrastructure Library SEwkes


.Virtual Library T o r n
.Librsry Wrb Site
.Library Ponalr

.Library Calndar
.Web F a m a
Standards,Protocols .Bmlklin Bomrds, Dbcurrion

Figurn 1. Digital Library Infrastructure and Services

4 DESIDOC Bulletin d l n f Technol, 2001,21(6)


time. The earliest application of digital library International Business Database, UMI's
concepts involved character-coded storage General Reference Periodicals, Espace
and full text indexing of legal and scientific World, US Patents, etc.
documents. The legal information through Digital document imaging system, which
electronics (LITE) system was first employs computer hardware and software to
implemented by the US Air Force in 1967. scan and store images of documents in
Several software packages were released digitized formats, evolved in early 1980s to
during mid-1970s and late 1970s for overcome the limitation of text storage and
computer-based storage, indexing and retrieval systems which could only store
retrieval of documents in character-coded textual information. The earliest application of
form. Some of the better known text storage a document imaging system was the Optical
and retrieval packages included: IBM's Disk Pilot Project at the Library of Congress.
storage and information retrieval system Several document imaging software
(STAIRS), battelle automated search packages are currently available in the
information system (BASIS), INQUIRE, market. OmniDoc (Newgen) and Datascan
BRSBEARCH, DOCUAUASTER, ASSASSIN, (Stacks India) are two important document
STATUS, CAIRS, etc. By 1980s, text storage imaging software from India.
and retrieval programs were available from
dozens of vendors for major computing The beginning of full text digital library
environment including main-frame, involved building-up several client systems
microcomputers and local area network usable in a multitude of environments, such
as MS Wndows, MS DOS, Apple Macintosh
(LAN).
and a diversity of UNlX systems as well as for
Sophisticated information storage and terminal-oriented mainframe systems, notably
retrieval systems were built during 1980s VT-100 and VT-220. Upscaling of digital
using state-of-the-art technology of distributed library in those days entailed huge
database management system linking maintenance problems because all clients
different . remote systems. These online system had to be upgraded and scaled for
information retrieval services used data files new facilities and emerging new techniques
generated in the process of electronic and processes.
phototypesetting of printed abstracting and
indexing services and other primary journals. However, 1990s brought in a true
As such, online hosts like DIALOG and STN revolution in digital library system. The advent
were not only offering online databases, but of WWW offered a crucial advantage with the
also full text online journals for past several availability of ready-to-use, publicly available,
years, although as a simple ASCII or text files user-friendly graphical web browser for all
without graphics and pictures. In 1989, there prevalent platforms. Standard WWW clients
were almost 1700 full text sources available such as Netscape Navigator and Internet
through'sixteen online systems. Availability of Explorer are being upgraded regularly for
CD-ROM in late 1980s as a media with high added functionality such as e-mail client,
storage capacity, longitivity, and ease of support for JAVA, Active X and the ability to
transportation triggered production of several view important document formats without
CD-ROM information products which were having to install plug-ins for them. These
earlier available through online vendors or as browsers solved the maintenance problem
conventional abstracting and indexing allowing developers to concentrate fully on
services in printed format. Moreover, several the server side and not to bother with the
full text databases also started appearing in ctient side. These browsers are available
late 1980s and early 1990s leading tb freely and are easy to use eliminating the
beginning of digital era. Some of the need of extensive support and user's training.
important full text digital collections available The internet and associated technologies,
on CD-ROM include: ADONIS, lEEG7EE made it possible for digital libraries to include
Electronic Library (IEL), ABIANFO, UMl's multimedia objects such as text, image, audio
DESIDOC Bulletin of Inf Technol, 2001,21(6) 5
and video. These internet and web making the transition into the digital world.
technologies thus brought-in the graphical Such resources are converted for increased
components to the digital library which was access and to reduce reliance on physical
missing in earlier digital library libraries. The transition resources are either
implementation. There has thus been a digitized images or images that are converted
steady move up the technological scale for to text by the process of OCR.
the digital libraries from previous (late 1980s) New digital resources: these are either
low-end electronic publications available as expressly created as digital or are created in
ASCII files, to being organised and parallel to print. Publishers are increasingly
searchable on gophers (1992), and to being moving to XML or SGML format. These
tagged and graphically viewable on WWW formats are used to generate datafiles
sites (1994). Recent growth and development required for producing printed outputs. The
in digital Libraires can be attributed to
SGMUXML databases are also used for
generating HTMLIPDFIXHTML or postscript
availability ~f the internet and web technology
file dynamically using appropriate DTDs. New
as a media of information presentation and digital resources are designed with a
delivery and convenience it offers. particular use in mind employing new Internet
and web technologies embodying a great
3. BUILD1NG-UP D1GITAL variation and value addition.
COLLECTIONS Future digital resources: there is an
The most important component of a digital increasingly wide range of digital resources
, library is the digital collection it holds or has from formally published electronic journals
access to. Viability and extent of usefulness and electronic books through databases and
of a digital library would depend upon the datasets in many formats, i.e., bibliographic,
critical mass of digital collection it has. A full text, image, audio, video, statistical and
digital library can have a wide range of numeric datasets. Future resources may
resources. It may contain both paper-based contain data sets which are not formally
specified.
conventional documents or information
contained in computer-processible form. A digital library is not a single entity
Information contents of a digital library, although if may have digital contents created
depending on the media type it contains, may in-house or acquired in digital formats stored
include a combination of structured/ locally on servers. A digital library may also
unstructured text, numerical data, scanned act as a portal site providing access to digital
images. graphics, audio and video recordings. collections held elsewhere. The digital
Different types o f resources need to be constituents of a digital library are shown in
handled differently in a digital library. fig. 2 and are described below:
us bridge'^ divided resources for a digital
library in following four distinct categories, i.e., 3.1 Acquisition of Collections
legacy, transition, new and future. available in Digital Formats
Legacy resources. these are largely Thousands of CD-ROM databases are
non-digital resources, including manuscript, currently available from multitude of CD-ROM
print, slides, maps, audio and video producers including Silver Platter which alone
recordings. lnspite of the fact that large produces more than 250 information products
investments are being made in the process of on CD-ROM. Moreover, several full text
digitisation of resources, vast majority of databases also started appearing in late
existing legacy resources will remain outside 1980s and early 1990s, launching the
the electronic domain for many years to beginning of a new digital era. CD-ROM
come. These legacy resources are the major
networking technology is now available for
resources of existing libraries
providing weta-based simultaneous access to
Transition resources: primarily designed for CD-ROM databases on the LAN as well as on
another medium (mostly print), are those wide area network (WAN). More evolved
which are being or have been digitised,
6 DESIDOC Bulletin of Inf Technol,2001,21(6)
Acquiring : Content Creation
Buying Access Born Digital
Di~italMedia

Digital Library

Content Creation
Scanning -.
Portal Sites
--- .-.
, Integrated
Access

Figure 2. Collection Infrastructure of a Digital Library

technology allows caching of the contents of


CD-ROMs on to a server, which, in turn, The technology is now available for
provides web-based simultaneous and faster creation of fully digitised multimedia products
access to the information contents of and make them accessibe through the
CD-ROMs. The libraries have an option to internet. Technological changes, especially
subscribe to these full text databases as a the internet and web technology, continue to
part of their digital library. attract more and more traditional players to
adopt it as a global way to offer their
The Silver Platter's Electronic Reference publications to the international community of
Library (ERL) technology facilitates uploading scientists and technologists. Most of the
of contents of ERL-complaint CD-ROM important publishers now have their
databases on the hard disc of an intranet web-based interfaces to offer full texts of their
server, which, in turn, provides integrated journals.
access and search on ERL-complaint
databases. Moreover, individual research The current electronic publishing market
articles in the ERL-complaint database are consists of traditional players offering
linked to their full text research articles using electronic versions of their print journals as
Silver Platter's Silver Linker. well as several new enterprises offering new
products and services that are 'borne digital'.
3.2 Buying Access to External The market also has several subscription
agents in their new role as aggregators.
Digital Collections
These players include:
The libraries will not become digital
libraries, but will rather acquire access to ever 3.2.1 Publishers and Scholarly
growing digital collections on behalf of their Societies
users. Majority of these collections would be Most well-known commercial publishers of
provided by external sources like commercial traditional journals such as Elsevier Science,
publishers, collections mounted by scholarly Kluwer Academic Press, Academic Press,
societies, resources at other libraries, Springer Verlag, Wiley Interscience and
electronic journal sites, etc. Internet has long scholarly societies such as SIAM, ACM,
been a favorite media for experimenting with IEEEIIEE are making their publications
electronic publishing and delivery. available online through their web sites.
DESIDOC Bulletin of Inf Techno!. 2001.21 (6) 7
Several universities host specialised on part of the publisher. Tenopir and King
collections on their web sites. Several (1997) in a study concluded that the cost of
universities, as members of the networked electronic journals can not be substantially
digital library of theses and dissertations lower than their printed versions.
(NDLTD) initiative, host doctoral dissertations Journals are made available through the
submitted to their respective universities. web at varying price models. In a survey of
3.2.2 Aggregators 8001 peer reviewed electronic journals
conducted by EBSCO, it was found that 50%
Third party aggregators provide access to
of electronic journals are free with their print
numerous journals from a variety of
journals, 34% require additional payment over
publishers. Aggregators include organisations
their print subscription and 16% are available
like JSTOR 4hat offer extensive backfiles for
online only without their print counter-part.
more than 100 academic journals and OCLC
Overall, 84% of joumals require a print
Electronic Collection Online which offer full
subscription to journals as a prerequisite for
text access to more than two thousand titles
online access to their electronic version3.
via their First Search service. Other
Following are prevalent pricing models:
aggregators like Lexis-Nexis, Bell and Howell
(UMI) and Web of Science (ISI) offer 3.3.1 Electronic Subscription linked
searchable indexes with links to full text to Print Subscription
journals on publisher's site. EBSCOHost, IAC The electronic subscription Do journals in
Trac SearchBank and Blacbell's Electronic most of the cases are linked to their printed
Journal Navigator (EJN) provide common counterparts, i.e., it may be offered free with
search interface for the journals aggregated print subscription (e.g. publications of
by them from assortment of publishers. American Society for Physics and ASCE) or
Growing number of susbcription agents are priced at a fixed percentage over the print
working with publishers to provide aggregated subscriptions (e.g. IEEE's ASPP package).
services for packages of titles or for full text
databases. 3.3.2 Electronic Subscription with
Campus Licenses
3.3 Pricing Model Electronic publishers facilitate campus
Total number of electronic journals wide unlimited access to subscribed journals
available on the web has grown steadily from on payment of a fixed amount of platform fee.
less than 10 in 1989 to 8500 in April, 2000. Example: Elsevier Science (ScienceDirect).
One of the major issue that the publishers 3.3.3 Electronic Subscriptions are
are concerned with is to save their economic Bundled
interest in the process of providing electronic
access to their printed publications. The Several electronic publishers offer access
publishers make a significant investment in to the entire range of their electronic journals
the process of production of a journal which and other publications bundled into one. For
involves activities like peer-review, example IEL and AGM Digital Library offer
administration, editing, layout design, access to their entire site on subscription.
production, subscription management and Access to individual journals or a subset is
distribution. Most activities that are performed not permissible. Similarly, Academic Press
for publishing a journal are common to both offer all journals available on their site
electronic and paper media, except for (Academic's Project IDEAL) for 10% more
production and distribution where the cost than the print subscription to library consortia.
involved is relatively low. Moreover, electronic
version of journals generally provides Publishers and aggregators have started
additional features like link to corrections, link experimenting with models wherein a user
to additional materials, e-mail link to can search a database online for a modest
author(s), etc., which require additional work
8 DESlDOC Bulletin of Inf Techno/,2001,21(6)
usage fee, identify articles of interest, and Data Unit (http:/hww.esdu.com) are some of
then call up such aftides in full text on a the important examples.
per-look basis. Electronic resources accessible on the
3.3.5 Electronic Only web for free or for a fee are undeniably major
and important constituent of a digital library.
A few publishers and aggregators have
started offering only electronic version of their 3.4 Converting Digital Datasets
journals providing a modest discount for those
who forego print versions. The libraries or the institutions
implementing digital libraries may have
3.3.6 Consortium Licensing datasets that are originally created in digital
Consortia provide union strength to format. Doctoral dissertations submitted to
negotiate with electronic publishers for the universities and research institutions are
best possible price and rights. Most undisputedly highly valuable documents that
publishers already have well-defined policies qualify to be an important component of any
and offer for libraries subscribing as digital library. Moreover, institutions may have
consortia. The consortia licensing is widely in-house journal(s), annual reports, technical
used the world over by the libraries. It is reports, or other datasets, that may be
slowly pickingup in India also. included in digital collection. Items listed
above are invariably composed in one of the
3.3.7 Nati~nalLicensing word processing programme or in a desk-top
National licenses can atso be negotiated publishing package.
with electronic publishers for core collections. The documents composed on word
Singapore, Taiwan and UK have arranged processing packages or desktop publishing
national licenses for some of the important full
packages can be converted into HTML,
text resources.
Postscript and PDF using tools like Acrobat
Besides, electronic journals, there are 4.0 or Acrobat Exchange. Online converters
several online databases that are now are also available through Adobe's site.
available through.the web including Medline HTML, as a defacto language of the web and
(several versions), AGRICOLA and ERIC (all PDF as a defacto standard for online
free). Most online search services like STN distribution of electronic information, can be
and DIALOG also have their web-based employed to facilitate transition from
interfaces. Reference works like computer processible files to a format
encyclopedia, dictionaries, handbooks, accessible on the web.
atlases, etc., are also making their electronic
For regular publications, the institutions
appearance on the web. Electronic resources
may adopt SGML or XML to provide structure
created exclusively for the web include
and functionality to their publications. Most
web-based educational tutorials called 'online
electronic publishers are increasingly using
courseware'.
SGMUXML to ripe the benefit that the format
The online courseware are proliferating the offer. SGML (or XML) documents provide
web as a strong contender for distant benefit of a database management system
education. Telecampus, Canada without being one. Publishers code the
(www.telecampus.edu/) lists more than accepted submissions in SGMUXML in a
12,000 online courseware available on the semi-automated process using assortment of
web. Moreover, highly specialised web sites sotWaf8 packages available b them or using
are now coming-up in various disciplines custom-made software specially designed for
khich offer information in totality including all this purpose. The database of SGML
kinds of resources in electronic format, El documents are used for providing search by
Engineering Village (http://www.ei.org/), ISI authors, keywords, etc., and browse the
Nectronic Library (http:/lwww. isinetxom), /EL content pages of journals. Behind the web
(http:l/www.ieee.org/), Engineering Sciences interface lies a relational database like Oracle
DESlDOC Bulletin of Inf Techno/,2001,21(6) 9
that store SGMLIXML documents. Search mirrored at Princeton University, USA. Using
and browse operation on highly strctured technology developed at Michigan, high
XMLISGML datasets provides dynamically resolution (600 dpi) bit-mapped images of
generated web pages (HTML-on-fly). These each page are linked to a text file generated
HTML. files provide link to full text of with OCR software. Linking a searchable text
documents in HTMLIPDFIPostScript, all file to the page images of the entire published
formats are generated dynamically from the record of a journal along with newly
same SGMLIXML datasets using pre-defined constructed table of contents, indexes,
DTDs. permits high level of access, search and
retrieval of the journal content previously
3.5 Conversion of Existing Print unimaginable8.
Media into Digital Format Capturing page image format is
Several digital library projects are comparatively easy and inexpensive, it is
concerned with providing digital access to reproduction of a page maintaining page
mateials that already exist with traditional integrity and originality. The scanned textual
libraries in printed media. Scanned page images, however, are not searchable unless it
images are practically the only reasonable is OCRed, which, in itself, is highly error
solutions for institutions such as libraries for prone process specially when it involves
converting existing paper collection (legacy scientific texts.
documents) without having access to the
original data in computer processible formats 3.6 Creating Portals or Gateways
convertible into HTMLISGML or in any other to the Electronic Collection
structured or unstructured text. Scanned available on the Web
images are natural choice of large-scale
conversions for major digital library initiatives. The web has become the most successful
Printed text, pictures, and figures are networked hypermedia-based system that
transformed into computer-processible forms allows rapid access to a wide variety of
using a digital scanner or a digital camera in a networked information resources. It allows
process called document imaging or linking amongst electronic resources stored
scanning. The digitally scanned images are on servers dispersed geographically on
stored in a file as a bit-mapped page image, distant locations. The portal sites or gateways
irrespective of the fact that a scanned page redirect a user to the holders of the original
contains a photograph, a line drawing or text. digital material. It may provide its own
A bit-mapped page image is a type of indexing and search services or may combine
computer graphic, litera!ly an electronic original resources from a number of different
picture of the page which can most easily be providers. The portal sites or the gateways
equated to a facsimile image of the page and restrict their operation to providing linkages to
as such they can be read by humans, but not independent third party sources. Home pages
by the computers, i.e.. 'text' in a page image of all the major education and research
is not searchable on a computer using the institutions, specially in developed world,
present-day technology. An image-based provide an organised and structured guide to
implementation require a large space for data electronic resources available on the internet.
storage and transmission. There are several Some of the major portal sites or gateways
large projects using page images as their that provide access to electronic resources on
primary storage format, including project the internet are as follows:
JSTOR (www.jstor.org) at Princeton WWW Virtual Library
University, USA funded by the Melon http:liwww.edoc.comi
Foundation. The project JSTOR has a Internet Publ~cLibrary
complete set of more than 120 journals http:l/www.ipl.org/
scanned and hosted on web servers that Michigan Electronic Library
reside at the University of Michigan-..- and is http://mel.lib.mi.us/
10 DESlDOC Bulletin of Inf Technot,2001.21 ( 6 )
Penn Electronic Library product called 'ENCompass' from Endeavour
http:llwww.library,upenn.edulresourcesl would come up as a single, seamless search
BUBL lnformation Service across disparate remote and local collections
http:l/bubl.ac.uk/ if?- (Science Direct Connections, 2000).
Argus Clearing House
http:llwww.clearinghouse.net/ 4. PRINT TO DIGITAL: OPTIONS
Internet Index FOR CONVERSlON
http:llsunsite.berkeley.edu/lnternetlndex/ The digital imaging technology offers a
number choices that can be adopted
3.7 Integrated Access Interface depending on the objective of scanning, end
Digital libraries typically integrate multitude users, availability of finances, etc. There are
of resources and media types. Constituents of four basic approaches that can be adapted to
a digital library may have: translate from print to digital:
(a) Collection acquired in digital form o Scanned as image
(b) Collections digitised in-house u OCR
(c) Buying access to electronic resources o Retaining page layout using acrobat
including e-journals capture
(d) Subject gateways and the library OPACs. o Re-keying the data.
In effect, a digital library may not only have 4.1 Scanned as lmage
multitude of resources but also multitude
mechanism to access these resources. Most lmage only is the lowest cost option in
libraries having sizable coHections in digital which each page is an exact replica or the
forms and have adopted two-fold strategy that original source document. Since OCR is not
include: (a) provide access to resources carried out, the document is not searchable.
through the library catalogue wherever Most scanning software generate TIFF
possible; and (b) provide access to electronic format, by default, which can be converted in
resourc5s and specialised collections through to PDF using a number of software tools.
the library home page. Scan to TIFFiPDF format is recommended
only when the requirement of project is to
In cases of electronic journals, web access make documents portable and accessible
via an alphabetical listing and/or subject index
from any computing platform. The images can
of all titles, offers a quick and simple means
be browsed through a table of contents
of inventory and direct hypertext links to full composed in HTML provid~nglink to scanned
text sources. Similarly, access to other images.
specialised collection can also be provided
through a set of menus that serves as rough 4.2 Optical Character Recognition
and ready finding aid. It is particularly useful
for institutions that have not implemented (OCR)
web-based catalogues and cannot offer The latest versions of both Xerox's
hypertext links from a catalogue record. On TextBridge and Caere's Omnipage
the other hand, access to e-journals is incorporate technology allow the option of
separate from the online catalogue and other maintaining text and graphics in their original
journals that are part of the library's collection. lay6ut as well as plain ASCII and
Web-based catalogues can enable users to word-processing formats. Output can also
connect directly to the full text source via include HTML with attributes l~ke bold,
hypertext links in the catalogue record. underline, and italic retained.
Acquisition of Endeavour's lnformation 4.2.1 Retaining Layout After OCR
system by the Elsevier Science, marks Several software packages now offer
integration of Elsevier's contents into facil~tyof retaining the page layout after it has
Endeavour's digital library technology. A- new
DESIDOC Bulletm of Inf Technol, 2001.21 (6)
been OCRed. The process for retaining the 4.3 Retaining Page Layout Using
page layout is software dependent. Caere's
Omn~prooffers two ways of retaining page Acrobat Capture
layout following OCR. It calls them True Page The Acrobat Capture 2.0 provides several
Classic and True Page Easy. True Page options for retaining not only the page layout
Classic places each paragraph within a but also the fonts, and to fit text into the exact
separate frame of a word processor into space occupied in the original, so that the
which the OCR output IS saved. If one wishes scanned and OCRed copy never over- or
to edit anything subsequently, then the under-shoots the page. Accordingly, it treats
relevant paragraph box may need to be unrecognizable text as pasted-in images,
r6sized. However, 'Easy Edit' facilitates which are perfectly readable by anyone
editing of pages without the necessity of looking at the PDF file, but which will be
resizing the boxes although there is a greater absent from the editable and searchable text
chances of spillage over the page. Xerox Text file. In contrast, ordinary OCR programs treat
Bridge offers similar feature called DocuRT unrecognised text as tildes or some other
which is broadly equivalent to True Page special character in the ASCII output. Acrobat
Easy Edit. The process of OCR dismantle the Capture can be used to scan pages as
page, OCR it, and then reassemble it in such images, image+text and as normal PDF, all
as way that all the component parts such as the three options retain page layout.
tabs, columns, table, graphics can be used in lmage Only: lmage only option has already
a text manipulation package such as word been described in option 1.
processor. Image+Text: In image+text solutions, a
The process of OCR results in OCRed text is generated for each image
computer-processible file that is less accurate where each page is an exact replica of the
than rekeying-in the data. At an accuracy ratio original and left untouched, however, the
of 98%, a page having 1800 characters will OCRed text sits behind the image and is used
have 36 error per page on an average. It is for searching. The OCRed text is generally
therefore, imperative to cleanup after OCR not corrected for errors since it is used only
unless original scanned image will be viewed for searching. The cost involved is much less
as page and OCR being used purely for than PDF Normal. However, the entlre page is
a bitmap and neither fonts nor line drawings
creating a searchable index on the words that
are vectorised, so the file size of lmage + Text
will be searched via a fuzzy retrieval engine PDFs is considerably larger than the
like Excalibur which is highly tolerant to OCR corresponding PDF Normal files and pages
errors. will not display as quickly or cleanly on
Another possibility for cleaned-up OCR is screen.
use of specialist OCR system such as Prime PDF Normal: POF normal gives the clearest
Recognition. With production OCR in mind, on-screen display, is searchable, and yet with
Prime OCR licenses leading recognising significantly smaller file size than Image+Text.
engine and passes the data through several The result is not, however, an exact replica of
~f them using voting technology along with the scanned page. While all graphics and
artificial intelligent algorithms. The use of formatting are preserved, substitute fonts may
multiple OCR engines improves the result be used where direct matches are not
achieved by a single engine by 65-80 %. The possible. It is a good choice when files need
technology is available at price depending to be posted to the web or otherwise
delivered online. If, during the capture and
upon number of search engine that one would
OCR process, a word cannot be recognised
like to incorporate. Michigan Digital Library
to the specified confidence level, capture, by
production services used Prime OCR for default, substitutes it with a small portion of
placing more 'th$n two million pages of the original bitmap image. Capture's 'best
SGML-encoded text and the same number of guess' of the suspect word lies behind the
page images on the web. bitmap so that searching and indexing are still
.--- ---. .- --. --

12 DESlDOC Bulletin of lnf Technol,2001,21(6)


possible. However, one cannot guarantee that 5.2 Indexing
these bit-mapped words are correctly
guessed. In addition, the bit-map is somewhat Indexing of a document converted into an
obtrusive, detracting from the 'look' of the image or text file is the second step in the
page. Further, capture provides option to process of document imaging. The process of
correct suspected errors left as bit-mapped indexing scanned image ~nvolveslinking of
image or leave them untouched. database of scanned images to a text
database. Scanned images are just like a set
of pictures that need to be related to a text
A classic solution of this kind would database describing them and their contents.
comprise of keying-in the data and its An imaging system typically stores a large
verification. This involves a complete keying amount of unstructured data in a two file
of the text, followed by a full rekeying by a system for storing and retrieving scanned
different operator, the two keying-in operation images. The first is traditional file that has a
might take place simultaneously. The two text description of the image along with a key
keyed-in files are compared and any errors or to a second file. The second file contains the
inconsistencies are corrected. This would document location. The user selects a record
guarantee at least 99.9% accuracy, but to from the first file using a search algorithm.
reach 99.955% accuracy level, it would Once the user selects a record, the
normally require full proof-reading of the application keys into the location index, finds
keyed-in files, plus table lookups and the document and displays it.
dictionary spellchecks. Most of the document imaging software
packages. through their menu driven or
5. STEPS IN THE PROCESS OF
command driven interface, facilitate elaborate
DIGITISATION indexing of documents. While some DMS
The following four steps are involved in the facilitate selection of indexing terms from the
process of digitisation: scanning, index~ng, image file, others allow only manual keying in
storage, and retrieval. Software, variably of indexing terms. Further, many OMS
called document image processing (DIP), packages provides OCR capabilities for
electronic filing system (EFS) and document transforming the images into standard ASCII
management systems (DMS) provides all or files. The OCRed text then serves as a
more of these functions: database for full text search of the stored
images.

5.1 Scanning 5.3 Storage


The process involves acquisition of an The most tenacious problem of a
electronic image through its original that may document image relates to its file size and,
be a photograph, text, manuscript, etc. into therefore, to its storage. Every part of an
the computer using an electronic image electronic page image is saved regardless of
scanner. An image is read or scanned at a presence or absence of ink. The file size
predefined resolution and dynamic range. The varies directly with scanning resolution, the
resulting file, called 'bit-map page image' is size of the area being digitised, compression
formatted and tagged for storage and ratio, content and the style of graphic file
subsequent retrieval by the software package format used to save the image. The scanned
used for scanning. Fax card, electronic images, therefore, need to be transferred
camera or other imaging devices may also from the hard disc of scanning workstat~onto
employed for acquisition of an image, an external large capacity storage devices
however, image scanners are most important such as an optical disc, CO-ROMfDVD-ROM
and most commonly used component of an disc, cartridge tapes, etc. While the smaller
imaging system for transfer of normal document imaging system use offline media,
paper-based documents into an image. which need to be reloaded when required, or
DESIDOC Bulletin of Inf Techno/,2001.21(6) 13
fixed hard disc drives allocated for image characterised by multiple core and peripheral
storage, larger DMS use auto-changers such systems that includes:
as optical jukeboxes and tape library systems. n Hardware (scanners, computers data
The image storage device may be either storage and data output peripherals)
remote or local to the retrieval workstation
n Software (image capturing, data
depending upon the imaging systems and
compression)
DMS used.
n Network (data transmission)
5.4 Retrieval a Display technologies.
Once scanned images and OCRed text Digital images, also called 'bit-mapped
documents have been saved as a file, a page image' are 'electronic photographs' or
database is needed for selective retrieval of set of bits or pixels (picture elements)
data contained in one or more fields within represented by '0"and '1'. A bit-mapped page
each record in the database. Typically, a image is a true representation of its original in
document imaging system uses at least two terms of typefaces, illustrations, layout and
files to store and retrieve documents. The first presentation of scanned documents. As such
is traditional file that has a text description of information contents of 'bit-mapped page
the image along with a key to the second file. image' cannot be searched or manipulated
The second file contains the document unlike text file documents (or ASCII).
location. The user selects a record from the However, an ASCII file can be generated from
first-file using a search algorithm. Once the a bit-mapped page image using an optical
user selects a record, the application keys character recognition (OCRJ software such as
into the location index, finds the document Xerox's TextBridge and Caere's Omnipage.
and displays it. Most of the DMS provide The quality of digital image can be monitored
elaborate search possibilities including use of at the time of its capture by the following
Boolean and proximity operators (and, or, not) factors:
and wild cards. Users are also allowed to Bit depthldynamic range
refine their search strategy. Once the required
Resolution
images have been identified their associated
8 Threshold
document image can quickly be retrieved
from the image storage device for display or Compression
printed output. Image enhancement.

6. DIGITAL IMAGING 6.1 Bit Depth or Dynamic Range


TECHNOLOGY The number of bits used to define each
Digital imaging IS the process of converting pixel determines bit depth. The greater the bit
paper documents including text, graphics, or depth, the greater the number of gray scale or
pictures into digital images that can be made colour tones that can be represented.
accessible over electronic networks. A digital 'Dynamic range: is the term used to,express
the full range of total variations, as measured
image, in turn, is composed of a set of pixels
(picture elements), arranged according to a by a densitometer between the lightest and
pre-defined ratio of columns and rows. An darkest portion of a document. Digital images
can be captured a1 varied density of bits per
images file can be managed as regular
computer file and can be retrieved, printed pixel depending upon: (a) the nature of
and modified using appropriate software. source material or document to be scanned;
Further, textual images can be OCRed so as (b) target audience or users; and (c)
capabilities of the display and print subsystem
to make its contents searchable. Digitat
imaging is an inter-linked system of hardware, that are to be used. Bitonal or black & white
or binary scanning is generally employed in
software image database and access
libraries to scan pages containing text or the
sub-system with each having its own
drawings. Bitonal or binary scanning
- .. .- .. . - . .-. Digital
components. . .-- - -- .- systems
.- are
--
14 DESlDOC Bulletin of Inf Techno!,2001, 21 (6)
represents one bit per pixel (either '0' (black) 6.2 Resolution
or '1' (white). Gray scale scanning is used for
reliable reproduction of intermediate or The number of pixel in a given area
continuous tones found in black & white defines the resolution of an image. It is
photographs to represent shades of grey. measured in terms of dot per inch (dpi) in
Multiple numbers of bits ranging from 2-8 are case of an image file and as ratio of number
assigned to each pixel to represent shades of of pixel on horizontal line x number of pixel in
grey in this process. Although each bit is vertical lines in case of display resolution on a
either black or white, as in the case of bitonal monitor. Higher the dpi set on the scanner
images, but bits are combined to produce a the better the resolution and quality of image
level of grey in the pixel that is black, white or and larger the image file.
somewhere in between. Colour scanning can Regardless of the resolution, image quality
be employed to scan colour photographs. As of an image can be improved by capturing an
in the case of grey-scale scanning, multiple image in greyscale. The add~tionalgray-scale
bits per pixels typically 2 (lowest quality) to 8 data can be processed electronically to
(highest quality) per primary colour are used sharpen edges, fill-in characters, remove
for representing colour. Colour images are extraneous dirt, remove unwanted page
evidently more complex than grey scale strains or discoloration, so as to create a
images, because, it involves encoding of much higher quality image than possible with
shades of each of the three primary colours, binary scanning alone. A major draw back in
i.e. red, green and blue (RGB). If a coloured gray scale is the large amount of data
image is captured at 2 bits per primary colour, captured. Continuing increase in resolution
each primary colour can have 2' or 4 shades will not result in any appreciable gain in image
and each pixel can have 44 shades for each quality after some time, except for increase in
of the three primary colour. file size. It is thus important to determine the
Evidently, increase in bit depth increases point at where sufficient resolution has been
the quality of image captured and the space used to capture all significant details present
required to store the resultant image. in the source document.
Generally speaking, 12 bits per pixel (4 bits The black and white or bitonal images
per primary colour) is considered minimum (textual) are scanned most commonly at 300
pixel depth for good quality colour image. dpi that preserve 99.9% of the information
Most of today's colour scanner can scan at contents of a page and can be considered as
24-bit colour (8 bit per primary colour). adequate access resolution. Some
Table 1 provides binary calculation for no presewation projects scan at 600 dpi for
of shades represented by bit depth. better quality. A standard SVGANGA monitor
has a resolution of 640 x 480 lines while the
SI. No. No. of No. of No. of ultra-high mon~torshave a resolution of about
No. of bitsl shades shadeslpixel 2048 x 1664 (about 150 dpi).
Bits shade
1 1 1 1 1 6.3 Threshold
2 2 2 22=4 43=@ The threshold setting in bitonal scanning
3 4 3 23=8 a3=512 defines the point on a scale, usually ranging
4 8 4 2'=16 163=40$6 from 0 to 255, at which gray values will be
5 16 5 25=32 323=32768 interpreted as black or white pixels. In bitonal
scanning, resolution and threshold are the
6 32 6 26=64 643=262144
key determinants of image quality. Bitonal
7 64 7 2'= 128 1 283=2097152
scanning is best suited to high-contrast
8 128 8 2'=256 25€i3=167772 16 documents, such as text and line drawings.
Table 1: Binary Calculation for No. of Shades For continuous tone or low contrast
Represented by Bit Depth documents such as photographs, gray scale
or colour scanning is required. In gray
DESIDOC Bulletin of Inf Technol,2001,21(6) 15
scalelcolour scanning both resolution and bit reasons. It is thus suggested that scanned
depth combine to play significant roles in images should be either stored as
image quality. uncompressed images or at the most as
lossless compressed images. Further, it is
6.4 Compression best to use one of the standard and widely
lmage files are evidently larger than supported compression protocols than a
textual ASCII files. Lt is thus necessary to proprietary one, even if it offers efffcient
compress image files so as to achieve compression and better quality. Attributes of
economic storage, processing and original documents may also be considered
transmission over the network. A black & while selecting compression techniques. For
white image file of a page of text scanned at example ITU G 4 is designed to compress
300 dpi is about 1 MB in size where as a text text where as JPEG, GIF and ImagePAC are
file containing the same information is about designed to compress pictures. It is important
2-3 KB. Image compression is the process of to ensure migration of images from one
reducing size of a image by abbreviating the platform to another and from one hardware
repetitive information such as one or more media to another. It may be noted that highly
rows of white bits to a single code. The compressed files are more susceptible to
compression algorithms may be grouped into corruption than unwmpressed files.
the following two categories:
6.5 Compression Protocols
6.4.1 Lossless Compression
Following protocols are commonly used
The tzonversion Process converts repeated for bitonal, gray scale or colour compression:
information as a mathematical algorithm that
6.5.f ITU-G4
can decompressed without loosing - any- details
into the original image with absolute fidelity. ITU G4, a standard developed by
No information is 'lost' or 'sacrificed' in the International Telecommunication Union (ITU),
process of compression. Lossless is considered as de facto standard
compression is primarily used in bitonal compression scheme for storing black & white
images. bitonal images. An image created as a TlFF
6.4.2 Lossy Compression and compressed using ITU-G4 compression
technique is called a Group4 TlFF or TlFF
Lossy compression process discards or G-4. TlFF G 4 is a lossless compression
averages details that are least significant or scheme. Joint Bi-level lmage Group (JBIG)
which may not make appreciable effect on the (ISO-11544) is another standard compression
quality of image. This kind of compression is technique for bitonal images.
called 'lossy' because when the image
compressed using 'lossy' compression 6.5.2 JPEG (Joint Photographic
techniques, is decompressed, it will not be an Expert Group)
exact replica of the original image. Lossy JPEG (Joint Photographic Expert Group)
compression is used with gray scale/colour is an ISO-10918-1 compression protocol that
scanning, and in particular with complicated works by finding areas of the image that have
images. same tone, shade, colour or other
Compression is necessary in digital characteristics and representing this area by
imaging but more important is the ability to a code. Compression is achieved at loss of
output uncompressed t h e replica of images. data. It is generally observed that
This is especially important when images are compression of about 10 or 15 to one can be
transferred from one platform to another or achieved without visible degradation of image
are handled by software packages under quality.
different operating system. 6.5.3 LZW (Lenpel-Ziv Welch)
Uncompressed images often work better LZW compression technique uses a
than compressed images for different table-based lookup algorithm invented by
16 DESIDOC Bulletin of Inf Techrwl,2001.21(6)
Abraham Lempel, Jacob Ziv, and Terry Wavelet compression can take a longer
Welch. Two commonly-used file formats in time to compress images due to the complex
which LZW compression is used are the mathematics involved. However, the time
graphics interchange format (GIF) and the tag required to decompress wavelet or fractally
image file format (TIFF). L N V compression is encoded image is usually comparative to
also suitable for compressing text files. A decompressing a JPG image. Wavelet
particular U W compression algorithm takes compression is used in varied digital
each input sequence of binary digit of a given applications, photographic images, audio &
length (for example, 12 bits) and creates an video recordings, 2D & 3D rendering,
entry in a table (sometimes called a multimedia, finger-prints imaging, medical
'dictionary' or 'codebook') for that particular bit imaging (radiography MRI, etc.), satellite and
pattern, consisting of the pattern itself and a remote sensing imaging, GIs, maps &
shorter code. As input is read, any pattern document imaging. Multi-resolution Seamless
that has been read before results in the lmage Database (MrSID) from Lizard Tech
substitution of the shorter code, effectively uses wavelet compression technique.
compressing the total amount of input to (b) Fractal Compression
something smaller. The decoding program
that uncompresses the file is able to build the Mathematically, a fractal describes a
table itself by using the algorithm as it structure that has many repeated forms
processes the encoded input. regardless of scale. Fractal Compression
works by using variety of methods to identify
6.5.4 Wavelet and Fractal features within an image and then breaking
Compression down the image into mathematically modeled
Fractal, wavelet and flaxpix file formats series of repeating shapes and patterns.
offer advantages for providing access to Compression ratio of upto 250:l can be
digital images of oversized materials on the achieved using fractal compression
web. Unlike JPEG and LZW compression that depending upon the image being
maintain images as array of pixels, wavelet compressed. Fractal compression take longer
and fractal compression convert images into time to compress images than wavelet
mathematical models in order to save s t o ~ g e compression but the decompression is
space. Both compression techniques are relatively quick.
lossy. Some applications are combination of
both wavelet and fractal compression. 6.6 lmage Enhancement
(a) Wavelet Compression lmage enhancement process can be used
to improve scanned images at a cost of
Wavelet compression is a method of image authenticity and fidelity. The process of
mathematical modeling of images that breaks image enhancement is, however, time
the image down into small waves that consuming. It requires special skills and
represent the frequency analysis of a would invariably increase the cost of
function. The shapes and patterns in an conversion. Typical image enhancement
image are identified and then described using features available in a scanning or image
mathematical functions or formulas. The editing software include filters, tonal
function that models or describes the image is reproduction, curves and colour
contained within the compression and management, touch, crop, image sharpening,
decompression software. Wavelet contrast, transparent background, etc. In a
compression is very efficient, with ratio upto page scanned in grayscale, the textiline art,
50:1, depending upon the images being and half tone areas can be decomposed and
compressed. In comparison JPEG images are each area of the page can be filtered
usually compressed between 10: 1 & 20: 1 and separately to maximize its quality. The text
LZW around 2: 1. area on page can be treated with edge
sharpening filters so as to clearly define the

DESIDOC Bulktin ofInf Technoi,2001,21(6)


character edges, a second filter could be Annexure - 1 lists file formats used for digital
used to remove the high-frequency noise and images.
finally another filter could fill in characters. It would be appropriate to store a high
Gray scale area of the page could be resolution image as a TlFF master (archival
processed with different filters to maximize format) and distribute the image as JPEGI
the quality of the halftone. GIF file (access' format). Soffware are. now
available that would generate a JPEG or GIF
7. FILE FORMATS files 'on the fly' from a master TIFF file.
The digitally scanned images are stored in
a file as a bit-mapped page image, 8. TOOLS OF DlGlTlSATlON
irrespective of the fact that a scanned page An image scanning system may consist of
contains a photograph, a line drawing or text. a stand-alone workstation where most or all
The scanned image can be formatted and the work is done on the same workstation or
tagged in dozens of different formats to as a par! of a network of workstations with
facilitate easy storage and retrieval depending imaging work distributed and shared amongst
upon the scanner and its software. National various workstations. The network usually
and international standards for image-file includes a scanning station, a server and one
formats and compression methods exist to or more editing, and retrieval stations. A
ensure that data . will be interchangeable typical scanning workstation for a small,
amongst systems. An image file stores production level project, could consist of the
discrete sets of data and information allowing following:
a computing system to disptay, interpret and
B Microcomputer-Pentium IIIIPentium IV
print the image in a predefined format. An
8 Scanner and scanning software
image file format consists of three distinct
components, i.e., header which stores Storage system
information on file identifier and image Network
specifications such as its size, resolution, Display system
compression protocols, etc.; image data Printer.
consisting of look-up table and image raster;
and lastly footer that signals file termination This section will concentrate on scanners
information. While bit-mapped portion of a and scanning software as important
raster image is standardised, it is the file components of scanning workstation.
header that differentiate one format from 8.1 Scanners
another. The display software of a raster
image picks up the details, like resolution, Digital scanners are used to capture a
compression technique, etc., from the file digital image from an analogue media such as
header and displays an electronic replica of printed page or a microfichelmicrofilm at a
the original page. File formats also define the predefined resolution and dynamic range (bit
compression protocol used for compressing range). There are following two types of
or decompressing an image. image scanners:
Most of the scanning sohare allow saving 8.1.1 Vector scanners
of scanned images in a number of formats. These scanners scan an image as a
TlFF (Tagged Image File Format) is the most complex set of x, y coordinates. Vector
commonly used file format and is considered images are generally used in geographical
de facto standard for bitonal scanning. TlFF is information systems (GIs). The display
a truly multi-platform protocol and is a good soitware for the vector image interprets the
candidate for scanning projects. Some image image as function of coordinates and other
formats are proprietary, developed and included information to produce an electronic
supported by a commercial vendor and replica of the original drawing or photograph.
require a specific software or hardware for Vector images can be zoomed in portion to
displaying the printing scanned images.
18 DESIDOC Bulletin of Inf Tihnol,2001,21(6)
display minute details of a drawing or a map. scanners can accept both reflective as well as
Maps, engineering drawings, and transparent media over a glass platen.
architectural blueprints are often scanned as 8.1.4 Sheet feed Scanners
vector images.
In a sheet-feed scanner, as is indicated in
8.1.2 Raster scanners the name, document is fed over a stationary
Raster images are captured by raster CCD and light source via roller, belt, drum, or
scanners by passing lights (laser in some vacuum transport. Like flat-bed scanner,
cases) down the page and digitally encoding sheet-feed scanner also have optional
it row by row. Multiple passes of lights may be attachment to auto feed uniformed-sized
required to capture basic colours in a stacks of documents to be scanned.
coloured image. Raster scanners are used in Example: Kodak 500S, Tangent
libraries to convert printed publications into CCS300-SF.
electronic forms. Majority of electronic
imaging system generate raster images.
8.1.5 Drum Scanners
The scanners used for digitising analogue
images into digital images come in a variety Source material, in a drum scanner is
of shapes and sizes. There are following wrapped on a drum, which is then rotated
types of scanners: past a high-intensity light source to capture
the image. They offer superior image quality,
Flatbed Scanners-right angle, prism and
overhead flatbed but require flexibte source material of limited
sized that can be wrapped around the drum.
Sheet-Feed Scanners
They are specially targeted for graphic art
Drum Scanners market as they offer highest resolution for
Digital Cameras grey scale and colour scanning. Drum
Slide Scanners Scanner use Photo-Multiplier (Vacuum)
8 Microfilm Scanner Tubes (PMTs) instead of CCDs, which offer a
Video Frame Grabber. greater bit depth (12 to 16 bits).
The type of scanner selected for an Example: Juno CP-4000, Scan graphics
imaging project would be influenced by the ScanMate5000, ColorGetter .
type, size and source of documents to be 8.1.6 Digital Cameras
scanned. Many scanners can handle only
Digital cameras mounted on copy cradle
transparent material, whereas others can
resemble microfilming stand. Source material
handle only reflective materials.
is placed on the stand and the camera is
8.1.3 Flatbed Scanners cranked up or down in order to focus the
Flatbed scanners are most common and material within the field of view. Digital
look like photocopiers and are used in much cameras are most promising scanner
the same way. Source material is placed face development for library and archival
down for scanning. The light source and CCD applications.
move beneath the platen, while the document Example: Zeutschel Ominiscan 3000,
remains stationary as in the case of Minolta PS 3000.
photocopying machine. Flatbed scanner
8.1.7 Siide Scanner
comes in various models like right-angle,
prism and planetaryloverhead to handle ' Slide scanners have a slot in the side to
bound volumes and books. accommodate a 35mm slide. Inside the box
the light passes through the slide to hit a CCD
Examples: HP ScanJet 6300C, Ricoh
array behind the slide. Slide scanners can
IS420, Bell & Howell 1000FB Fujitsu 3097G,
generally scan only 35mm transparent source
Minolta PS3000, Xerox Oocu CS620. Flatbed
materials.

DESIDOC Bulletin of Inf Tmhnol.2001 ,21(6)


Example: Kodak PCD Scanner 4045, Data Scan, (Stacks Sohare Pvt. Ltd.)
Nikon LS3510, Leaf Systems Leafscan 45. http:/www.stex.com/
8.1.8 Microfilm Scanner OmniDocs (NewGen)
http:lhrvww.newgensoft.com/
Specially targeted to librarylarchival
application, microfilm scanners have adapters 9. ORGANlSlNG DIGITAL IMAGES
to convert roll film, fiche, and aperture cards A disc full of digital images without any
in the same model. organisation, browse and search options may
Example: Mekel MSOOXL, SunRise have no meaning except for one who created
SRI-50, Lenzpro 2000 Multimedia Digitiser. it. Scanned images need to be organised in
order to be useful. Moreover, images need to
8.1.9 Mdeo Frame Grabber or Video
be linked to the associated metadata to
Digitiser facilitate their browsing and searching. The
Video digitisers are circuit boards placed following steps describe process of organising
inside a computer, attached to a standard the digital images:
video camera. Any thing that is filmed by the
(a) Hierarchical storage
video camera is digitised by the video
digitiser. Organise the scanned image files into disc
hierarchy that logically maps the physical
8.2 Image CapturinglScanning organisation of the document. For example, in
Sohare a project on scanning of journals, create a
folder for each journal, which, in turn, may
The process of converting a paper have folder for each volume scanned. Each
document into a computer-processible digital volume, in turn may have a subfolder for each
image is done using a software variably called issue. The folder fur each issue, in turn, may
document imaging system, electronic filing contain scanned articles that appeared in the
system or DMS, etc. Scanning softwares also issue along with a content page, composed in
come with the scanners. Important document HTML providing links to articles in that issue.
imaging soflware are:
(b) Nomenclature
Altris Software
http://www.aItris.com/ Name the scanned image files in a strictly
OPTlX controlled manner that reflect their logical
http://www.blueridge.com/ relationship. For example, each article may
be named after the surname of first author
Documentum
followed by volume number and issue
http://www.documentum.corn/
number. For example, file name
Poweroffice 'smithrkamv5nl.pdf' conveys that the article is
http://~ww.expower.com/ by 'R.K. Smith' that appeared in volume 5 and
FileNet issue no.1 of Applied Mechanics. The file
http://www.fiIenet.com/ name for each article would, therefore,
LAVA Systems convey a logical and hierachial organisation
http://www.lavasys.com/ of the journal.
NovaManage (c) Description
http:lIwww.novasoft.corn1
Describe the scanned images file
LiveLink
http://www.opentext.comllivelink/index.html internally, using image header and externally
using linked descriptive metadata files. Most
Docs Open software packages provide for storing
http:llwvuw.pcdos.com1 'administrative data' regarding image, i.e.,
Parlance Ambassador date of creation, format type & version,
http:l/www.xyvision.com/ compression technique, name of creator, etc.
DMS Products from India in the image header. 'Structured' or
20 DESIDOC Bulletin of Inf Technol, 2001,21(6)
'descriptive' metadab for images are the (b) Nature extraction: Can recognise a
keywords assigned io each image. character from its structure and shape
(d) Table of contents (angles, points, breaks, etc.) based on a
set of rules. The process claims to
The simplest and most effective method recognise all fonts.
for providing access is through a table of
contents and links each item to its respective
(c) Structural analysis. Determinescharacters
objectlimage. Content pages of issues of
journals, done in HTML, would offer browsing
on the basis of density gradations or
facility. full text search to HTML pages or character darkness.
OCRed pages can. be achieved by installing (d) Neural networking: Neural networking is a
one of the free internet search engines like: form of artificial intelligence that attempts
to mimic processes of the human mind.
Oingo Free Search
http:I/www.oingo.wmloingo~free~search/prod Combinedwith traditional OCR techniques
ucts.html; plus pattern recognition, a neural
network-based system can perform text
Swish-E
recognition and learn from its success and
http:llwww.berke\ey.edulSWSH-€I
failure. Referred to as intelligent character
Whatyouseek recognition (ICR), a neural network-based
http:Ilintra.whatuseek.com/ system is being used to recognise
Excite hand-written text as well as other
http://exlcite.corn/ traditionally difficult source material.
Neural network ICR can contemplate
characters in the context of an entire word.
Large scanning projects would, however, New ICR combines neural networking with
require a backend database storing images fuuy logic.
or links to the images, metadata (descriptive1
administrative). Backend database used by
most DMS holds the functionality required by fO. Hybrid Solution
most web applications. important document It would be imperative to consider
management systems like File Net have-now producing a more reliable media like microfilm
integrated their database with HTML simultaneous to the process of digitisation if
conversion tools. Further, some of the OMS archival preservation is one of the objectives.
have also signed up with Adobe to The process of microfilming produce a
incorporate Acrobat and Acrobat capture into high-resolution image on the microfilm1
their web-based DMS. These databases microfiche that equates to approximately
entertain queries from users through 'HTML 1000 dpi in digital binary scanning. In
forms' ahd generate search results on the fly. comparison, a bitonal digital image can at
best be scanned at 600 dpi for archival
OCR does not actually convert an image
into text but rather creates a separate file storage.
containing the text while leaving the image in
tact. There are four types of OCR technology The microfi\m/microfiche can be used for
that are prevailing in the market. These conversion to electronic image. format.
technologies are: matrix matching, feature Moreover, future improvements in scanning
extraction, structural analysis and neural technology can be utilised by rescanning a
network. microfilm to obtain high-reso!ution images. It
(a) Matrixrremplate matching: Compares is expected that ultimately electronic scanning
each character with a template of the same wit1 reach or exceed photographic quality.
character. Such a system is usually limited Durability and reliability of computers, storage
to a specific number of fonts, or must be media and formats used for electronic image
files may also increase and stabilise.
'taught' to recognise a particular font.
DESfDOC Bulletin of lnf Technol.22001,21(6) 21
11. CONCLUSION components, network interfaces, etc. The
librarian should be aware of constant threat of
N t h the technology available at an 'techno-obsolescence' and transitory
effordable cost, the libraries can initiate small standards. Magnetic and optical discs as a
digitisation projects as individual library or as physical media are re-engineered to store
a group of libraries. Building-up digital more and more data. There is a constant
collection and infrastructure required to threat of backward compatibility for the
access them is a challenge that evei$'library products that were used in the past. Acquiring
has to deal with. an imaging system with all its peripherals
Recent growth and development in digital capable of enhancing access to the library
libraries can broadly be grouped into three resources is quite a simple task specially now
broad categories, i.e., (a) digitisation of that the cost of all computing systems and
.
traditional printed-resources and their peripherals are crashing down. However, the
integration with secondary sources of libraries must be concerned with all aspects
information like Medline. Most of the of imaging technologies, as well as new
secondary services now incorporate link to demands that the technology will place on
the full text articles on the publisher's site. them as an organisation.
Developments of platforms l i e Crossref, Web Digital images will have to be constantly
of Science, Silver Linker, etc. are indication of migrated converted to new formats computing
trends towards gradual merger of full text devices, storage media and software as new
resources and secondary services; (b) digital forms of storage devices, updated formats
library derived from the datafiles generated as and software and improved computing
a byproduct in the process of printing of systems are released. lnspite of the fact that
primary journals or major reference works. this constant migration conversion of digital
Most of the publishers of STM journals have images would not only be expensive it would
launched their digital collections consisting of also direct valuable manpower resources into
electronic counter-parts of . their printed a constant re-invention of wheel, it would be
journals; and (c) new information sources that essential to do so otherwise valuable images
are created specially for the web imbibing data would be left behind an obsolete
frills of technological advances. The numbers machinery which will eventually break down
and variety of resources in the last category rendering data inaccessible.
are expected to increase as the users
respond to technologically advanced products
and services. The meaning of digital library is 1. Association of Research Libraries.
in the process of evolution as it climb along Proceedings of 126th ARL Annual
the technological ladder. It can thus be Meeting, 1995.
concluded that further advances in the area of http:/larl.cni.org/arllproceedings/l26/2-
digital libraries would rest upon further defn.html
technological developments as well as
conceptual foundation that is being laid down 2. ~esser, H & Trant, J. Introduction to
presently. imaging: Issues in construction of an
image database. Gery Art History
Unlike microfilming, digital imaging Information Program, Santa Monica,
technologies present a preservation solution 1995.48 p.
for the documents in libraries with increased
3. Boteler, J. E-mail dt Feb. 28, 2001 to
accessibility over the electronic networks,
listserv:
highquality output, OCRed text, electronic
DIG-REF@LISTSERV.SYR.EDU
links to individual pages, etc. However,
imaging technologies are still in a continuous 4. Conway, Paul. Preservation in digital
flux of change. New standards and protocols world. Microform and Imaging Review,
are being defined on a regular basis for file 1997, 25(4), 156-71.
formats, compression techniques, hardware
22 DESlDOC Bulletin of Inf Technol,2001,21(6)
Cox, John E.. Publishers, publishing and worldwide. Communications of the ACM,
the Internet: how journal publishing will 1998, 41(4), 33-43.
survive and prosper in the electronic age. Payette, S.; B m i , C.; Lagoze, C.; &
Nectmnic Library, 1997, 15(2), 125-31. Overly, E.A. Interoperability for digital
Davis, Eric T. An overview of the access objects and repositories. D-Lib Magazine,
and preservation capabilities in digital 1999, 5(3).
technology. http://www.dlib.or~$lib/May99/payette/05
http:Ilwww.iwaynet.net/-lscildiglibldigpapf payette.html
f.html Rusbridge, Chris. Towards the hybrid
Fox, E. & Marchionini, Gary. Towards a library. D-Lib Magazine, JulylAugust;
worldwide digital library. Communication 1998.
of the ACM, 1998, 41(4), 29-32. http:llwww.dlib.org/dIib/july98/rusbridge.
Guthrie, Kevin M. JSTOR: From project html
to independent organisation. D-Lib School of Scanning, 3-5 Nov., 1997,
Magazine, JulylAugust, 1997. New York. Papers and presentations.
http:/lwww.dlib.org/dlib/july97/07guthrie. Northeast Document Conservation
html Center, Andaver, MA, 1997.
Hulser, Richard P. Digital library: Content Science Server Software. Products and
preservation in a digital world. DESIDOC services from a company dedicated to
Bulletin of Information Technology, 1997, innovative devetopment of digital library
17(6), 7-14. software. Scienceserver LLC, McLean,
Lynch, Clifford & Hector, Garcia-Molina. 1999.
llTA Digital Library Workshop, Reston, http:I/www.scienceserver.corn/
VA, 18-19 May 1995. Sloan, Bernard G Services perspectives
Kenney, A.R. & Chapman, S. Digital for the digital library: Remote reference
imaging for libraries and archives. Cornell services. Library Trends, 1998,47(2).
University, New York, 1996. 198 p. Smith, T.R. Meta information in digital
Marchionini, G; Plaisant, C. & Komlodi, libraries. Int.J.Digital Libraries, 1997, 1,
A. Interfaces and tools for the Library of 105-07.
Congress National Digital Library Waters, D. Electronic technologies and
Program. lnformation Processing and preservation. European Research
Management, 1998, 34(5), 535-55. bbran'es Cooperation, 2(3), 285-93,
Noerr, P. The digital library toolkit. Ed. 2. 1992.
Palo Alto, Sun Systems, 2000. 186 p. W~llis,Don. A hybrid systems approach to
Paepcke, A.; Chang, C-C.K.; preservation of printed materials.
Garcia-Molina, H. & Winograd, T. Commission on Preservation and
Interoperability for digital libraries Access, Washington, D.C., 1992.

Contributor: Dr Jagdlsh Arora is deputy librarian at Central Library, Indian Institute of


Technology (I IT), Delhi.

DESIDOC Bulletin of Inf TwhnoI,2201,21(6) 23


Annexure - I

File Formats in Digital Libraries


Abbreviation Format File Extention
File Format for UnstructuredText
ASCII American Standard Code for Information Interchange
File Format for Structured Text
SGML Standard Generalised Markup Language. sgml
HTMt Hypertext Markup Language .html or .htrn
XML Extended Markup Language .xml
PDF Portable Document Format (Adobe) .pdf
PS Postscript (Adobe) .ps
TEXT Texture Format .txt
File Format for Images
PDF Portable Document Formal .pdf
BMP Bit Map Page (Mndaws) .bmP
IMG Ventura Publisher .img
JPEG Joint PhotographicExpert Group .jp9
JFlF JPEG File Format .jfif
PCP PC Paint (BBW Mode) .PCP
PCX PC Paint Brush {Color & B&W) .pcx
PSD Photoshop .psd
TGA True Vision Targa .tga
PNG Portable Network Graphic .prig
TIFF Tagged Image File Format .tior .tiff
TIFF-G4 Tagged Image File Format with Group 4 Fax Compression .ti
SPlFF Still Picture Interchange File Format. .spf
PCD Photo CD (Kodak) .pa
Web-Compatible lmage File Format
GIF Graphics Interchange Format
JPEG Joint Photographic Experts Group
Audidvideo File Format
WAVE Waveform Audio (Microsoft) .wav
AlFF Audio Interchange Format .aif
JPEG File Format .jff
VoC Creative Voice .VOC
MIDI Musical Instrument Digital Interface .midi
SND Sound .snd
AU Audio (Sun Microsystems) .au
RA Real Audio Format (ProgressiveNetworks) .ra
AVI Audio Visual Interleave .avi
FIA Macromedia Flash Movie .Ra
FLC AutoDesk FLIC Animation .flc
MOV Quicktime for Wlndows Movie .mov
MPEG Motion Picture Expert Group mpeg
MP2 MPEG Audio Layer 2 .mp2
MP3 MPEG Audio Layer 3 .mp3

24 DESIDOC Bulletin of Inf Techno!,200?,21(6)

You might also like