
Digitization Practices in India: Issues and Challenges

V.N. Shukla
C-DAC, NOIDA UNIT
C-DAC MISSION
• Natural Language Processing and Interfaces
• Infrastructure and Support Services
• Human Resource Development in Hitech Areas
• Special Industrial Applications
AREAS OF COMPETENCE: NOIDA
• Graphical Display System
• NLP
• E-Governance & E-Commerce
• Security Systems
• Internet on CATV
• Embedded System
• Solar Energy System
• System Engineering and Consultancy
Digital Library Activities: CDAC Noida

• Digital Library Projects
• Mega Centre for Digital Library
• Mobile Digital Library: Dware Dware Gyan Sampada
• Digital Library at President’s House
• Digital Library at Nagari Pracharini Sabha, Varanasi
• Digital Library at Uttaranchal
• GyanNidhi: Multilingual Parallel Corpus in Indian Languages
• Digital Library at Gujarat Vidyapeeth, Ahmedabad
• Digitization of Libraries
Digital Library Mission
To organize the information and make it universally
accessible and useful.

• Online content: billions of web pages
• Offline content: billions of items still unindexed

DL Initiatives

• Only ~15% of books are in print
• ~85% of books are out of print and/or out of copyright; these
  books are only found in libraries

GOAL: Create a comprehensive virtual card catalog of all
books in all languages, while respecting publishers’ rights
Source: Google
Digital Libraries

[Diagram labels: Users; Search; Index; Hyperlinks; Metadata; DL creation & processes; Traditional Libraries]
A Typical Library Collection

The value is in the middle

• In-Print: ~15%
• Unclear copyright status: ~65% or more
• Public Domain: less than 20%**

Books with unclear copyright status:
• May be in copyright, but not for sale
• Rights may have reverted to author
• May be in the public domain

92% of the world's books are neither generating revenue for the
copyright holder nor easily accessible to potential readers.*

*Source: Covey, Denise Troll. “Global Cooperation for Global Access: The Million Book Project”
**OCLC analysis of the Google Books Library Project: http://www.dlib.org/dlib/september05/lavoie/09lavoie.html
 Digital Library (DL) may be seen as
“Collection of intelligent creations by human
beings through their own language and
culture. It also reflects cultural heritage
besides providing archive and generating
many research issues pertaining to Natural
Language Processing”
Digital Library?
Sun Microsystems defines a digital library as the electronic extension
of functions users typically perform and the resources they access in a
traditional library.

These information resources can be translated into digital form, stored
in multimedia repositories, and made available through Web-based
services.

According to another definition, digital libraries are

“Organizations that provide the resources, including the specialized
staff, to select, structure, offer intellectual access to, interpret,
distribute, preserve the integrity of, and ensure the persistence over
time of collections of digital works so that they are readily available for
use by a defined community or set of communities”.
What is a Digital Library?

• A service? An architecture?
• A set of information resources?
• A set of tools to locate, search and retrieve information?
• Possibly the tools to create such resources and services also fall
  within the purview of DLs
• The digital face of traditional libraries
• Includes both digital and traditional collections
• The backbone and nervous system of libraries
Digital Library vs. Traditional Library

• Efficient, high-quality services through collecting, organizing, storing,
  disseminating, retrieving and preserving information.
• Preservation benefits, besides making information retrieval and delivery
  more convenient.
• Online access to historical and cultural documents whose existence is
  endangered by physical decay.

Digital libraries necessarily include a strong focus on the management of
digital content, just as traditional libraries have long focused on the
management of content in physical forms.
Digital Content Management
Most of the digital content being managed consists of:

• Human language in various forms: character-coded electronic text, scanned
  images, printed or handwritten text, or human speech.
• Language technology helps in managing digital content.
• Learning from past experience also helps in managing content.

The major areas that can be exploited are:

• Information retrieval
• Multimedia
• Databases
• Data mining
• Data warehouses
• On-line information repositories
• Image processing
• Hypertext
• World Wide Web and wide area information services (WAIS)
A few advantages of digital libraries
• Access anywhere
• Reducing delays
• Distributed storage – central access
• Better cataloguing
• Cross references to other documents
• Full text search
• Protected information source
• Wide exploration and exploitation of the information

The information explosion, wide-bandwidth data networks and the potential
of Internet-based technologies such as the Web make digital libraries one of
the important application areas of computer science.
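
Full text search, listed among the advantages above, is commonly built on an inverted index that maps each word to the documents containing it. The following is only a minimal sketch of that idea, not the method used by any library named in this talk. The document IDs and texts are made-up placeholders, and the simple `\w+` tokenizer is an assumption that ignores the script-specific handling real multilingual collections would need.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Placeholder collection (hypothetical titles, for illustration only).
docs = {
    "b1": "Palm leaf manuscripts of India",
    "b2": "Digital library of India",
    "b3": "Birch bark manuscripts",
}
index = build_index(docs)
print(search(index, "manuscripts india"))
```

The query is treated as an AND of its words, so adding words narrows the result set.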
Process of Digital Preservation
[Flowchart; steps as recoverable from the diagram:]
• Centralized book scanning
• Server status check (Yes / No; one branch: reject the book)
• XML meta file creation using the Dublin Core standard
• Scanned image in TIFF format
• Software to divide even & odd pages
• Batch cropping & cleaning
• OCR
• Conversion to TXT/RTF/HTML
• Uploading
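
The process above includes creating an XML meta file using the Dublin Core standard. A minimal sketch of what such a record might look like is given below, using Python's standard `xml.etree.ElementTree`. The namespace URI is the standard Dublin Core element set namespace. The `metadata` wrapper element, the function name and all field values are assumptions made for illustration, not the schema actually used in the projects described here.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return a Dublin Core XML string for one scanned book.

    `fields` maps element names (e.g. "title") to a string or a list
    of strings; list values produce repeated elements.
    """
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values, for illustration only.
record = build_dc_record({
    "title": "Sample Book",
    "creator": ["Author One", "Author Two"],
    "language": "hi",
    "format": "image/tiff",
})
print(record)
```

Keeping the metadata in a separate XML file, rather than inside the image, lets the same record accompany the TIFF archive copy and any derived text or HTML copy.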
Goals of DL
• First-generation digital library
  - Focused on digitization technology, metadata schemes,
    data management techniques, and digital preservation.
• Second-generation digital library
  - Exploring new opportunities and developing new competencies.
• Third-generation digital library
  - Focusing instead on fully integrating digital material into the
    library’s collections through a modular systems architecture.
Ingredients for DLs

• Hardware
  The minimum machinery to do the job
• Software
  The programs for handling data
• Digital Objects
  Articles, Conference Papers, Theses, ……
• Basic Skills
  Things one has to learn
Hardware

• A Server
  - You’ll need access to a web server
• A good PC
• Scanners
  - Flatbed – Auto feed, Back to back
  - MF
  - Book Scanner
Software

• Open Source Software (OSS)
  DSpace, EPrints, Fedora, GSDL……
• Proprietary software you can’t avoid
  Image editing and Optical Character Recognition software
  have to be purchased
Content is King

• The information content is more important than the systems used for
  its storage, management and retrieval
• Objects should not be “locked” in specific DLs or archives
Creating DLs …

• Seven steps
  - Selecting
  - Acquiring
  - Digitization
  - Creation of Metadata
  - Organizing
  - Archiving
  - Providing Access
Possible Delivery Formats

• Pure image formats: TIFF, JPEG
• Open encoded formats: XML, HTML, ASCII, and Unicode
• Hybrid formats: PDF, DjVu – can contain both image and text
• Proprietary formats: Microsoft Word, WordPerfect
Digitization: Issues

• Copyright
• Access copy and archive copy
• File size
• Storage media (CD, hard disk…)
• File format (TIFF, JPEG…)
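
The file size issue can be made concrete with simple arithmetic: an uncompressed scan needs (width in pixels) × (height in pixels) × (bits per pixel) / 8 bytes. The sketch below applies this formula. The page size (8 × 10 inches), the 300 dpi resolution and the 300-page book are hypothetical figures chosen for round numbers, not values from the projects described here, and compressed formats such as JPEG will be considerably smaller.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of one uncompressed page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical page: 8 x 10 inches scanned at 300 dpi (2400 x 3000 pixels).
bitonal = uncompressed_scan_bytes(8, 10, 300, 1)     # black and white
grey = uncompressed_scan_bytes(8, 10, 300, 8)        # greyscale
colour = uncompressed_scan_bytes(8, 10, 300, 24)     # full colour

pages = 300  # hypothetical book length
print(f"bitonal page: {bitonal / 1e6:.1f} MB")
print(f"greyscale page: {grey / 1e6:.1f} MB")
print(f"colour page: {colour / 1e6:.1f} MB")
print(f"colour book of {pages} pages: {pages * colour / 1e9:.2f} GB")
```

The gap between a bitonal and a full-colour scan is one reason to keep a large archive copy separate from a smaller access copy.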
Challenges in Digitization

• Building digital collections of national importance from
  existing texts, documents, images . . .
• Creating new digital documents & linking them
• Subject portals: selecting and maintaining open source
  digital resources
• Developing / adapting management tools for digital
  collections
• Providing access to digital collections
Challenges..

• Integrating digital & other library collections
  - incl. integration of OPACs, subscribed e-resources and
    subject portals
• Establishing services for digital libraries
  - online access & offline support
  - education & training of users and librarians
• Addressing social, legal, policy issues
Challenges in Publishing

• Preservation of layout
• Searchability of content and metadata
• Efficient image compression
• Easy browsing of books
• Accommodating low-bandwidth users
• Multilingual text support
• Multipaging
Digital Library Support in India
Funding
• Ministry of Communication & Information Technology (MIT)
• Ministry of Human Resource Development (MHRD)
• Manuscript Mission of India
• Department of Scientific & Industrial Research (DSIR-TRP)
• All India Council for Technical Education (AICTE)
• University Grants Commission (UGC)

Digital Library Initiatives in India
• Library Consortium in India
• Scholarly Science Journals
• Theses & Dissertations
• Institutional E-Print Archives
• Books (out of copyright)
• Manuscripts
• Newspapers
• Online Courseware
• Open Access at Metadata Level
• Portal and Gateway Services

Government of India

• Min. of C&IT: Universal Digital Library
• Min. of Culture: National Manuscript Library
• Others: CSIR E-Journals Consortium; INDEST-AICTE Consortium;
  UGC Infonet Consortium; FORSA Consortium; IIM Libraries Consortium
Participating centers of DLI
[Map of the Digital Library of India; labels as recoverable:]
• PTU-1, PTU-2, PTU-3
• Rashtrapathi Bhavan
• ERNET
• CDAC Noida
• IIIT-Allahabad
• CDAC Kolkata
• MIDC
• Pune University
• IIIT-H
• State & City Central Library
• University of Hyderabad
• Goa University
• IISc
• TTD Tirupati
• Sringeri Mutt
• Anna University
• IISc, IIAP, Kanchi Mutt
• ASR Melkote
• PoornaPragya
• SASTRA
• AKCE

Mega Scanning Centres at IIITH, IIITA, CDAC-Noida and Kolkata
Digital Library Initiatives in India

Some Examples
Digital Library of India
http://www.dli.ernet.in/

April 20, 2009 Workshop on Institutional Repositories


http://www.ias.ac.in/



http://www.insa.ac.in/



http://medind.nic.in/



Manuscripts
• India has the largest collection of manuscripts in the world
  (approximately 5 million).
• India is the repository of an astounding wealth of ancient knowledge
  belonging to different periods of history, going back thousands of
  years. Most of this knowledge, belonging to different areas of
  intellectual activity such as religion, philosophy, science, arts and
  literature, is preserved in the form of manuscripts. Composed in
  different Indian languages and scripts, they are preserved in materials
  such as birch bark, palm leaf, cloth, wood, stone and paper.
• The National Manuscript Mission was launched as a five-year programme in
  Feb. 2003 by the Ministry of Human Resource Development, Govt. of
  India, to collect and conserve all the manuscripts.
http://namami.nic.in/
Archives of Indian Labour
V.V. Giri National Labour Institute

Heritage of Indian Working Class

Commissions on Labour
Oral History Collections
Trade Union Collections
Regional Collections
Strike Collections
Powered by Greenstone Digital Library
http://www.indialabourarchives.org/
Digital Library Benefits: Individual

• Gain access to the holdings of libraries worldwide through
  automated catalogs. Locate both physical and digitized versions of
  scholarly articles and books.
• Optimize searches; simultaneously search the Internet, commercial
  databases, and library collections.
• Save search results and conduct additional processing to narrow or
  qualify results.
• From search results, click through to access the digitized content
  or locate additional items of interest.

All of these capabilities are available from the desktop or another
Web-enabled device such as a personal digital assistant or
cellular telephone.
Conclusion
• Digital libraries are redefining the role of libraries in society
  and the role of librarians and information specialists
• A national-level mechanism is essential to promote and
  coordinate open access and public domain digital library systems
  - Improve awareness of open access
  - Regular training: tools, processes, standards
  - Support setting up of working models, services
  - National Resource Centre for open access publishing
• International agencies like UNESCO, ICSU, ICSTI and
  CODATA need to actively promote and support developing-country
  initiatives
Thank You