
Digital Library Mega Scanning Centre

IIIT Hyderabad

Vamshi Ambati
Major Objectives of our Centre
 Digitizing books in quantity while maintaining
quality
 Development of core technologies needed
for Digital Libraries
 Knowledge and Experience Dissemination
 Training
 Sharing resources
Progress
 Established centers at Osmania University,
Telugu University, Salarjung Museum
 Content generation at SVDL, CCL, SCL
 Conducted a workshop for sharing
resources and establishing common
standards
 Generated about 32 million pages of content
 Content hosted at http://dli.iiit.ac.in
Effort Distribution of Digitization

 Scanning, 30%
 Image Processing, 20%
 OCR, 10%
 Metadata, 5%
 Web Enablement, Quality Assurance and Identification: 5%, 15% and 15% (pie chart; the three values appear without a clear slice assignment)
Current Status
 170,000 books
 72,000 English books
 18 other languages
 http://dli.iiit.ac.in

 Operations -
 50 scanners
 15 centers
 300 people in all
Language Report
[Chart: RMSC books-wise report. Book counts per language, axis 0 to 80,000.]
Source Library Report
[Chart: RMSC source library books report. Book counts, axis 0 to 30,000, for the source libraries AOU, AP Text Books, CCL-HYD, EPW, FAO, Kansas, OUL, PSTU, State Archive, Salarjung Museum, SCL-HYD, Washington, TTD and Others.]
Scanning Centre Report
[Chart: RMSC scanning location books-wise report. Book counts per scanning centre, axis 0 to 35,000.]
Technologies: Research
 Content Search in Images
 Text Mining
 Cross Lingual Retrieval
 Summarization tools
 UniTrans: Universal Transliteration tool
 Languages: Arabic, Persian, Urdu, Assamese,
Bengali, Tamil, Telugu, Kannada, Malayalam,
Sanskrit, Hindi, Marathi
Technologies: Workflow
 Workflow Tools
 Metadata creation, structural metadata, etc.
 Server management
 Image Processing
 Image Processing tools
 Plug-in
 Server Management Tools
 Digital Library of India Portal
Rare Collections
 50 years of Andhra Pradesh State
Legislature Proceedings (Multilingual data)
 Rare Telugu classics (like Kalidasa’s work)
 Andhra Pradesh State Archive Books (rare
collection as old as 1835)
 State Board of Education textbooks (1st
to 10th grade)
Acknowledgements
 Ajay Pannala, CEO Par Informatics
 C S N Mohan, CEO Thrinaina Ltd
 T N Sreenivas, CEO SV Infosys
 Bhuman Reddy, Director SVDL
 Kiran V K, Planning Director DLI
 Nadendla Manohar, MLA Tenali
 Rajeev Sangal, Director IIIT
 C.V Jawahar, Professor IIIT
Thank you
Workshops held
 Tools and Resources for DLI
 (5th May to 7th May 2005)
 36 participants
 Research Challenges in DLI
 30th December 2006
 100 participants
 Speakers and Dignitaries
 Raj Reddy, Pradeep Chopra, Sunil Alag, Yagna
Narayana among others
Center Specific Technology

Search Similar Images Based on Image Patterns
Problem
 Huge amount of content generated by DLI
 Search the DLI
 Queries are generally text words
 Not all document images can currently be
converted into text
 Can we match words in the image space
by converting the query into an image?
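One way to picture matching in the image space is sketched below in Python. It is an illustration under stated assumptions, not necessarily the centre's method: each word image is a small binary grid (1 = ink), each grid is reduced to a per-column ink profile, and two profiles are compared with dynamic time warping so that differences in word width are tolerated. All function names are hypothetical.

```python
def column_profile(image):
    """Fraction of ink pixels in each column of a binary image (list of 0/1 rows)."""
    height = len(image)
    width = len(image[0])
    return [sum(row[x] for row in image) / height for x in range(width)]


def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def word_image_distance(img_a, img_b):
    """Smaller values mean the two word images have more similar column profiles."""
    return dtw_distance(column_profile(img_a), column_profile(img_b))


def stretch(image, factor=2):
    """Repeat every column, a crude stand-in for a wider rendering of the same word."""
    return [[pixel for pixel in row for _ in range(factor)] for row in image]
```

A text query would first be rendered to an image in some font, then compared against candidate word images with `word_image_distance`; a horizontally stretched copy of a word still aligns at zero cost, while a different shape does not.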
Challenges
Match two word images in the presence of:
 Degradations
 Salt and Pepper noise
 Cuts and Breaks
 Blobs
 Erosion of Boundary pixels
 Print Variations
 Font Type
 Font Size
 Variability due to Language Cases
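Of the degradations listed, salt-and-pepper noise is the simplest to illustrate. The sketch below applies a 3x3 median filter to a binary word image, which is a standard cleanup step; whether the centre's pipeline uses it is an assumption, and `median_filter_3x3` is a hypothetical name.

```python
def median_filter_3x3(image):
    """Replace each interior pixel by the median of its 3x3 neighbourhood.

    `image` is a list of rows of 0/1 values. Border pixels are copied unchanged.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = sorted(
                image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            )
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out
```

Isolated specks and pinholes disappear, but strokes one pixel wide can be erased as well, so a filter like this trades noise removal against the cuts and breaks also named in the list.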
Proposed Solution
Results and Discussion
 Partial Matching

Demo
Core Technologies for Digital
Library
Workflow and Tools
 Workflow Management
 Vendor Progress Tracking
 Report Generation tool
 Server Management Tools
 Server uptime monitoring, Server cluster solution
 Metadata Management tools
 Regular metadata, Structural metadata
 Quality Assurance
 Online metadata verification and correction interface
 Centralized Duplicate Detection tool
 Image quality assurance tool (QualCheck)
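As a rough illustration of what a centralized duplicate check over metadata could look like, the Python sketch below groups book records whose normalized title and author coincide. The record fields (`id`, `title`, `author`) and the matching rule are assumptions made for this example, not a description of the centre's tool.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_duplicates(records):
    """Return lists of record ids that share the same normalized (title, author)."""
    groups = defaultdict(list)
    for record in records:
        key = (normalize(record["title"]), normalize(record["author"]))
        groups[key].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Running the check centrally matters because the same book can arrive from different scanning centres with slightly different metadata.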
Multilingual Information Retrieval
 Cross Lingual Information Retrieval
 Universal Dictionary based
 Query expansion
 Explicit (user feedback) and implicit (word frequency)
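The two steps above can be sketched in Python as follows. This is a minimal illustration under assumptions: the bilingual dictionary is a toy table of romanized words, and implicit expansion is approximated by adding the most frequent words from documents that already match the translated query. The function names are hypothetical.

```python
from collections import Counter


def translate_query(query, dictionary):
    """Map each query word to its dictionary translations; keep unknown words as-is."""
    translated = []
    for word in query.lower().split():
        translated.extend(dictionary.get(word, [word]))
    return translated


def expand_query(terms, documents, top_k=2):
    """Append the top_k most frequent co-occurring words from matching documents."""
    term_set = set(terms)
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        if term_set & set(words):
            counts.update(w for w in words if w not in term_set)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(terms) + [word for word, _ in ranked[:top_k]]
```

Explicit expansion would replace the frequency count with terms the user marks as relevant.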
Automatic Text Summarization
 Summarization system for Telugu
 Frequency based
 Position based
 Most informative sentence identification
 Dictionary lookup
 Approximate String Matching to compensate
for lack of Morph Analyzers
 Stop Word vs. Content Word identification
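A compact Python sketch of the ideas listed above follows. It is illustrative only: the scoring formula, the `position_weight` parameter, the similarity cutoff and the romanized sample words are all assumptions. `lookup_root` uses approximate string matching from the standard `difflib` module to map an inflected word to a dictionary entry, and `summarize` combines a frequency score over content words with a position score that favours early sentences.

```python
import difflib
from collections import Counter


def lookup_root(word, dictionary_words, cutoff=0.8):
    """Return the closest dictionary entry for a (possibly inflected) word, or None."""
    matches = difflib.get_close_matches(word, dictionary_words, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def summarize(sentences, stop_words, num_sentences=1, position_weight=0.5):
    """Pick the highest-scoring sentences and return them in document order."""
    # Keep only content words (everything that is not a stop word).
    tokenized = [
        [w for w in sentence.lower().split() if w not in stop_words]
        for sentence in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    scores = []
    for index, words in enumerate(tokenized):
        freq_score = sum(freq[w] for w in words) / len(words) if words else 0.0
        pos_score = 1.0 / (index + 1)  # earlier sentences score higher
        scores.append(freq_score + position_weight * pos_score)
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```

In a fuller system the approximate lookup would normalize words before counting, so that inflected forms of the same root contribute to one frequency.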
Search and Indexing
 Web Crawler
 Focused Crawling
 Incremental Crawling
 Crawls Telugu, Malayalam and Tamil web pages
 Content Based Image Retrieval
 Addresses queries in multiple formats (sample image
or text)
 Uses features such as color and texture to match images.
 Learns from user feedback.
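To make the color-feature idea concrete, the Python sketch below ranks images by histogram intersection over a coarse RGB histogram. It is a generic textbook approach shown under assumptions: images are plain lists of `(r, g, b)` tuples, texture features and user feedback are left out, and the function names are hypothetical.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram; `pixels` is a list of (r, g, b) in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1.0
    total = len(pixels)
    return [count / total for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means the two normalized histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_images(query_pixels, collection):
    """Return image names from `collection` ordered from most to least similar."""
    query_hist = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return [name for _, name in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

Feedback could then be folded in by re-weighting histogram bins according to which results the user accepts.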
Search and Indexing
 ITRANS based search for DLI servers
 Search on actual content as opposed to
search on metadata
 Ability to extend for multiple languages
 Lets users query in their native languages; documents
stored in ITRANS are converted to the native script
on the fly
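ITRANS is an ASCII transliteration scheme for Indic scripts, so the on-the-fly conversion described above amounts to a table-driven rewrite. The Python sketch below handles only a tiny subset of ITRANS (six consonants, seven vowels) and targets Devanagari; the subset, the function names and the substring search are assumptions for illustration, and the query is assumed to have already been mapped to ITRANS.

```python
# Tiny illustrative subset of ITRANS mapped to Devanagari code points.
CONSONANTS = {
    "k": "\u0915", "t": "\u0924", "n": "\u0928",
    "m": "\u092e", "r": "\u0930", "l": "\u0932",
}
# vowel -> (independent letter, dependent sign written after a consonant)
VOWELS = {
    "a": ("\u0905", ""),  # inherent vowel: no sign needed after a consonant
    "aa": ("\u0906", "\u093e"),
    "i": ("\u0907", "\u093f"),
    "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"),
    "e": ("\u090f", "\u0947"),
    "o": ("\u0913", "\u094b"),
}
VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant


def next_token(text, pos, table):
    """Longest key of `table` (2 chars, then 1) starting at `pos`, or None."""
    for length in (2, 1):
        token = text[pos:pos + length]
        if len(token) == length and token in table:
            return token
    return None


def itrans_to_devanagari(text):
    out, pos = [], 0
    while pos < len(text):
        consonant = next_token(text, pos, CONSONANTS)
        if consonant:
            out.append(CONSONANTS[consonant])
            pos += len(consonant)
            vowel = next_token(text, pos, VOWELS)
            if vowel:
                out.append(VOWELS[vowel][1])
                pos += len(vowel)
            else:
                out.append(VIRAMA)
            continue
        vowel = next_token(text, pos, VOWELS)
        if vowel:
            out.append(VOWELS[vowel][0])
            pos += len(vowel)
        else:
            out.append(text[pos])  # pass through anything outside the subset
            pos += 1
    return "".join(out)


def search(query_itrans, documents):
    """Match on the stored ITRANS text, then render hits in the native script."""
    return [itrans_to_devanagari(doc) for doc in documents if query_itrans in doc]
```

Because matching happens on the stored ITRANS text, supporting another script only requires another output table, which fits the stated ability to extend to multiple languages.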
Multimodal Multimedia Tools
 Book Reading Interface
 Developed TIFF Plugin (released open source)
 Image Server for ‘on the fly’
 Format conversions
 Resolution conversion
 Thumbnail generation etc
 Speech Interface
 Plugin for IE and Firefox for Reading a Book
 Text-to-Speech system (developed at IIIT using the
CMU Festvox toolkit)
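The image server's on-the-fly resolution conversion and thumbnail generation can be pictured with the dependency-free Python sketch below. It assumes an image is a list of pixel rows and uses nearest-neighbour sampling; a production server would rely on an imaging library and better filters, and the function names here are hypothetical.

```python
def thumbnail_size(width, height, max_side):
    """Scale (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_nearest(image, new_width, new_height):
    """Nearest-neighbour resample of `image` (a list of rows) to the requested size."""
    old_height, old_width = len(image), len(image[0])
    return [
        [
            image[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]
```

Generating derivatives on request means only the master scan has to be stored, at the cost of some computation per page view.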
Tools for Download
 Tools available for download at
http://dli.iiit.ac.in/download.html
