Digital Library Mega Scanning Centre IIIT Hyderabad

Vamshi Ambati
Major Objectives of our Centre
• Digitizing to produce books of quality in quantity
• Development of core technologies needed for Digital Libraries
• Knowledge and Experience Dissemination
  ◦ Training
  ◦ Sharing resources
Progress
• Established centers at Osmania University, Telugu University, Salarjung Museum
• Content generation at SVDL, CCL, SCL
• Conducted a workshop for sharing resources and establishing common standards
• Generated about 32 million pages of content
• Content hosted at http://dli.iiit.ac.in
Effort Distribution of Digitization
[Pie chart: Scanning 30%; Image Processing 20%; OCR 10%; Metadata 5%; Web Enablement, Quality Assurance and Identification make up the remaining 35% in slices of 5%, 15% and 15%]
Current Status
• 170,000 books
  ◦ 72,000 English books
  ◦ 18 other languages
  ◦ http://dli.iiit.ac.in
• Operations:
  ◦ 50 scanners
  ◦ 15 centers
  ◦ 300 people in all
Language Report
[Chart: RMSC BOOKS WISE REPORT; book counts by language, axis 0 to 80,000]
Source Library Report
[Chart: RMSC SOURCE LIBRARY BOOKS REPORT; book counts, axis 0 to 30,000, for AOU, AP TEXT BOOKS, CCL-HYD, EPW, FAO, KANSAS, OUL, PSTU, STATE ARCHIVE, SALARJUNG MUSEUM, SCL-HYD, WASHINGTON, TTD, OTHERS]
Scanning Centre Report
[Chart: RMSC SCANNING LOCATION BOOKS WISE REPORT; book counts by scanning location, axis 0 to 35,000]
Technologies: Research
• Content Search in Images
• Text Mining
• Cross Lingual Retrieval
• Summarization tools
• UniTrans: Universal Transliteration tool
  ◦ Languages: Arabic, Persian, Urdu, Assamese, Bengali, Tamil, Telugu, Kannada, Malayalam, Sanskrit, Hindi, Marathi
Technologies: Workflow
• Workflow Tools
  ◦ Metadata creation, structural metadata, etc.
  ◦ Server management
  ◦ Image Processing
• Image Processing tools
  ◦ Plug-in
• Server Management Tools
• Digital Library of India Portal
Rare Collections
• 50 years of Andhra Pradesh State Legislature proceedings (multilingual data)
• Rare Telugu classics (like Kalidasa’s work)
• Andhra Pradesh State Archive books (rare collection, some as old as 1835)
• State Board of Education textbooks (1st to 10th grade)
Acknowledgements
• Ajay Pannala, CEO Par Informatics
• C S N Mohan, CEO Thrinaina Ltd
• T N Sreenivas, CEO SV Infosys
• Bhuman Reddy, Director SVDL
• Kiran V K, Planning Director DLI
• Nadendla Manohar, MLA Tenali
• Rajeev Sangal, Director IIIT
• C. V. Jawahar, Professor IIIT
Thank you
Workshops held
• Tools and Resources for DLI
  ◦ 5th May to 7th May 2005
  ◦ 36 participants
• Research Challenges in DLI
  ◦ 30th December 2006
  ◦ 100 participants
• Speakers and Dignitaries
  ◦ Raj Reddy, Pradeep Chopra, Sunil Alag, Yagna Narayana among others
Center Specific Technology
Search Similar Images based on Image Patterns
Problem
• DLI generates a huge amount of content
• Users need to search the DLI
• A query is generally a text word
• Not all document images can currently be converted into text
• Can we match words in image space by converting the query into an image?
Challenges
Match two word images in the presence of
• Degradations
  ◦ Salt and Pepper noise
  ◦ Cuts and Breaks
  ◦ Blobs
  ◦ Erosion of Boundary pixels
• Print Variations
  ◦ Font Type
  ◦ Font Size
• Variability due to Language Cases
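The slides do not spell out how two word images are compared, so the following is only a sketch under stated assumptions: a word image is a binary matrix (1 = ink), it is reduced to a sequence of per-column ink fractions, and two sequences are compared with dynamic time warping (DTW), which tolerates differences in character width such as those caused by font size. DTW is a swapped-in, commonly used technique here, and all names and the tiny sample images are hypothetical. The sketch does not address noise, cuts or blobs.

```python
from typing import List, Sequence

Image = List[List[int]]  # rows of 0/1 pixels, 1 = ink (assumed representation)


def column_profile(img: Image) -> List[float]:
    """Fraction of ink pixels in each column of a binary word image."""
    height = len(img)
    width = len(img[0])
    return [sum(img[r][c] for r in range(height)) / height for c in range(width)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two profiles, normalised by their total length."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def stretch(img: Image, factor: int) -> Image:
    """Repeat every column `factor` times to imitate a wider print of the same word."""
    return [[px for px in row for _ in range(factor)] for row in img]


# Hypothetical 3x5 "word images".
word = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
]
other = [
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
]

same_word_wider = stretch(word, 2)
d_same = dtw_distance(column_profile(word), column_profile(same_word_wider))
d_other = dtw_distance(column_profile(word), column_profile(other))
```

In this toy case `d_same` is zero because the stretched copy has the same column pattern, while `d_other` is strictly larger, so ranking candidates by distance would put the wider print of the same word first.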
Proposed Solution
Results and Discussion
• Partial Matching
Demo
Core Technologies for Digital Library
Workflow and Tools
• Workflow Management
  ◦ Vendor Progress Tracking
  ◦ Report Generation tool
• Server Management Tools
  ◦ Server uptime monitoring, Server cluster solution
• Metadata Management tools
  ◦ Regular metadata, Structural metadata
• Quality Assurance
  ◦ Online metadata verification and correction interface
  ◦ Centralized Duplicate Detection tool
  ◦ Image quality assurance tool (QualCheck)
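As an illustration of what a centralized duplicate check over book metadata could look like, here is a minimal sketch. It assumes each record carries `title` and `author` strings, normalises them, and flags pairs whose normalised strings are nearly identical using `difflib.SequenceMatcher` from the Python standard library. The record layout, the 0.9 threshold and the sample records are assumptions made for the example and do not describe the actual tool.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Record = Dict[str, str]  # hypothetical metadata record with "title" and "author"


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text.lower()
    )
    return " ".join(cleaned.split())


def record_key(rec: Record) -> str:
    """Single comparison string built from title and author."""
    return normalise(rec["title"]) + " | " + normalise(rec["author"])


def find_duplicates(records: List[Record], threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return index pairs whose normalised title/author strings are near-identical."""
    keys = [record_key(r) for r in records]
    pairs = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up records, as two centers might have entered the same book.
catalogue = [
    {"title": "Sample Title, Vol. 1", "author": "A. Author"},
    {"title": "sample title vol 1", "author": "A Author"},
    {"title": "Another Book", "author": "B. Writer"},
]
duplicates = find_duplicates(catalogue)
```

The pairwise loop is quadratic in the number of records, so a collection of this size would likely need some blocking or indexing step first; that is left out of the sketch.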
Multilingual Information Retrieval
• Cross Lingual Information Retrieval
  ◦ Universal Dictionary based
• Query expansion
  ◦ Explicit (user feedback) and implicit (word frequency)
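A rough sketch of how dictionary-based query translation and implicit, frequency-based query expansion can fit together is given below. The three-entry romanised Hindi to English dictionary, the stop-word list and the sample documents are toy assumptions for illustration only; the slides do not describe the actual dictionary or ranking.

```python
from collections import Counter
from typing import Dict, List

# Toy bilingual dictionary (romanised Hindi -> English); entries are illustrative only.
DICTIONARY: Dict[str, List[str]] = {
    "pustak": ["book"],
    "itihas": ["history"],
    "bharat": ["india"],
}

STOP_WORDS = {"of", "the", "a", "in", "and"}


def translate_query(query: str) -> List[str]:
    """Replace each source word by its dictionary translations; keep unknown words."""
    terms: List[str] = []
    for word in query.lower().split():
        terms.extend(DICTIONARY.get(word, [word]))
    return terms


def expand_query(terms: List[str], top_docs: List[str], extra: int = 2) -> List[str]:
    """Implicit expansion: add the most frequent new content words from top documents."""
    counts = Counter(
        w
        for doc in top_docs
        for w in doc.lower().split()
        if w not in STOP_WORDS and w not in terms
    )
    return terms + [w for w, _ in counts.most_common(extra)]


translated = translate_query("bharat itihas")
top_docs = [
    "history of ancient india and the mughal empire",
    "india history ancient trade routes",
]
expanded = expand_query(translated, top_docs, extra=1)
```

Explicit expansion would follow the same shape, with `top_docs` replaced by the documents a user has marked as relevant.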
Automatic Text Summarization
• Summarization system for Telugu
  ◦ Frequency based
  ◦ Position based
  ◦ Most informative sentence identification
• Dictionary lookup
• Approximate String Matching to compensate for lack of Morph Analyzers
• Stop Word vs. Content Word identification
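The ingredients listed above can be combined into a small extractive summarizer. The sketch below is an assumption about how they might fit together, not the actual system: it drops stop words, maps inflected forms to dictionary headwords with approximate string matching (`difflib.get_close_matches`) in place of a morphological analyzer, and scores each sentence by average content-word frequency plus a bonus for early position. English text, the headword list, the cutoff of 0.7 and the scoring formula are all illustrative choices made for readability.

```python
import difflib
from collections import Counter
from typing import List

STOP_WORDS = {"the", "a", "of", "in", "is", "are", "and", "to", "for"}
HEADWORDS = ["library", "scan", "book", "page"]  # stand-in for a real dictionary


def stem_by_lookup(word: str) -> str:
    """Map an inflected form to the closest headword, or keep it unchanged."""
    match = difflib.get_close_matches(word, HEADWORDS, n=1, cutoff=0.7)
    return match[0] if match else word


def content_words(sentence: str) -> List[str]:
    """Lowercase, strip punctuation, drop stop words, normalise by dictionary lookup."""
    words = [w.strip(".,;:").lower() for w in sentence.split()]
    return [stem_by_lookup(w) for w in words if w and w not in STOP_WORDS]


def summarise(sentences: List[str], keep: int = 1) -> List[str]:
    """Score = average content-word frequency plus a bonus for early sentences."""
    freq = Counter(w for s in sentences for w in content_words(s))
    scored = []
    for pos, sent in enumerate(sentences):
        words = content_words(sent)
        freq_score = sum(freq[w] for w in words) / max(len(words), 1)
        position_score = 1.0 / (pos + 1)
        scored.append((freq_score + position_score, pos, sent))
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:keep]
    # Return the chosen sentences in their original order.
    return [sent for _, _, sent in sorted(best, key=lambda t: t[1])]


document = [
    "Libraries scan books.",
    "The weather is nice.",
    "Scanned books fill the library.",
]
summary = summarise(document, keep=1)
```

With this input the first sentence scores highest, since its words recur in the third sentence once "Libraries", "books" and "Scanned" are mapped to their headwords.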
Search and Indexing
• Web Crawler
  ◦ Focused Crawling
  ◦ Incremental Crawling
  ◦ Crawls Telugu, Malayalam and Tamil web pages
• Content Based Image Retrieval
  ◦ Addresses queries in multiple formats (sample image or text)
  ◦ Uses features such as color and texture to match images
  ◦ Learns from user feedback
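For the focused crawler in the list above, one simple way to decide whether a fetched page belongs to a target language is to measure how much of its text falls in that script's Unicode block. The sketch below assumes this heuristic; the slides do not say how the crawler judges relevance, and the function name and the 0.5 threshold are hypothetical. The code point ranges are the standard Unicode blocks for Tamil, Telugu and Malayalam.

```python
from typing import Dict, Optional

# Unicode blocks for the three scripts named above.
SCRIPT_RANGES: Dict[str, range] = {
    "Tamil": range(0x0B80, 0x0C00),
    "Telugu": range(0x0C00, 0x0C80),
    "Malayalam": range(0x0D00, 0x0D80),
}


def dominant_script(text: str, min_share: float = 0.5) -> Optional[str]:
    """Return the target script that covers at least `min_share` of the text, if any."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    total = 0
    for ch in text:
        code = ord(ch)
        hit = next((name for name, rng in SCRIPT_RANGES.items() if code in rng), None)
        if hit is not None:
            counts[hit] += 1
            total += 1
        elif ch.isalpha():
            total += 1  # letters of other scripts count against the share
    if total == 0:
        return None
    name, best = max(counts.items(), key=lambda kv: kv[1])
    return name if best / total >= min_share else None


# "telugu" written in Telugu script followed by an English word.
mixed_page = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41 news"
english_page = "hello world"
page_script = dominant_script(mixed_page)
```

A crawler could follow links only from pages where `dominant_script` returns a target script, which keeps the crawl focused without needing a full language identifier.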
Search and Indexing
• ITRANS based search for DLI servers
• Searches the actual content rather than only the metadata
• Can be extended to multiple languages
• Lets users query in their native language; documents are stored in ITRANS and converted to the native script on the fly
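To make the on-the-fly conversion concrete, here is a minimal sketch that turns ITRANS-encoded text into Devanagari. It is an assumption-laden toy: it covers only a small hand-picked subset of the ITRANS table, Devanagari is chosen purely as an example script, and the function name is hypothetical. A consonant followed by another consonant gets a virama, and a vowel after a consonant becomes a vowel sign.

```python
from typing import Dict, Tuple

VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant

# Small subset of ITRANS consonants -> Devanagari letters.
CONSONANTS: Dict[str, str] = {
    "kh": "\u0916", "k": "\u0915", "g": "\u0917", "t": "\u0924", "d": "\u0926",
    "n": "\u0928", "p": "\u092a", "b": "\u092c", "m": "\u092e", "y": "\u092f",
    "r": "\u0930", "l": "\u0932", "v": "\u0935", "s": "\u0938", "h": "\u0939",
}

# ITRANS vowel -> (independent letter, dependent vowel sign).
VOWELS: Dict[str, Tuple[str, str]] = {
    "a": ("\u0905", ""), "aa": ("\u0906", "\u093e"),
    "i": ("\u0907", "\u093f"), "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"), "uu": ("\u090a", "\u0942"),
    "e": ("\u090f", "\u0947"), "ai": ("\u0910", "\u0948"),
    "o": ("\u0913", "\u094b"), "au": ("\u0914", "\u094c"),
}


def itrans_to_devanagari(text: str) -> str:
    """Convert a string in the supported ITRANS subset to Devanagari."""
    out = []
    pending = False  # True while the last consonant still awaits its vowel
    i = 0
    while i < len(text):
        two, one = text[i:i + 2], text[i]
        # Prefer two-letter tokens such as "kh" or "aa" over single letters.
        token = two if (two in CONSONANTS or two in VOWELS) else one
        if token in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster
            out.append(CONSONANTS[token])
            pending = True
        elif token in VOWELS:
            independent, sign = VOWELS[token]
            out.append(sign if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)
                pending = False
            out.append(token)  # pass through anything outside the subset
        i += len(token)
    if pending:
        out.append(VIRAMA)
    return "".join(out)


greeting = itrans_to_devanagari("namaste")
```

Storing one romanised form and rendering it per request is what allows the same index to serve several scripts; supporting another language would mean swapping in a different pair of tables.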
Multimodal Multimedia Tools
• Book Reading Interface
  ◦ Developed TIFF plugin (released as open source)
  ◦ Image server for ‘on the fly’
    ▪ Format conversion
    ▪ Resolution conversion
    ▪ Thumbnail generation, etc.
• Speech Interface
  ◦ Plugin for IE and Firefox for reading a book
  ◦ Text To Speech system (developed at IIIT using the CMU Festvox toolkit)
Tools for Download
• Tools available for download at http://dli.iiit.ac.in/download.html
