(India) Digital Library Mega Scanning Centre

Digital Library Mega Scanning Centre IIIT Hyderabad
Vamshi Ambati
Major Objectives of our Centre

  Digitizing to produce books of quality in quantity
  Development of core technologies needed for Digital Libraries
  Knowledge and Experience Dissemination
    Training
    Sharing resources
Progress
  Established centers at Osmania University, Telugu University, Salarjung Museum
  Content generation at SVDL, CCL, SCL
  Conducted a workshop for sharing resources and establishing common standards
  Generated content of about 32 Million Pages
  Host content at http://dli.iiit.ac.in
Effort Distribution of Digitization

  Web Enablement, 5%
  Quality Assurance, 15%
  Identification, 15%
  Metadata, 5%
  OCR, 10%
  Scanning, 30%
  Image Processing, 20%
Current Status
  170,000 books
  72,000 English books
  18 other languages
  http://dli.iiit.ac.in

Operations
  50 scanners
  15 centers
  300 people in all
Language Report
[Bar chart: RMSC books-wise report by language. Vertical axis: 0 to 80,000. Languages shown: European Languages, English, Sanskrit, Marathi, Telugu, Hindi, Persian, Urdu, Tamil, Kannada, Arabic, Others.]
Source Library Report
[Bar chart: RMSC source library books report. Vertical axis: 0 to 30,000. Source libraries shown: AOU, AP TEXT BOOKS, CCL-HYD, EPW, FAO, KANSAS, OUL, PSTU, STATE ARCHIVE, SALARJUNG MUSEUM, SCL-HYD, WASHINGTON, TTD, OTHERS.]
Scanning Centre Report
[Bar chart: RMSC scanning location books-wise report. Vertical axis: 0 to 35,000. Scanning centres shown include SVDL, SJM, TTD, PSTU and OTHERS.]
Technologies: Research
  Content Search in Images
  Text Mining
  Cross Lingual Retrieval
  Summarization tools
  UniTrans: Universal Transliteration tool
    Languages: Arabic, Persian, Urdu, Assamese, Bengali, Tamil, Telugu, Kannada, Malayalam, Sanskrit, Hindi, Marathi
Technologies: Workflow
  Workflow Tools
    Metadata creation, Structural metadata etc
    Server management
  Image Processing
    Image Processing tools
    Plug-in
  Server Management Tools
  Digital Library of India Portal
Rare Collections
  50 years of Andhra Pradesh State Legislature Proceedings (Multilingual data)
  Rare Telugu classics (like Kalidasa's work)
  Andhra Pradesh State Archive Books (rare collection as old as 1835)
  Text Books of the State Board of Education (1st to 10th grade)
Acknowledgements
  Ajay Pannala, CEO Par Informatics
  C S N Mohan, CEO Thrinaina Ltd
  T N Sreenivas, CEO SV Infosys
  Bhuman Reddy, Director SVDL
  Kiran V K, Planning Director DLI
  Nadendla Manohar, MLA Tenali
  Rajeev Sangal, Director IIIT
  C.V Jawahar, Professor IIIT
Thank you
Workshops held
  Tools and Resources for DLI
    5th May to 7th May 2005
    36 participants
  Research Challenges in DLI
    30th December 2006
    100 participants
  Speakers and Dignitaries
    Raj Reddy, Pradeep Chopra, Sunil Alag, Yagna Narayana among others
Center Specific Technology

Search Similar Images based on Image Patterns
Problem
  Huge amount of content generated by DLI
  Search the DLI
  Query is generally in the form of a text word
  Currently cannot convert all document images into text
  Can we match words in the image space by converting the query into an image?
Challenges
  Match two word images in the presence of
    Degradations
      Salt and Pepper noise
      Cuts and Breaks
      Blobs
      Erosion of Boundary pixels
    Print Variations
      Font Type
      Font Size
    Variability due to Language Cases
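The slides do not say how two word images are compared, so the following is only a minimal sketch of one common approach, not the Centre's method. It assumes each binarized word image is reduced to a vertical projection profile (share of ink pixels per column) and that profiles are compared with dynamic time warping (DTW), which tolerates width changes such as those caused by font size. All function names and the toy images are hypothetical.

```python
def column_profile(image):
    """Share of ink pixels in each column of a binary image (list of rows, 1 = ink)."""
    height = len(image)
    width = len(image[0])
    return [sum(row[x] for row in image) / height for x in range(width)]


def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences, divided by len(a) + len(b)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def word_image_distance(img_a, img_b):
    """Smaller values mean the two word images have more similar column profiles."""
    return dtw_distance(column_profile(img_a), column_profile(img_b))


# Toy 3x5 "word image" (hypothetical data).
word = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
]
# The same pattern rendered twice as wide, standing in for a larger font size.
wide = [[p for p in row for _ in range(2)] for row in word]
# A different pattern.
other = [
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

print(word_image_distance(word, wide))
print(word_image_distance(word, other))
```

A text query would first be rendered to an image in some font before its profile is computed; handling noise, cuts and blobs would need richer features than this single profile.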
Proposed Solution
Results and Discussion
Partial Matching
Demo
Core Technologies for Digital Library
Workflow and Tools
Workflow Management
  Vendor Progress Tracking
  Report Generation tool
Server Management Tools
  Server uptime monitoring, Server cluster solution
Metadata Management tools
  Regular metadata, Structural metadata
Quality Assurance
  Online metadata verification and correction interface
  Centralized Duplicate Detection tool
  Image quality assurance tool (QualCheck)
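The slides name a centralized duplicate detection tool without describing it. As a hedged illustration only, duplicates across centres could be flagged by grouping records on a normalized title and author key. The record fields (`id`, `title`, `author`) and the sample records are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize(text):
    """Case-fold, replace punctuation with spaces, and collapse whitespace.

    Only characters in Unicode punctuation categories are dropped, so vowel
    signs and other combining marks of Indic scripts are left untouched.
    """
    text = unicodedata.normalize("NFKC", text).casefold()
    kept = (" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    return " ".join("".join(kept).split())


def duplicate_groups(records):
    """Return lists of record ids that share the same normalized title and author."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["author"]))
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical metadata records from different scanning centres.
records = [
    {"id": "centre-a-001", "title": "State Archive Papers, Vol. 1", "author": "Anon."},
    {"id": "centre-b-042", "title": "state archive papers vol 1", "author": "ANON"},
    {"id": "centre-c-007", "title": "Collected Classics", "author": "Unknown"},
]

print(duplicate_groups(records))
```

Exact key matching misses spelling variants; an approximate comparison of titles would be a natural next step.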
Multilingual Information Retrieval
Cross Lingual Information Retrieval

  Universal Dictionary based
  Query expansion
    Explicit (user feedback) and Implicit (word frequency)
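A minimal sketch of how dictionary-based query translation and the two kinds of query expansion might fit together. This is an assumption-laden illustration, not the system described in the slides: the toy dictionary, the romanized target-language terms, and all function names are hypothetical.

```python
from collections import Counter


def translate_query(query, dictionary):
    """Replace each query term by all its dictionary translations; keep unknown terms."""
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, [word]))
    return terms


def expand_explicit(terms, user_selected):
    """Explicit expansion: add terms the user picked as relevant (user feedback)."""
    return terms + [t for t in user_selected if t not in terms]


def expand_implicit(terms, top_documents, extra=2):
    """Implicit expansion: add the most frequent new words from top-ranked documents."""
    counts = Counter(
        w for doc in top_documents for w in doc.lower().split() if w not in terms
    )
    return terms + [w for w, _ in counts.most_common(extra)]


# Toy bilingual dictionary with romanized target terms (hypothetical entries).
toy_dictionary = {"book": ["pustak", "kitab"], "library": ["pustakalaya"]}
top_docs = ["pustak sangrah pustak", "sangrah kitab suchi"]

translated = translate_query("Book library", toy_dictionary)
print(translated)
print(expand_implicit(translated, top_docs))
print(expand_explicit(translated, ["granth"]))
```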
Automatic Text Summarization
Summarization system for Telugu

  Frequency based
  Position based
  Most informative sentence identification
  Dictionary lookup
  Approximate String Matching to compensate for lack of Morph Analyzers
  Stop Word vs. Content Word identification
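The listed ingredients can be combined into a small extractive summarizer. The sketch below is one possible arrangement under stated assumptions, not the Telugu system itself: sentences are scored by average content-word frequency plus a position bonus, stop words are dropped, and approximate string matching (Python's `difflib`) stands in for a morphological analyzer by mapping inflected forms to a dictionary entry. English toy data is used for readability; the weights, cutoff and names are hypothetical.

```python
import difflib
from collections import Counter


def canonical(word, vocabulary, cutoff=0.7):
    """Map a word to its closest dictionary entry, or keep it if nothing is close enough."""
    match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else word


def summarize(sentences, stop_words, vocabulary, k=1, position_weight=0.5):
    """Return the k highest-scoring sentences in their original order."""
    tokens = [
        [canonical(w, vocabulary) for w in s.lower().split() if w not in stop_words]
        for s in sentences
    ]
    freq = Counter(w for sent in tokens for w in sent)
    scores = []
    for i, sent in enumerate(tokens):
        freq_score = sum(freq[w] for w in sent) / max(len(sent), 1)
        pos_score = 1.0 / (i + 1)  # earlier sentences get a larger bonus
        scores.append(freq_score + position_weight * pos_score)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


# Toy input (hypothetical data).
sentences = [
    "the library scans books",
    "scanned books are indexed",
    "people like tea",
]
stop_words = {"the", "are"}
vocabulary = ["scan", "book", "library", "index"]

print(summarize(sentences, stop_words, vocabulary, k=1))
```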
Search and Indexing
Web Crawler

  Focused Crawling
  Incremental Crawling
  Crawls Telugu, Malayalam and Tamil web pages
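As a rough sketch of the two crawling ideas, and not the crawler described in the slides, a focused crawler could keep a page only when enough of its characters fall in the Unicode blocks for Tamil, Telugu or Malayalam, and an incremental crawler could skip pages whose content hash has not changed since the last visit. The threshold, function names and sample URL are hypothetical; the Unicode block ranges are standard.

```python
import hashlib

# Unicode block ranges for the three target scripts.
SCRIPT_RANGES = {
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_share(text):
    """Share of non-space characters that fall in each target script block."""
    chars = [ch for ch in text if not ch.isspace()]
    shares = {}
    for name, (lo, hi) in SCRIPT_RANGES.items():
        hits = sum(1 for ch in chars if lo <= ord(ch) <= hi)
        shares[name] = hits / len(chars) if chars else 0.0
    return shares


def is_on_topic(text, threshold=0.5):
    """Focused crawling: keep a page only if one target script dominates it."""
    return max(script_share(text).values()) >= threshold


def needs_recrawl(url, text, seen_hashes):
    """Incremental crawling: report whether the page content changed since last seen."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    changed = seen_hashes.get(url) != digest
    seen_hashes[url] = digest
    return changed


# The word "Telugu" written in Telugu script, followed by an English word.
telugu_page = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41 news"
seen = {}
print(is_on_topic(telugu_page))
print(needs_recrawl("http://example.org/page", telugu_page, seen))
print(needs_recrawl("http://example.org/page", telugu_page, seen))
```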
Content Based Image Retrieval

  Addresses queries in multiple formats (sample image or text)
  Uses features such as color, texture to match images
  Learns from user feedback
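To make the color-feature idea concrete, here is a minimal sketch that ranks images by histogram intersection of coarse RGB histograms. It covers only the color feature; texture features and learning from user feedback are not shown. The bin count, function names and toy images are assumptions, and the slides do not state which similarity measure the actual system uses.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram; pixels is a list of (r, g, b) tuples in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means the two normalized histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_images(query_pixels, collection):
    """Return image names sorted from most to least similar to the query image."""
    q = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(q, color_histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy images as flat pixel lists (hypothetical data).
query = [(250, 10, 10)] * 3 + [(10, 10, 250)]
collection = {
    "red_page": [(240, 20, 20)] * 4,
    "blue_page": [(20, 20, 240)] * 4,
}

print(rank_images(query, collection))
```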
Search and Indexing
ITRANS based search for DLI servers

  Search on actual content as opposed to search on metadata
  Ability to extend for multiple languages
  Allows users to query in their native languages and converts the documents, which are actually stored in ITRANS, to the native language on the fly
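The on-the-fly conversion can be pictured with a toy ITRANS-to-Devanagari converter. This is a sketch covering only a small, illustrative subset of the ITRANS scheme (a few consonants and vowels, no aspirated retroflexes, nasal signs or special conjuncts); the table contents follow standard ITRANS and Unicode assignments, but the function names and the tiny search helper are hypothetical and not the DLI implementation.

```python
# Subset of ITRANS consonants mapped to Devanagari letters.
CONSONANTS = {
    "k": "\u0915", "kh": "\u0916", "g": "\u0917", "gh": "\u0918",
    "t": "\u0924", "th": "\u0925", "d": "\u0926", "dh": "\u0927", "n": "\u0928",
    "p": "\u092a", "ph": "\u092b", "b": "\u092c", "bh": "\u092d", "m": "\u092e",
    "y": "\u092f", "r": "\u0930", "l": "\u0932", "v": "\u0935",
    "s": "\u0938", "h": "\u0939",
}
# Vowel -> (independent letter, dependent vowel sign); inherent "a" has no sign.
VOWELS = {
    "a": ("\u0905", ""), "aa": ("\u0906", "\u093e"),
    "i": ("\u0907", "\u093f"), "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"), "uu": ("\u090a", "\u0942"),
    "e": ("\u090f", "\u0947"), "o": ("\u0913", "\u094b"),
}
VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant


def _match(text, pos, table):
    """Longest token (two characters, then one) at pos that is a key of table."""
    for length in (2, 1):
        token = text[pos:pos + length]
        if token in table:
            return token
    return None


def itrans_to_devanagari(text):
    """Convert ITRANS text to Devanagari for the subset defined above."""
    out = []
    pos = 0
    while pos < len(text):
        cons = _match(text, pos, CONSONANTS)
        if cons:
            pos += len(cons)
            out.append(CONSONANTS[cons])
            vowel = _match(text, pos, VOWELS)
            if vowel:
                pos += len(vowel)
                out.append(VOWELS[vowel][1])
            else:
                out.append(VIRAMA)
            continue
        vowel = _match(text, pos, VOWELS)
        if vowel:
            pos += len(vowel)
            out.append(VOWELS[vowel][0])
            continue
        out.append(text[pos])  # pass through anything outside the subset
        pos += 1
    return "".join(out)


def search(query_itrans, documents_itrans):
    """Match on the stored ITRANS text, then render the hits in native script."""
    return [itrans_to_devanagari(d) for d in documents_itrans if query_itrans in d]


print(search("pustaka", ["pustaka", "bhaarata"]))
```

Storing one romanized form and rendering at display time keeps the index script-independent; supporting another script would mean adding another output table.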
Multimodal Multimedia Tools
Book Reading Interface
  Developed TIFF Plugin (released open source)
  Image Server for on-the-fly
    Format conversions
    Resolution conversion
    Thumbnail generation etc
Speech Interface
  Plugin for IE and Firefox for Reading a Book
  Text To Speech System (developed at IIIT using the CMU Festvox toolkit)
Tools for Download
Tools available for download at

http://dli.iiit.ac.in/download.html
