
Digital Library Mega Scanning Centre IIIT Hyderabad

Vamshi Ambati

Major Objectives of our Centre


Digitizing to produce books of quality in quantity
Development of core technologies needed for Digital Libraries
Knowledge and Experience Dissemination
  Training
  Sharing resources

Progress
Established centers at Osmania University, Telugu University, Salarjung Museum
Content generation at SVDL, CCL, SCL
Conducted a workshop for sharing resources and establishing common standards
Generated about 32 million pages of content
Content hosted at http://dli.iiit.ac.in

Effort Distribution of Digitization


[Pie chart: share of digitization effort by stage]
Scanning, 30%
Image Processing, 20%
Identification, 15%
Quality Assurance, 15%
OCR, 10%
Metadata, 5%
Web Enablement, 5%

Current Status
170,000 books
72,000 English books
18 other languages
http://dli.iiit.ac.in

Operations

50 scanners
15 centers
300 people in all

Language Report

[Bar chart: RMSC book-wise report by language. Vertical axis: number of books, 0 to 80,000. Languages shown: European languages, English, Sanskrit, Marathi, Telugu, Hindi, Persian, Urdu, Tamil, Kannada, Arabic, Others.]

Source Library Report

[Bar chart: RMSC source library books report. Vertical axis: number of books, 0 to 30,000. Source libraries shown: AOU, AP Text Books, CCL-HYD, EPW, FAO, Kansas, OUL, PSTU, State Archive, Salarjung Museum, SCL-HYD, Washington, TTD, Others.]

Scanning Centre Report

[Bar chart: RMSC scanning location book-wise report. Vertical axis: number of books, 0 to 35,000. One bar per scanning centre, including SVDL, TTD and PSTU.]

Technologies: Research
Content Search in Images
Text Mining
Cross Lingual Retrieval
Summarization tools
UniTrans: Universal Transliteration tool
  Languages: Arabic, Persian, Urdu, Assamese, Bengali, Tamil, Telugu, Kannada, Malayalam, Sanskrit, Hindi, Marathi

Technologies: Workflow

Workflow Tools
  Metadata creation, structural metadata, etc.
  Server management
Image Processing
  Image processing tools
  Plug-in
Server Management Tools
Digital Library of India Portal

Rare Collections
50 years of Andhra Pradesh State Legislature proceedings (multilingual data)
Rare Telugu classics (like Kalidasa's work)
Andhra Pradesh State Archive books (rare collection as old as 1835)
Text books of the State Board of Education (1st to 10th grade)

Acknowledgements
Ajay Pannala, CEO, Par Informatics
C S N Mohan, CEO, Thrinaina Ltd
T N Sreenivas, CEO, SV Infosys
Bhuman Reddy, Director, SVDL
Kiran V K, Planning Director, DLI
Nadendla Manohar, MLA, Tenali
Rajeev Sangal, Director, IIIT
C.V Jawahar, Professor, IIIT

Thank you

Workshops held

Tools and Resources for DLI
  5th May to 7th May 2005
  36 participants
Research Challenges in DLI
  30th December 2006
  100 participants
Speakers and Dignitaries
  Raj Reddy, Pradeep Chopra, Sunil Alag, Yagna Narayana among others

Center Specific Technology


Search Similar Images based on Image Patterns

Problem
Huge amount of content generated by DLI
Users need to search the DLI
The query is generally in the form of a text word
Currently, not all document images can be converted into text
Can we match words in the image space by converting the query into an image?

Challenges
Match two word images in the presence of:
  Degradations: salt-and-pepper noise, cuts and breaks, blobs, erosion of boundary pixels
  Print variations: font type, font size
  Variability due to language cases
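
The slides leave the proposed solution to the figures, so the following is only a sketch of one common way to compare two word images under the challenges listed above; it is not necessarily the method used at the centre. It assumes binarized word images given as lists of rows (1 = ink), reduces each image to a per-column ink profile, and aligns the two profiles with dynamic time warping so that differences in width (for example, font size) are tolerated. The names `column_profile`, `dtw_distance` and `word_distance`, and the toy images, are hypothetical.

```python
def column_profile(image):
    """Fraction of ink pixels in each column of a binary image (list of rows)."""
    height = len(image)
    width = len(image[0])
    return [sum(row[x] for row in image) / height for x in range(width)]


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences, normalized by length."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignment moves
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def word_distance(img_a, img_b):
    """Smaller values mean the two word images look more alike."""
    return dtw_distance(column_profile(img_a), column_profile(img_b))


# Toy 4-row "word images" (made-up data, for illustration only)
word = [
    [1, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 1, 0],
]
# Same shape stretched to double width, plus one salt-noise pixel
stretched = [[p for p in row for _ in (0, 1)] for row in word]
stretched[0][2] = 1
other = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
]
```

In this toy setup, `word_distance(word, stretched)` comes out smaller than `word_distance(word, other)`, which is the behaviour a word-spotting system needs. Real degradations such as cuts, blobs and boundary erosion would call for richer per-column features than a single ink fraction.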

Proposed Solution

Results and Discussion

Partial Matching

Demo

Core Technologies for Digital Library

Workflow and Tools

Workflow Management
  Vendor progress tracking
  Report generation tool

Server Management Tools
  Server uptime monitoring, server cluster solution

Metadata Management Tools
  Regular metadata, structural metadata

Quality Assurance
  Online metadata verification and correction interface
  Centralized duplicate detection tool
  Image quality assurance tool (QualCheck)
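
The slides do not say how the centralized duplicate detection tool works. As a rough sketch of one plausible approach, the code below groups book records whose normalized title and author coincide. The record fields (`id`, `title`, `author`), the function names, and the sample records are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def normalize(text):
    """Casefold, replace punctuation and symbols with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    # Dropping only P* and S* categories keeps Indic vowel signs intact
    kept = [" " if unicodedata.category(ch)[0] in "PS" else ch for ch in text]
    return " ".join("".join(kept).split())


def find_duplicates(records):
    """Return lists of record ids that share a normalized (title, author) key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["author"]))
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up records from two hypothetical scanning centres
records = [
    {"id": "A1", "title": "An Example Gazetteer, Vol. 1", "author": "Doe, J."},
    {"id": "B7", "title": "an example gazetteer vol 1", "author": "Doe J"},
    {"id": "C3", "title": "A Different Book", "author": "Roe, R."},
]
duplicates = find_duplicates(records)
```

Exact matching on normalized keys misses spelling variants and transliteration differences; a production tool would likely add fuzzy matching on top of this.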

Multilingual Information Retrieval

Cross Lingual Information Retrieval
  Universal Dictionary based
  Query expansion: explicit (user feedback) and implicit (word frequency)
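
A minimal sketch of how dictionary-based query translation with explicit and implicit expansion might fit together. The slides give no implementation details, so the function names, the toy bilingual dictionary (romanized Telugu to English) and the sample documents are assumptions.

```python
from collections import Counter


def translate_query(query, dictionary):
    """Replace each source-language term with all of its dictionary translations."""
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, [word]))  # unknown words pass through
    return terms


def expand_implicit(terms, documents, top_k=1):
    """Add the top_k most frequent words from documents that match any term."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        if any(t in words for t in terms):
            counts.update(w for w in words if w not in terms)
    return terms + [w for w, _ in counts.most_common(top_k)]


def expand_explicit(terms, feedback_terms):
    """Add terms the user marked as relevant, skipping ones already present."""
    return terms + [t for t in feedback_terms if t not in terms]


# Toy data (hypothetical)
dictionary = {"pustakam": ["book"], "charitra": ["history"]}
documents = ["history of india book", "india temple history", "cooking recipes"]

terms = translate_query("pustakam charitra", dictionary)
terms = expand_implicit(terms, documents, top_k=1)
terms = expand_explicit(terms, ["temple", "book"])
```

Implicit expansion here uses raw word frequency over matching documents; a real system would at least remove stop words first so that words like "of" cannot be picked.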

Automatic Text Summarization

Summarization system for Telugu
  Frequency based
  Position based
  Most informative sentence identification
  Dictionary lookup
  Approximate string matching to compensate for lack of morph analyzers
  Stop word vs. content word identification
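
The ingredients listed above can be combined into a simple extractive summarizer. The sketch below is an assumption about how they might fit together, not the centre's system: it drops stop words, maps inflected forms to dictionary head words with approximate string matching (standing in for a morphological analyzer), and scores each sentence by average word frequency plus a position bonus. English text is used as a stand-in for Telugu; the lexicon, stop-word list, cutoff and weights are hypothetical.

```python
import difflib
from collections import Counter


def content_words(sentence, stop_words, lexicon, cutoff=0.7):
    """Return content words, folded to the closest lexicon head word if any."""
    words = []
    for w in sentence.lower().split():
        w = w.strip(".,;:!?")
        if not w or w in stop_words:
            continue
        close = difflib.get_close_matches(w, lexicon, n=1, cutoff=cutoff)
        words.append(close[0] if close else w)
    return words


def summarize(sentences, stop_words, lexicon, n=1, position_weight=0.5):
    """Pick the n highest-scoring sentences and return them in document order."""
    tokenized = [content_words(s, stop_words, lexicon) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    scored = []
    for idx, words in enumerate(tokenized):
        freq_score = sum(freq[w] for w in words) / max(len(words), 1)
        pos_score = 1.0 / (idx + 1)  # earlier sentences score higher
        scored.append((freq_score + position_weight * pos_score, idx))
    best = sorted(scored, reverse=True)[:n]
    return [sentences[idx] for _, idx in sorted(best, key=lambda t: t[1])]


# Toy input (hypothetical)
sentences = [
    "The library scans rare books.",
    "Scanned books are stored on servers.",
    "Lunch was at noon.",
]
stop_words = {"the", "are", "on", "was", "at"}
lexicon = ["scan", "book", "library", "server"]
summary = summarize(sentences, stop_words, lexicon, n=1)
```

The approximate matching is what lets "scans" and "scanned" count as the same word; the cutoff trades missed variants against false merges and would need tuning per language.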

Search and Indexing

Web Crawler

Focused Crawling
Incremental Crawling
Crawls Telugu, Malayalam and Tamil web pages
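
One way a focused crawler could decide whether a page is in a target language is to look at the Unicode script of its characters, and one way to crawl incrementally is to re-index a page only when its content hash changes. The sketch below illustrates both ideas under those assumptions; the slides do not describe the crawler's actual logic, and the function names and the 0.3 threshold are hypothetical.

```python
import hashlib

# Unicode block ranges for the three target scripts
SCRIPT_RANGES = {
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Malayalam": (0x0D00, 0x0D7F),
}


def dominant_script(text):
    """Return the target script with the most characters in text, or None."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
    name, count = max(counts.items(), key=lambda kv: kv[1])
    return name if count > 0 else None


def should_crawl(text, min_ratio=0.3):
    """Focused crawling: keep pages with enough target-script characters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    hits = sum(
        1 for ch in chars
        if any(lo <= ord(ch) <= hi for lo, hi in SCRIPT_RANGES.values())
    )
    return hits / len(chars) >= min_ratio


def needs_reindex(url, html, seen_hashes):
    """Incremental crawling: True only if the page content changed since last visit."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    changed = seen_hashes.get(url) != digest
    seen_hashes[url] = digest
    return changed
```

A page would be passed to `should_crawl` after stripping markup, and `needs_reindex` would be called with a dictionary that persists between crawl runs.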

Content Based Image Retrieval


Addresses queries in multiple formats (sample image or text)
Uses features such as color and texture to match images
Learns from user feedback
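
As an illustration of matching images on a color feature, the sketch below ranks a small collection against a sample image using normalized RGB histograms and histogram intersection. This is a generic technique chosen for the example, not a description of the centre's system; texture features and learning from user feedback are not covered, and the function names and toy pixel data are hypothetical.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram; pixels is a list of (r, g, b) in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank(query_pixels, collection):
    """Return image names ordered from most to least similar to the query."""
    q = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(q, color_histogram(pixels)), name)
        for name, pixels in collection.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy images as flat pixel lists (made-up data)
query = [(250, 10, 10)] * 4
collection = {
    "mostly_red": [(240, 20, 20)] * 3 + [(10, 10, 240)],
    "all_blue": [(10, 10, 240)] * 4,
}
ranking = rank(query, collection)
```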

Search and Indexing

ITRANS based search for DLI servers
  Search on actual content as opposed to search on metadata
  Ability to extend for multiple languages
  Allows users to query in their native languages and converts the documents, actually stored in ITRANS, to the native language on the fly
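
To make the on-the-fly conversion concrete, here is a minimal sketch that converts a small subset of ITRANS to Devanagari. It covers only a handful of consonants and vowels, handles none of the other DLI scripts, and is not the converter used on the DLI servers; the table contents and the function name are chosen for illustration.

```python
VIRAMA = "\u094d"  # suppresses a consonant's inherent "a"

# ITRANS consonant -> Devanagari letter (small illustrative subset)
CONSONANTS = {
    "k": "\u0915", "g": "\u0917", "t": "\u0924", "d": "\u0926", "n": "\u0928",
    "p": "\u092a", "m": "\u092e", "r": "\u0930", "l": "\u0932", "s": "\u0938",
}

# ITRANS vowel -> (independent letter, dependent sign used after a consonant)
VOWELS = {
    "a": ("\u0905", ""), "aa": ("\u0906", "\u093e"),
    "i": ("\u0907", "\u093f"), "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"), "uu": ("\u090a", "\u0942"),
    "e": ("\u090f", "\u0947"), "o": ("\u0913", "\u094b"),
}


def itrans_to_devanagari(text):
    """Convert the supported ITRANS subset; other characters pass through."""
    out = []
    i = 0
    prev_consonant = False
    while i < len(text):
        for size in (2, 1):  # longest match first, so "aa" wins over "a"
            token = text[i:i + size]
            if len(token) == size and token in VOWELS:
                independent, sign = VOWELS[token]
                out.append(sign if prev_consonant else independent)
                prev_consonant = False
                break
            if len(token) == size and token in CONSONANTS:
                if prev_consonant:
                    out.append(VIRAMA)  # consonant cluster
                out.append(CONSONANTS[token])
                prev_consonant = True
                break
        else:
            # Unsupported character: close any open consonant and copy it
            if prev_consonant:
                out.append(VIRAMA)
            out.append(text[i])
            prev_consonant = False
            size = 1
        i += size
    if prev_consonant:
        out.append(VIRAMA)
    return "".join(out)
```

For example, `itrans_to_devanagari("raama")` yields the three code points U+0930, U+093E, U+092E. Searching in the native script would additionally need the reverse mapping, so that a native-language query can be matched against the stored ITRANS text.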

Multimodal Multimedia Tools

Book Reading Interface
  Developed TIFF plugin (released as open source)
  Image server for on-the-fly format conversion, resolution conversion, thumbnail generation, etc.

Speech Interface
  Plugin for IE and Firefox for reading a book
  Text-to-speech system (developed at IIIT using the CMU Festvox toolkit)

Tools for Download

Tools available for download at http://dli.iiit.ac.in/download.html
