(India) Digital Library Mega Scanning Centre
Vamshi Ambati
Training
Sharing resources
Progress
Established centers at Osmania University, Telugu University, and Salarjung Museum
Content generation at SVDL, CCL, and SCL
Conducted a workshop for sharing resources and establishing common standards
Generated about 32 million pages of content
Content hosted at http://dli.iiit.ac.in
Current Status
170,000 books
72,000 English books
18 other languages
http://dli.iiit.ac.in
Operations
Language Report
[Chart of books by language. Axis labels: English, European Languages, Sanskrit, Marathi, Telugu, Hindi, Persian, Urdu, Tamil, Kannada, Arabic, Others]
Technologies: Research
Content Search in Images
Text Mining
Cross Lingual Retrieval
Summarization tools
UniTrans: Universal Transliteration tool
Languages:
Arabic, Persian, Urdu, Assamese, Bengali, Tamil, Telugu, Kannada, Malayalam, Sanskrit, Hindi, Marathi
Technologies: Workflow
Workflow Tools
Metadata
Image Processing
Image
Plug-in
Processing tools
Rare Collections
50 years of Andhra Pradesh State Legislature Proceedings (multilingual data)
Rare Telugu classics (like Kalidasa's works)
Andhra Pradesh State Archive books (rare collection dating back to 1835)
Textbooks of the State Board of Education (1st to 10th grade)
Acknowledgements
Ajay Pannala, CEO, Par Informatics
C S N Mohan, CEO, Thrinaina Ltd
T N Sreenivas, CEO, SV Infosys
Bhuman Reddy, Director, SVDL
Kiran V K, Planning Director, DLI
Nadendla Manohar, MLA, Tenali
Rajeev Sangal, Director, IIIT
C. V. Jawahar, Professor, IIIT
Thank you
Workshops held
Raj Reddy, Pradeep Chopra, Sunil Alag, Yagna Narayana among others
Problem
DLI has generated a huge amount of content
Goal: search the DLI
Queries are generally text words
Currently, not all document images can be converted into text
Can we match words in the image space by converting the query into an image?
Challenges
Match two word images in the presence of:
Degradations
Salt and pepper noise
Cuts and breaks
Blobs
Erosion of boundary pixels
Print Variations
Font type
Font size
Proposed Solution
Partial Matching
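The slides do not say how partial matching is carried out. As a rough illustration only, the sketch below compares two binary word images by their per-column ink counts using dynamic time warping (DTW), a technique swapped in here as an assumption, not confirmed by the slides. The names `column_profile` and `dtw_distance`, and the `partial` flag that lets the query end anywhere inside the candidate, are hypothetical.

```python
def column_profile(image):
    """Ink-pixel count per column of a binary image (list of rows, 1 = ink)."""
    return [sum(column) for column in zip(*image)]


def dtw_distance(query, candidate, partial=False):
    """Length-normalized DTW cost between two numeric sequences.

    With partial=True the query may end anywhere in the candidate, so an
    unmatched tail of the candidate (for example a suffix) is ignored.
    This is an assumed reading of "partial matching", for illustration.
    """
    n, m = len(query), len(candidate)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(query[i - 1] - candidate[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # query column repeated in the alignment
                cost[i][j - 1],      # candidate column repeated
                cost[i - 1][j - 1],  # both advance
            )
    if partial:
        return min(cost[n][1:]) / n
    return cost[n][m] / (n + m)


# Toy 3x4 "word image" and the same image stretched to twice the width,
# standing in for a font-size change.
word = [[0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 1]]
stretched = [[pixel for pixel in row for _ in range(2)] for row in word]

same_word_cost = dtw_distance(column_profile(word), column_profile(stretched))
```

Because DTW lets one column align with several, the horizontally stretched copy aligns with the original at zero cost, which is the property that makes this family of methods tolerant of font-size variation. Handling noise, cuts, and blobs would need more robust column features than raw ink counts.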
Demo
Workflow Management
Server uptime monitoring
Server cluster solution
Regular metadata
Structural metadata
Quality Assurance
Online metadata verification and correction interface
Centralized duplicate detection tool
Image quality assurance tool (QualCheck)
based
Position based
Most informative sentence identification
Dictionary lookup
Approximate string matching to compensate for lack of morph analyzers
Stop word vs. content word identification
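To illustrate how approximate string matching can stand in for a morphological analyzer during dictionary lookup, here is a minimal sketch using Levenshtein edit distance. The slides do not name the distance measure or threshold actually used; `edit_distance`, `approximate_lookup`, the `max_distance` cutoff, and the sample dictionary are all assumptions made for this example.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                                # deletion
                current[j - 1] + 1,                             # insertion
                previous[j - 1] + (source_char != target_char)  # substitution
            ))
        previous = current
    return previous[-1]


def approximate_lookup(word, dictionary, max_distance=2):
    """Return the closest dictionary entry within max_distance, else None."""
    if not dictionary:
        return None
    best = min(dictionary, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else None


# Hypothetical dictionary of root forms; an inflected query still resolves.
roots = ["book", "library", "scan"]
match = approximate_lookup("books", roots)
```

An inflected form such as "books" lies one edit away from its root "book", so the lookup succeeds without any explicit stemming rules. The threshold trades recall against false matches and would need tuning per language.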
Web Crawler
Focused crawling
Incremental crawling
Crawls Telugu, Malayalam, and Tamil web pages
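A focused crawler has to decide whether a fetched page is in a target language. The slides do not describe how this crawler makes that decision; one simple possibility, sketched here as an assumption, is to check which Unicode script block most of a page's characters fall in. The function name `dominant_script` and the `min_fraction` threshold are hypothetical.

```python
# Unicode block ranges for the three scripts named on the slide.
SCRIPT_RANGES = {
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Malayalam": (0x0D00, 0x0D7F),
}


def dominant_script(text, min_fraction=0.5):
    """Return the target script covering at least min_fraction of the
    non-whitespace characters in text, or None if no script qualifies."""
    counts = dict.fromkeys(SCRIPT_RANGES, 0)
    total = 0
    for char in text:
        if char.isspace():
            continue
        total += 1
        code_point = ord(char)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[script] += 1
                break
    if total == 0:
        return None
    script, count = max(counts.items(), key=lambda item: item[1])
    return script if count / total >= min_fraction else None


page_script = dominant_script("తెలుగు భాష")
```

A crawler could keep and follow links only from pages where this returns one of the target scripts. This works here because each of the three languages has its own script block; languages that share a script would need a different test.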
on actual content as opposed to search on metadata
Can be extended to multiple languages
Allows users to query in their native languages
Documents are stored in ITRANS and converted to the native language on the fly
Developed TIFF Plugin (released as open source)
Image Server for on the fly
Speech Interface
Plugin for IE and Firefox for reading a book aloud
Text-to-speech system (developed at IIIT using the CMU Festvox toolkit)