Achievements of the TDIL

Resource Centres
• Human Machine Interface Systems (HUMIS)
• Knowledge Resources (KR)
• Knowledge Tools (KT)
• Language Engineering (LE)
• Localisation (L)
• Translation Support Systems (TSS)
• Standardisation
• Standardisation
ISSN No. 0972-6454
Website : http://
Associate Patron
• S. Lakshminarayanan, Additional Secretary
ISSN No. 0972-6454 has been granted to VishwaBharat@tdil. Google search engine refers to the contents of this journal.
Arun Shourie, Hon'ble Minister
Ministry of Communications & Information Technology
(Government of India)
Over 95% of India’s population cannot work in English; hence there is a need for a shift
from English-centric computing to multilingual computing. Technology development
for Indian languages was initiated at the individual level during the early 1960s and ’70s;
the Government later sponsored R&D projects during the 1980s-1990s and has further
accelerated these activities in mission mode since 2000. Innovation is key to the
growth of new technologies, and incubation of these technologies is essential for
productising them. Incubating innovations in Indian language technology follows the
‘S’-curve model with four growth perspectives : Improving Core Capabilities,
Collaborating Capabilities, Developing New Competitive Capabilities, and Creating
Revolutionary Changes.
Technology development for Indian languages may be categorized in A-B-C Technology Phases.
During 1976-1990 (A-Technology Phase), the focus was on Adaptation Technologies:
abstraction of requisite technological designs and competence building in R&D
institutions. [This may correspond to phase-1 of the ‘S’-curve model - Improving core
capabilities.]
During 1991-2000 (B-Technology Phase), the focus was on developing Basic Technologies:
generic information processing tools, interface technologies and cross-compatibility
conversion utilities. The TDIL (Technology Development for Indian Languages)
programme was initiated. [This may correspond to phase-2 of the ‘S’-curve model -
Collaborating capabilities.]
During 2001-2010 (C-Technology Phase), the focus is on developing Creative
Technologies in the context of convergence of computing, communication and
content technologies. Collaborative technology development is being encouraged.
During this period, Resource Centres for Indian Language Technology Solutions
have been set up. These create a virtual organizational Language Technology
environment. [This may correspond to phase-3 of the ‘S’-curve model - Developing
new competitive capabilities.]
During 2006- 2010, focus may be on Demonstrative technologies for newer products
or services. Futuristic projects such as Intelligent Cognitive Systems and Speech to
Speech Translation systems are being conceptualized.
Monitoring of the TDIL programme relied on ZOPP (Objectives-Oriented Project
Planning) workshops for building consensus on approach and knowledge sharing, peer
review, and regional clustering for closer interaction in technology sharing. Once a
technology is developed by a particular centre, it is shared with the other centres as well.
Clustering of regional Resource Centres enables peer competition and collaboration for
innovation and product-oriented development. A Basic Information Processing Kit,
OCR, Text-to-Speech and messaging systems will shortly be available for all Indian
languages. On-line Machine Aided Translation is available from English to Hindi.
Other translation pairs are also being developed. The Language Technology Business Meet
(LTBM 2001) provided a forum for closer interaction between academia, industry and
government to facilitate innovation, incubation and proliferation. Technology
handshakes were encouraged; 41 technology handshake agreements were signed. The international
community has also recognised India’s growing expertise in multilingual computing.
Research groups from France, Germany, the U.K., Japan, etc. have approached Indian
scholars for possible R&D collaboration.
This issue of VishwaBharat@tdil consolidates the technologies developed so far mainly
in the Resource Centres under the TDIL programme. We hope it will be highly useful to
the Research Groups in academia as well as to the industry for integrating these
technologies into newer products and services.
K.K. Jaswal, Secretary
Department of Information Technology
Ministry of Communications & Information Technology
(Government of India)
6, CGO Complex, New Delhi-110003
• Y.S. Bhave, Joint Secretary & Financial Advisor
Technology Treading on Multi-lingualism
1. Calendar of Events 2
2. Reader’s Feedback 4
3. TDIL Vision 5
4. About TDIL Programme 7
5. Achievements of the TDIL Resource Centres
RCILTS for Hindi & Nepali (IIT, Kanpur) 13
RCILTS for Gujarati (MS Univ., Baroda) 25
RCILTS for Marathi & Konkani (IIT, Mumbai) 33
RCILTS for Punjabi (TIET, Patiala) 43
RCILTS for Bengali (ISI, Kolkata) 51
RCILTS for Oriya (Utkal Univ. & OCAC, Bhubaneswar) 69
RCILTS for Assamese & Manipuri (IIT, Guwahati) 83
RCILTS for Kannada (IISc, Bangalore) 95
RCILTS for Telugu (Univ. of Hyderabad, Hyderabad) 109
RCILTS for Malayalam (C-DAC, Thiruvananthapuram) 135
RCILTS for Tamil (Anna Univ. Chennai) 149
RCILTS for Urdu, Sindhi & Kashmiri (C-DAC, Pune) 167
RCILTS for Sanskrit, Japanese & Chinese (JNU, Delhi) 181
6. TDIL Associates Achieve Honours…Congratulations ! 190
7. Resource Centres Technology Index 192
Editorial Team
Om Vikas
P.K. Chaturvedi
Pardeep Chopra
V.K. Sharma
Manoj Jain
S. Chandra
Contents July 2003, Āṣāḍha (आषाढ़) 10
Incubating Innovations - ‘S’ curve model : Technology
Development for Indian Languages
Information Technology has brought about the
revolution that has transformed commodities, people
and their relations. Our sages envisioned the world as
a Global Village: “Vasudhaiv Kutumbkam”. That
vision is poised to become a reality as communities
and countries network and join up. What we need is the right mindset
to impel us towards the state of “Sarve Bhavantu Sukhinah”.
The opportunities that today’s knowledge economy offers enable
us to make the knowledge and innovation available far and wide,
without any barriers of nationality, language, culture, caste and creed.
IT with such a humanitarian perspective will lead to a society where
peace and harmony obtain. I hope that with the success of the
Technology Development for Indian Languages Mission programme,
India will emerge as a pioneering hub of multilingual computing and
provide appropriate technology solutions for overcoming linguistic
and cultural barriers, and thereby ensure knowledge for all.
(A.B. Vajpayee)
प्रधान मंत्री
Prime Minister
New Delhi
July 29, 2003
1. Calendar of Events
! Workshop on Spoken Language Processing, Tata Institute
of Fundamental Research, Mumbai, January 9-11, 2003.
Website : http://
! Indo Wordnet workshop, CIIL Mysore, Jan. 14-15, 2003.
Website : http://
! Second International Workshop on Technology Development
in Indian Languages (IWTDIL-2003), Kolkata,
Jan. 22-24, 2003.
Website : http:// -cvpr/events/iwtdl.htm
" Text Retrieval Conference, conducted by National
Institute of Standards and Technology (NIST)
with support from ARDA, U.S.A., February 17-21,
November 2003.
Website : http://
! National Workshop on Application of Language
Technology in Indian Languages, CALTS, University
of Hyderabad, Central University, Hyderabad,
March 6-8, 2003.
E-mail :
! 13th International Workshop on Research Issues in Data
Engineering: Multilingual Information Management
(RIDE-MLIM’2003), Hyderabad, India, March
10-11, 2003.
Website : http:// conferences/ride2003.html
! First National Indic-Font Workshop at PES Institute of
Technology, Bangalore, March 26-30, 2003.
E-mail :
" EACL 2003 - 11th Conference of the European Chapter
of the Association for Computational Linguistics, Agro
Hotel, Budapest, Hungary, April 12-17, 2003.
Website : http:// eacl/eacl03.php3
" EACL 2003 Workshop on Computational Linguistics for
South Asian Languages - Expanding Synergies with Europe,
Agro Hotel, Budapest, Hungary, April 14, 2003.
" The 12th International Conference on Management of
Technology, International Association for Management
of Technology (IAMOT), Nancy, May 13 – 15, 2003.
Website : http:// IAMOT2003/cfpl1.html
! WILTS-2003, Utkal University, Bhubaneswar, May
15-17, 2003.
Website : http:// wilts-2003.htm
" Human Language Technology Conference (NAACL),
Edmonton, Canada, May 27- June 1, 2003.
Website : http://www.sims.berkeley.edu/research/
conferences/hlt-naacl03/dat ml
" 3rd International LRC 2003, Localisation Summer School,
University of Limerick, Ireland, 3-6 June, 2003.
Website : http://
" Athabascan Languages Conference, Arcata, California,
USA, June 5-7, 2003.
Website : http:// anlc/alc/
! Summer School in Universal Networking Language, IIT
Mumbai, 5-13 June, 2003.
E-mail :
" 8th Biennial Conference of the International Association
for Language Learning Technology (IALLT 2003),
University of Michigan, USA; Conference Workshop: June
17-18, 2003; Conference Sessions: June 19-21, 2003.
Website : http:// lrc/iallt/
" Terminology & Localization Conference/Workshops,
Terminology Summer Academy, Kent State University,
Institute for Applied Linguistics, Kent, Ohio 44242, June
17-19 and June 20-21, 2003.
Website : http://appling.kent.edu/ResourcePages/TSA-
2003/TSAWeb/TerminologyAndLocalizationHome.html
" NLDB 2003, 8th International Conference on Applications
of Natural Language to Information Systems, June
23-25, 2003, Burg (Spreewald), Germany.
Website : http://
" Association for Computational Linguistics (ACL 2003),
Sapporo Convention Center, Sapporo, Japan,
July 7-12, 2003.
Website : http:// ACL2003/
ACL 2003 - Associated Conferences
" EMNLP 2003: The Eighth Conference on Empirical
Methods in Natural Language Processing, Nara Institute
of Science and Technology, Japan.
" 6th International Workshop on Information Retrieval with
Asian Languages (IRAL 2003), National Institute of
Informatics, Japan.
ACL 2003 - Associated Workshops
" WS-1 Multilingual Summarization and Question
Answering - Machine Learning and Beyond.
" WS-2 Natural Language Processing in Biomedicine.
" WS-3 The Lexicon and Figurative Language.
" WS-4 Multilingual and Mixed-language Named Entity
Recognition: Combining Statistical and Symbolic Models.
" WS-5 The Second International Workshop on
Paraphrasing: Paraphrase Acquisition and Applications.
" WS-6 Second SIGHAN Workshop on Chinese Language Processing.
" WS-7 Multiword Expressions: Analysis, Acquisition and Treatment.
" WS-8 Linguistic Annotation: Getting the Model Right.
" WS-9 Workshop on Patent Corpus Processing.
" WS-10 Towards a Resources Information Infrastructure.
" 2003 Conference on Empirical Methods in Natural
Language Processing (EMNLP 2003), Sapporo, Japan, July
11-12, 2003.
Website :
" JHU Summer Workshop on Language Engineering,
Baltimore, Maryland, USA, July 14 - August 22, 2003.
Website : http:// workshops
! International Conference on Information Processing and
Resources on the Internet (“Tamil Internet 2003”),
Chennai, Tamil Nadu, India, July 31 - August 3, 2003.
Website : http:// ti2003/
" The XVIth International Conference on Historical
Linguistics, University of Copenhagen, Denmark,
August 11-15, 2003.
Website : http:// ichl2003
" Workshop on “Adaptation of Automatic Learning Methods
for Analytical and Inflectional Languages” (ALAF’03),
Vienna, Austria, August 18-22, 2003.
Website : http:// esslli03
" The 4th Celtic Linguistics Conference, Selwyn College,
University of Cambridge, UK, September 1-3, 2003.
Website : http:// privat~louisa/celtic/
" Eurospeech 2003 (Interspeech 2003), 8th
European Conference on Speech Communication &
Technology, September 1-4, 2003, Geneva, Switzerland.
Website : http:// eurospeech/
" Generative Approaches to Language Acquisition, Utrecht,
Netherlands, September 4-6, 2003.
Website : http:// conferences/Gala/m
" 36th Annual Meeting of the British Association for
Applied Linguistics, University of Leeds,
September 4-6, 2003.
Website : http:// baal03call.htm
" An International Conference on Text, Speech and
Dialogue, Czech Republic, September 8-11, 2003.
Website : http:// events/tsd2003/
" International Conference RANLP-2003 (Recent
Advances in Natural Language Processing), Borovets,
Bulgaria, September 10-12, 2003.
Website : http:// ranlp2003/
" LISA Workshops in Boston, Boston, U.S.A., September
12-18, 2003.
Website :
! 3-Day National Seminar on Telugu Language, Centre for ALTS,
Univ. of Hyderabad, Hyderabad, September 17-19, 2003.
Website : http:// announcement/ltt2.html
" Machine Translation Summit IX, New Orleans, USA,
September 23-28, 2003.
Website : http://
" Conference on Language Policy and Standardization,
Iceland, October 4, 2003.
Website : http:// Radstml
" 2nd International Semantic Web Conference,
Workshop on Human Language Technology for the
Semantic Web and Web Services, Sanibel Island,
Florida, USA, October 20-23, 2003.
Website : http:// conferences/iswc2003/
" 8th Annual LRC Conference, Localisation Research Centre,
University College Dublin, November 17-19, 2003.
Website : http://
" LangTech 2003 - The European Forum for Language Technology,
Paris, France, November 24-25, 2003.
Website : http://
" 6th International Conference on Asian Digital Libraries
(ICADL 2003), Kuala Lumpur, Malaysia, December
8-11, 2003.
Website : http:// ICADL2003/
! ICON-2003: International Conference on Natural
Language Processing, Central Institute of Indian Languages,
Manasagangotri, Mysore, December 18-21, 2003.
Website : http:// conferences/icon2003.html
Unicode Related Meetings & Events
" INCITS/CT22, Washington, DC, USA, August 18, 2003.
" UTC #96 / L2 #193, Pleasanton, CA, USA, August 25-28, 2003.
" IUC #24, Atlanta, GA, September 3-5, 2003.
" IRG #22, Guilin, China.
" SC22 Plenary, Oslo, Norway, September 15-19, 2003.
" Unicode Workshop, hosted by the Technology Development
for Indian Languages (TDIL) Programme of the Department
of Information Technology, Govt. of India, at Park Hotel,
New Delhi, India, September 24-26, 2003.
" WG20, hosted by Microsoft & The Unicode Consortium,
Mountain View, CA, USA, October 15-17, 2003.
" SC2/WG2 #44, hosted by Microsoft & The Unicode
Consortium, Mountain View, CA, USA, October 20-23, 2003.
" UTC #97 / L2 #194 Annual Members Meeting, hosted
by Johns Hopkins University, Baltimore, MD,
November 4-7, 2003.
Website : http:// unicode/timesens/ml
Note : ! Indicates conferences in India.
 “…Thank you very much for participating in the
Universal Library Million Book Project and for
hosting our delegation to India. I enjoyed the
presentations, discussions, collegiality, and food.
I was particularly moved by Dr. Om Vikas’s
research and spoke with him afterwards. He gave
me a copy of one of his papers. That paper and
the other publications you provided in the packet
are impressive. The brass Shiva you gave me as a
gift is in an honored place in my office. I
appreciate your generosity and contribution to
achieving noble goals. God bless you …”
-Denise A. Troll,
Carnegie Mellon University (U.S.A)
Website :
 “…Thank you very much for your letter dated
March 24, 2003. I am delighted with the
information contained therein concerning the
Unicode Consortium Committee meeting on
March 4-7, 2003 at Microsoft Center, Mountain
View, California.
My interest had always been to preserve and
develop the diversity that India represents in its
culture, languages, traditions and much else. That
is why I have always supported efforts that would lead
to possibilities in this direction. I am particularly
pleased that the initiatives you have taken over
the past quarter of a century, since the whole
programme relating to Indian languages,
computers and information technology was first
taken up in the Department of Electronics when
I was Secretary, are now flowering. We have come a
long way since then, with considerable success,
for which a lot of credit needs to be given to you.
I am delighted to have had the opportunity to be
supportive. My principal interest today is in the
UNL and the approach that this would be
complementary to a great deal of what has already
been done under the TDIL.
Thank you for sending me the “Annals of Indian
Language Computing” and VishwaBharat@tdil…”
-M.G.K. Menon, Eminent Scientist,
Dr. Vikram Sarabhai Distinguished
Prof. of ISRO, Dept. of Space
E-mail :
2. Reader’s Feedback
 “…Here in Bolivia, we are only recently reaching
sufficient agreement in an alliance of organisations
from civil society, universities, and the private
sector, to begin to plan a National Strategy for
Communications & Education. A vital element
in this, of course, is an indigenous presence, which
makes up over 70% of the Bolivian population,
and the strategy planning of multi-lingual
information processing.
In this light, we are interested to know :
(i) if there is some written documentation you
could send us on how you began your enormous
venture to transform Indian Information
(ii) and as we are still at the stage of preparing the
broad outlines of a planning strategy, we wonder
if you have any articles on your experience in
hindsight, particularly concerning the
management of priorities at this earliest stage.
At the same time, we wish to thank you for sending
us copies of your journal, Digital Unite
and Knowledge for All, which inspire us…”
-Juan de Dios Yapita,
Instituto de Lengua y Cultura Aymara
E-mail :
 “…I appreciate your work on “English to Hindi”
translation. I have tried using it for the first time
and it seems quite impressive. I cannot write hindi
properly so this would be a helpful tool (especially
to write to my parents). It’s done very well and
throws in exceptions very nicely. I was trying to
use your hindi editor and sometimes totally got
stuck while trying to find something. Also some
stuff is hard to delete (i.e. if I made a mistake).
Few more things that would make life more easy
would be to have a picture or a text file
explaining/showing the keyboard mapping to the
hindi fonts, and having an ability to get a hardcopy
of the typed stuff. Thanks again, Cheers…”
-Ashutosh Jha, Osellus Inc. Toronto
E-mail :
3. TDIL Vision
TDIL Vision 2010
(TDIL : Technology Development for Indian Languages)
A B C Technology Development Phases
India has long been aware of technological changes and
local constraints. Development of Language
Technology in India may be categorized in three phases:
q 1976-1990 : A-Technology Phase: Focus was on
Adaptation Technologies; abstraction of requisite
technological designs and competence building in
R&D institutions.
q 1991-2000 : B-Technology Phase: Focus was on
developing Basic Technologies - generic information
processing tools, interface technologies and cross-
compatibility conversion utilities. TDIL (Technology
Development for Indian Languages) programme was
initiated.
q 2001-2010 : C-Technology Phase: Focus is on
developing Creative Technologies in the context of
convergence of computing, communication and
content technologies. Collaborative technology
development is being encouraged.
Vision statement
Digital unite and knowledge for all.
Mission statement
Communicating & moving up the knowledge chain,
overcoming the language barrier.
q To develop information processing tools to facilitate
human machine interaction in Indian languages and
to create and access multilingual knowledge resources/
content.
q To promote the use of information processing tools
for language studies and research.
q To consolidate technologies thus developed for Indian
languages and integrate these to develop innovative
user products and services.
Major Initiatives
q Knowledge Resources
(Parallel Corpora, Multilingual Libraries/Dictionaries,
Wordnets, Ontologies)
q Knowledge Tools
(Portals, Language Processing Tools, Web based Tools)
q Translation Support Systems
(Machine Translation, Multilingual Information Access,
Cross Lingual Information Retrieval)
q Human Machine Interface Systems
(Optical Character Recognition Systems, Voice
Recognition Systems, Text-to-Speech Systems)
q Localization
(Adapting IT Tools and solutions in Indian Languages)
q Language Technology Human Resource Development
(Manpower Development in Natural Language Processing)
q Standardization
(ISCII, Unicode, XML, INSFOC, etc.)
TDIL Programme Goals
Short Term Goals
q Standardization of code, font, keyboard etc.
q Fonts and basic software utilities in public domain.
q Corpora creation and analysis.
q Content creation tools.
q Language Technology to be integrated into IT curricula.
q Collaborative development of Indian language lexical
resources.
q Writing aids (spell checkers, grammar checkers and text
summarization utilities).
q Sharing of standardized lexware & development of
lexware tools.
q Training programs on ILT awareness, lexware
development, and computational linguistics.
Medium Term Goals
q Indian language speech databases.
q Multilingual, multimedia content development with
semantic indexing; classical, multi-font and
decorative fonts; off-line/on-line OCR.
q Cross lingual information retrieval (CLIR) tools.
q Human speech encoding.
q Speech Engine : speech recognition, specific speech
I/O.
q Indian language support on Internet appliances.
q Understanding and acquisition of languages,
knowledge representation, gisting and interfacing.
q Distinguished achievement awards at M.Tech/MCA/
Ph.D. level in Indian Language Technologies.
q Machine aided translation: English to Indian languages,
among Indian languages, Indian languages to English
and other foreign languages.
q On line rapid translation, gisting and summarization.
Long Term Goals
q Speech-to-speech translation.
q Human Inspiring Systems.
Language Technology Map
Resource Centres & CoIL-Net Centres
for Indian Language Technology Solutions
Resource Centres
MSU (B) = MS University, Baroda (Gujarati)
AU (C) = Anna University, Chennai (Tamil)
UU (B) = Utkal University, Bhubaneswar (Oriya)
ISI (K) = Indian Statistical Institute, Kolkata (Bengali)
JNU (D) = Jawaharlal Nehru University, New Delhi
(Sanskrit, Japanese, Chinese)
UOH (H) = University of Hyderabad, Hyderabad (Telugu)
IISc (B) = Indian Institute of Science, Bangalore (Kannada)
IIT (K) = Indian Institute of Technology, Kanpur
(Hindi & Nepali)
IIT (M) = Indian Institute of Technology, Mumbai
(Marathi & Konkani)
IIT (G) = Indian Institute of Technology, Guwahati
(Assamese, Manipuri)
CDAC (T) = Electronic Research & Development
Centre of India, Thiruvananthapuram (Malayalam)
CDAC (P) = Centre for Development of Advanced
Computing, Pune (Urdu, Sindhi, Kashmiri)
OCAC (B) = Orissa Computer Application Centre,
Bhubaneswar (Oriya)
TIET (P) = Thapar Institute of Engineering &
Technology, Patiala (Punjabi)
CoIL-Net Centres
CDAC (P) = Centre for
Development of Advanced
Computing, Pune
BIT (R) = Birla Institute of
Technology, Ranchi
IIT (K) = Indian Institute of
Technology, Kanpur
IIITM (G) = Indian Institute
of Information Technology &
Management, Gwalior
BV (B) = Banasthali
Vidyapeeth, Banasthali
BHU (V) = Banaras Hindu
University, Varanasi
IIT (R) = Indian Institute of
Technology, Roorkee
IGNCA (ND) = Indira Gandhi
National Centre for the Arts,
New Delhi
GGU (B) = Guru Ghasidas
University, Bilaspur
Technology Development for Indian
Languages Programme (TDIL)
-An Overview
1. Need for the Programme
The world is in the midst of a technological
revolution nucleated around Information and
Communication Technology (ICT). Advances in
Human Language Technology will offer nearly
universal access to information and services for more
and more people in their own language. Today 80%
of the content on the Web is in English, which is
spoken by only 8% of the world population and
only 5% of the Indian population. In a multilingual
country like India, with 18 official languages & 10
scripts, it is essential that information processing and
translation software be developed in local
languages and made available at low cost for wider
proliferation of ICT, to benefit the people at large,
pave the way towards “Digital Unite and
Knowledge for All” and arrest the sprawling Digital
Divide.
2. Impediments
q Lack of industry involvement due to constrained
q Sub-critical and un-sustained demand in States;
q Negligible software tools and re-usable
components in public domain;
q Plurality of internal codes: ISCII-88, ISCII-91,
UNICODE, and other proprietary codes;
q Content glyph-coded, not (ISCII) character-
coded;
q No font standard;
q Non-availability of human resources in the
domain of Language Technology, in both
Computational Linguistics (CL) and Knowledge
Engineering (KE);
q Reluctance of Academia to focus on productising
the R&D.
3. TDIL Mission
In this context, a number of initiatives have been
taken towards development of software, tools and
human-machine interface systems in Indian
languages under the TDIL programme.
Digital unite and knowledge for all.
Communicating without language barrier & moving
up the knowledge chain.
q To develop information processing tools to
facilitate human machine interaction in Indian
languages and to create and access multilingual
knowledge resources/content.
q To promote the use of information processing
tools for language studies and research.
q To consolidate technologies thus developed for
Indian languages and integrate these to develop
innovative user products and services.
3.1 Focus Areas
Translation Systems
• Machine Aided Translation system (MAT)
• Parallel corpora, lexware, multilingual
dictionaries, Wordnet, speech databases
• Speech-to-Speech Translation system
Human Machine Interface systems
• Optical Character Recognition
• Speech Recognition
• Text to Speech system
Language Processing and Web Tools
• Word processors, spell checkers, converters and web tools
Localisation
• Adapting IT tools and solutions in Indian languages
Standardisation
• UNICODE, XML, ISCII (Indian Standard
Code for Information Interchange), INSFOC
(Indian Standard Font Code), INSROT (Indian
Script to Roman Transliteration), Standard for
Lexware format, etc.
Evaluation and Benchmarking
Evaluation and testing of the prototype
technologies against benchmarks, product
compliance testing against specifications and
validation of standards.
4. Technology Development Phases
Development of Language Technology in India may
be categorized in three phases:
1976-1990 : A-Technology Phase
Focus was on Adaptation Technologies; abstraction
of requisite technological designs and competence
building in R&D institutions.
1991-2000 : B-Technology Phase
Focus was on developing Basic Technologies - generic
information processing tools, interface technologies
and cross-compatibility conversion utilities. TDIL
(Technology Development for Indian Languages)
programme was initiated.
2001-2010 : C-Technology Phase
Focus is on developing Creative Technologies in the
context of convergence of computing,
communication and content technologies.
Collaborative technology development is being
encouraged.
5. Program Management Structure
All the above activities are being implemented
through various Resource Centres and Localization
(CoIL-Net) Centres. The programme is technically
monitored by the TDIL Working Group,
consisting of experts drawn from academia, industry,
R&D organizations and Government. The Resource
Centres are divided into regional clusters for peer
review of the progress of the technologies and
solutions developed by them.
Resource Centres
The Department has thirteen Resource Centres for
Indian Language Technology Solutions at various
educational institutes and R&D organizations, which
have developed several important tools and
technologies for Indian language support.
CoIL-Net (Content Creation & IT Localisation Network)
The CoIL-Net programme has been formulated with
a vision of all-pervasive socio-economic
development: proliferating the use of language-
specific IT-based content, solutions and applications,
bringing the benefits of the IT revolution to the
common citizen, driving requisite technology
development, and consequently bridging the existing
digital divide in the Hindi-speaking states of India.
The objective is to boost IT-localization-based socio-
economic development and help bridge the existing
digital divide by appreciably improving IT
penetration and awareness levels, using Hindi as the
medium of delivery, in the Hindi-speaking states of
MP, Chhattisgarh, UP, Uttaranchal, Bihar, Jharkhand,
and Rajasthan.
The MAIT Consortium on Innovation & Language
Technology (CoIL-Tech) has, since its inception in
September 2001, been actively coordinating
various activities with the industry and the TDIL
(Technology Development for Indian Languages)
programme of the Department of Information Technology,
Ministry of Communications & Information Technology.
The consortium today has active participation from
both Indian and MNC companies, with a focus on
promoting industry participation in collaborative
R&D in language technology, coordinating Open
Source Software support for Indian languages,
evolving consensus on standards, benchmarks, and
certification of LT products, collectively interfacing
with government and academia, conducting market
surveys, organizing technology shows, promoting
technology transfers and expanding markets collectively.
6. Major Achievements of the TDIL Programme
Optical Character Recognition (OCR)
Optical Character Recognition is an indispensable
tool for digitizing content and is essential for the
development of knowledge networks such as
digital libraries. OCR technology offers the
facility to scan and store printed text. There are
three essential elements to OCR technology:
scanning, recognition and then reading the text.
Initially, a camera scans a printed document. OCR
software then converts the images into recognized
characters and words. The OCRs developed are
being tested and benchmarked independently.
OCR with more than 97% accuracy has been
developed for seven Indian languages, viz. Hindi,
Marathi, Bangla, Tamil, Telugu, Punjabi,
The OCR technologies for Assamese, Oriya,
Malayalam and Gujarati scripts are in the
advanced stages of development.
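The scan-recognise-read flow described above can be illustrated with a deliberately tiny sketch: each “scanned” glyph is a small bitmap matched against stored templates. The 3x5 templates and the pixel-agreement rule here are invented for illustration only; the OCRs developed under TDIL use far richer statistical recognisers.

```python
# Toy illustration of the OCR pipeline: a "scanned" page is a list of
# 3x5 bitmaps; recognition matches each bitmap against stored templates,
# and "reading" concatenates the recognised characters into text.
# The two glyph templates below are invented for illustration.

TEMPLATES = {
    "1": ("010", "110", "010", "010", "111"),
    "7": ("111", "001", "010", "010", "010"),
}

def recognise_glyph(bitmap):
    """Return the best-matching character by counting agreeing pixels."""
    def score(tpl):
        return sum(p == q for row, trow in zip(bitmap, tpl)
                          for p, q in zip(row, trow))
    return max(TEMPLATES, key=lambda ch: score(TEMPLATES[ch]))

def read_page(glyph_bitmaps):
    """'Reading': concatenate the recognised characters into text."""
    return "".join(recognise_glyph(b) for b in glyph_bitmaps)

page = [TEMPLATES["1"], TEMPLATES["7"]]   # a perfectly "scanned" page
print(read_page(page))                    # -> 17
```

Real recognisers score noisy bitmaps against learned models rather than exact templates, but the three-stage structure is the same.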
Spell Checkers
A spell checker is a language-specific software
component used in word processing, OCR post-
processing and text processing applications. Spell
checkers for Hindi, Marathi, Bangla, Tamil,
Telugu, Punjabi, and Malayalam have been
developed. Spell checkers for other Indian
languages are in advanced stages of development.
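At its core, the suggestion step of such a spell checker reduces to edit-distance search over a lexicon: flag words missing from the word list and offer the nearest entries. A minimal, language-independent sketch (the three-word lexicon is illustrative, not a TDIL word list):

```python
# Minimal spell-check sketch: flag words missing from a lexicon and
# suggest the closest entries by Levenshtein (edit) distance.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, max_dist=2):
    """Return lexicon entries within max_dist edits, nearest first."""
    scored = [(edit_distance(word, w), w) for w in lexicon]
    return [w for d, w in sorted(scored) if d <= max_dist]

lexicon = {"bhasha", "shabda", "vartani"}   # illustrative entries only
print(suggest("bhassha", lexicon))          # -> ['bhasha']
```

Production spell checkers add morphological analysis on top of this, which matters greatly for inflection-rich Indian languages.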
Machine Aided Translation System (MAT)
A MAT system provides a rough translation from a
source language (say, English) into a target
language (say, Hindi). This output may be post-
edited with specially designed tools to enhance
translation accuracy.
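The rough-translation-plus-post-editing workflow can be sketched as a glossary-driven substitution pass followed by a human correction step. The tiny English-Hindi glossary below is invented for illustration and bears no relation to how Anglabharati actually analyses text:

```python
# Sketch of machine-aided translation: a crude word-for-word pass
# produces a draft, and post-editing fixes what the machine got wrong.
# The English->Hindi glossary is illustrative only.

GLOSSARY = {"india": "भारत", "is": "है", "great": "महान"}

def rough_translate(sentence):
    """Word-for-word draft; unknown words are kept and marked."""
    out = []
    for w in sentence.lower().split():
        out.append(GLOSSARY.get(w, f"<{w}>"))   # <...> flags gaps
    return " ".join(out)

def post_edit(draft, corrections):
    """Apply human corrections (draft fragment -> fixed fragment)."""
    for wrong, right in corrections.items():
        draft = draft.replace(wrong, right)
    return draft

draft = rough_translate("India is great")
print(draft)                                     # -> भारत है महान
print(post_edit(draft, {"है महान": "महान है"}))    # word order fixed
```

Even this toy shows why post-editing tools matter: word-for-word drafts get English word order, which the human (or a reordering rule) must repair.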
Anglabharati, an English to Hindi Translation
Support System, has been developed by IIT
Kanpur (Resource Centre for Indian Language
Technology Solutions in Hindi and Nepali) and is
available in the public domain at http://
A text-to-speech synthesis utility has also been
integrated with Anglabharati and works on the
Linux platform.
Machine Translation systems for Indian
languages to Hindi and Hindi to English are
under development.
Indian Language Support on the LINUX Operating
System (INDIX)
Localisation of the Linux operating system at the
X-Windows level has been done. The INDIX
(localised Linux) operating system has been
developed by NCST.
It is now possible on Linux to give file names and
domain names in Devanagari. Indian language
support on Linux in the open domain will also ensure
Indian language support on all Open Source
tool kits. INDIX supports the GNOME desktop
environment (of Linux), enabling applications
to create, edit and display content in Indian
languages. R&D for supporting other Indian
languages has been initiated.
Unicode standards are widely used by industry for the development of multilingual software. Unicode Standard 3.0 includes standard code sets for Indian scripts based on the ISCII-1988 document. Some modifications need to be incorporated in the Unicode Standard for adequate representation of Indian scripts.
The Department of Information Technology, Ministry of Communications & IT, is the voting member of the Unicode Consortium. Proposed changes to the existing Unicode standards have been finalized in consultation with the respective State Governments and the Indian IT industry, and have been published in the TDIL newsletter.
The Indian Scripts Font Code (INSFOC), intended to overcome the problem of interoperability among software products developed by different vendors, has been completed for Hindi, Malayalam, Gurmukhi and Gujarati; work is in progress for the other Indian languages. An "Indian Script to Romanization Table (INSROT)" has been worked out to facilitate non-Hindi users working in Hindi. Efforts are also on to standardise the Indian Lexware format.
Parallel Corpora (Gyan Nidhi)
A parallel corpus is an important input for a MAT system. The aim is to develop a parallel corpus of one million pages; so far, 450,000 pages of parallel corpora have been developed (as on May 31, 2003). A graphical user interface for viewing the multiple-language corpora has also been developed.
Bi-lingual Dictionaries
Bilingual electronic dictionaries are a basic linguistic resource, useful for NLP-related research, development and testing, and in the development of word processors in Indian languages. English-Hindi, English-Telugu, English-Tamil, English-Kannada, English-Bangla, Bangla-Bangla, English-Punjabi, English-Oriya and English-Malayalam bilingual dictionaries have been developed by the RCs.
IT Localisation Clinic
Workshops were organised by DIT during November 2002 to accelerate project deliverables such as content development in Hindi, creation of a repository of language-technology products, trainers' training programmes, IT localisation clinics, IT curricula for schools, website development in Hindi, and installation of test beds in the domains of e-governance (land records, Agri-Net, query systems), e-education (schools, ITIs), e-business (small business), e-tourism and e-health.
TDIL Portal
The TDIL website is bilingual (English and Hindi) and contains information about the TDIL Programme, its initiatives and achievements. It provides access to Indian scriptures, standards (Indian scripts, keyboard layout, font layout, etc.), articles and reviews. The unique services provided by the website include e-mail in Hindi and online machine translation from English to Hindi and vice versa.
The website also provides downloadable software and tools in Indian languages, viz. plug-ins such as Akshar for Windows (a plug-in for MS Word), Shrilipi Bharti (a plug-in with a keyboard driver having Devanagari fonts), Indian-language word processors, NLP tools and NLP resources for Windows/Linux.
Quarterly on Indian Language Technologies
VishwaBharat@tdil is a quarterly newsletter on Indian language technologies which consolidates in one place information about products, tools, services, activities, developments and achievements in the area of Indian-language software. It serves as a means of sharing ideas among technology developers and creates awareness in society of the availability of language-technology resources. It is issued quarterly and widely circulated.
Eight issues (Jan 2001, May 2001, Sept 2001, Jan 2002, May 2002, July 2002, October 2002 and Jan 2003) have been published. All issues are accessible through the TDIL website and are available on a single CD.
7. Targets for 2003-04
• Integration and deployment of OCR, TTS and document-processing technologies
• Initiation of development of machine-aided translation systems between English and Indian languages
• Linux operating system enabled with Indian-language support
• Enforcing Unicode revision and Vedic code
• Initiation of the Intelligent Cognitive System (KUNDALINI) project for bringing synergy between traditional Sanskrit Shastras and Information Technology
• Cross-lingual information retrieval for universal access
• IT localization clinics for incubation of IT
• 16-bit Unicode-compliant Hindi fonts for public use
8. New Projects Initiated
Intelligent Cognitive System
A new project in the domain of intelligent cognitive systems, named Knowledge UNDerstanding & Acquisition of Languages, INferencing and Interpretation (KUNDALINI), is being initiated with the following objectives:
• To develop methodologies and tools for knowledge representation, extraction, mining, gisting, inferencing and interpretation
• To develop knowledge frameworks and access mechanisms based on Indian tradition
• To develop a Sanskrit-based Networking Language (a concept-based networking language) as a machine-translation interlingua
Gyanaudyog
Encouraged by the success of the Gyanaudyog
Workshop held on April 7 –9, 2003 a new project
Gyanaudyog is being initiated to promote Small
Office & Home Entrepreneurship in the area of
In for mat ion Techn ology for cat alyzin g I T
enabled services (ITES), specifically, Cont ent
Creat ion, Cont ent localizat ion & applicat ion
soft war e locali zat i on , Remot e cu st omer
int eract ion Services, Comput er Aided Design
with support for Technology Mentoring, Financial
Support guidance and Market information.
Anusrijan
There is an addition of over 25 million pages in R&D in the field of Science & Technology. It is planned to prepare books/monographs in local languages on emerging areas of Information & Communications Technology (ICT). These will be available for translation into other Indian languages so as to create awareness about recent developments and to promote innovation and entrepreneurial aptitude.
SOHE - Ganak Bharti
A new project named Ganak Bharti is being initiated to promote the development and deployment of low-cost PCs with Indian-language support and open-source software for small businessmen, women at home and children, aiming at catalyzing the growth of the Small Office, Home and Education (SOHE) environment.
Evaluation and Benchmarking of the Indian
Language Technology Tools
Testing, evaluation and benchmarking of Indian language technology tools are essential for wider acceptance of Indian language technology products. In view of this, the Standardization Testing and Quality Certification (STQC) division of DIT has been designated as the third party for evaluation of the language-technology tools and products developed under the TDIL programme, as per international standards.
9. Digital Library Initiative of India
Digital libraries are a form of information technology in which social impact matters as much as technological advancement. Future knowledge networks will rely on scalable semantics, automatically indexing community collections so that users can effectively search within an Interspace of a billion repositories. Just as the transmission networks of the Internet are connected via switching machines that switch packets, the knowledge networks of the Interspace will be connected via switching machines that switch concepts. Connectivity and training continue to be the principal barriers to integrating the global network of libraries.
Among a number of major DL initiatives in the USA is the Universal Digital Library (UDL) project, which aims at creating a free-to-read, searchable collection of one million books, primarily in the English language, available to everyone over the Internet. The project involves participation from many countries, including the USA, Australia, India, China, Egypt and Sri Lanka. The funding agency is the National Science Foundation (NSF), USA, with funding of US $4.0 million; the contribution from various industries of the USA is US $6.0 million. The overall coordination of the project is done by Carnegie Mellon University.
India participates in the UDL project and makes efforts to put e-content of the vast Indian knowledge base in Indian languages as far as possible. The Indian Digital Library activity is coordinated by the Indian Institute of Science, Bangalore.
9.1 Objectives
• To digitize and index the heritage knowledge.
• To provide a test bed that will support other research domains such as scanning techniques, optical character recognition, etc.
• Inter-ministerial collaboration for e-content.
• To promote lifelong learning in society (a necessity of the knowledge-based society).
• To promote collaborative creativity and the building of knowledge teams across borders.
• Involvement of the Resource Centres for Indian Language Technology Solutions and COILNet Centres to digitize and web-enable content in Indian languages.
9.2 Participating Institutions of UDL program
• Indian Institute of Science, Bangalore (overall coordination)
• Anna University, Chennai, Tamil Nadu
• Arulmigu Kalasalingam College of Engineering, Madurai, Tamil Nadu
• Goa University, Goa
• Indian Institute of Information Technology, Allahabad, Uttar Pradesh
• City and State Central Library, Andhra Pradesh
• Indian Institute of Information Technology, Hyderabad, Andhra Pradesh
• Shanmugha Arts, Science, Technology & Research Academy, Thanjavur, Tamil Nadu
• Sri Sri Sharda Peetam, Sringeri Mutt, Sringeri
• Tirumala Tirupati Devasthanam, Tirupati
• Maharashtra Industrial Development Corporation, Mumbai, Maharashtra
• University of Pune
• Centre for Development of Advanced Computing, Noida
• Kanchi University, Kanchi, Tamil Nadu
• Indian Institute of Astrophysics, Karnataka
• Indira Gandhi National Centre for the Arts, New Delhi
• Rashtrapati Bhavan, New Delhi
• Punjab Technical University, Punjab
9.3 Targets for 2003-04
• Development and deployment of an on-line multilingual machine translation system, OCRs in Indian languages and the Universal Dictionary Programme through the TDIL programme.
• Establishment of Regional Mega Centres at institutions with proven capability of scanning at least 5,000 pages per day. The Mega Centres may be created with the State governments' initiative for meeting operational costs.
• Upgradation of the connectivity of the Mega Centres for scanning to 2 Mb/s.
• Creation of a "Digital Library Act" - provision for tax deductions etc. for the purpose of providing digital versions on the web for public benefit.
• Creation of 4C (Consortium for Compensation for Creative content).
• Evolution of a project-management mechanism and a productivity-measurement strategy for the Digital Library Centres, and criteria for setting up Mega Scanning Centres.
• Setting up of an inter-ministerial committee to integrate Digital Library efforts in India.
• Evolution of a National Mission on Digital Library in consultation with Prof. M.G.K. Menon.
• Participation with a lead role in:
− WSIS (World Summit on the Information Society), organized by ITU at Geneva in Dec 2003
− UN Decade of Literacy (2003-2013)
− TWA (Third World Academy)
• Promotion of spin-off technology from the Digital Library programme.
- TDIL Programme Team
Resource Centre For
Indian Language Technology Solutions – Hindi, Nepali
Indian Institute of Technology Kanpur
Achievements
Department of Computer Science & Engineering
Indian Institute of Technology Kanpur-208016 India
Tel.: 00-91-512-2590762 E-mail :
Website : http://
http:// users/langtech
RCILTS-Hindi, Nepali
Indian Institute of Technology, Kanpur
In 1995, the Department of Electronics, Govt. of India, sanctioned a grant-in-aid for implementation of the project titled "Machine Aided Translation from English to Hindi for standard documents (domain of Public Health Campaign) based on ANGLABHARTI approach", for which ER&DCI (with its office at Lucknow, since moved to NOIDA) was associated for implementation and commercialization of this software on a PC platform in the domain of public health campaigns. The ANGLABHARTI software already developed by IITK on a SUN system was used in this project and was re-engineered on a PC under Linux jointly by IITK and ER&DCI, under the supervision of IITK (R.M.K. Sinha, Ajai Jain). In 1996, IITK also designed and developed an example-based approach for machine-aided translation between similar (Indian) languages and dissimilar languages (English and Indian languages), under the leadership of Professor R.M.K. Sinha. This approach has been named the ANUBHARTI approach. A system to translate from Hindi to English based on the ANUBHARTI approach has been implemented by IITK (R.M.K. Sinha, Ajai Jain and Renu Jain).
Currently, AnglaHindi, the English-to-Hindi MAT based on the Anglabharti methodology, which accepts unconstrained text, has already been made available to users and has been very well received. AnglaUrdu, which is based on AnglaHindi, has also been demonstrated. HindiAngla, the Hindi-to-English MAT based on the Anubharti methodology, has been demonstrated for simple sentences, and further work is going on to handle compound and complex sentences. Current research at IITK is focused on the development of more efficient machine-translation strategies with user-friendly interfaces for these systems. Another dimension of diversification for the future is to cater to all the other Indian languages by implementing AnglaSanskrit, AnglaBangala, AnglaPunjabi, and so on; SanskritAngla, BangalaAngla, PunjabiAngla, and so on; and HindiSanskrit, HindiBangala, and so on; based on hybridization of the Anglabharti and Anubharti approaches.
1. Machine Translation
Chief Investigator: Dr. R.M.K. Sinha
Co-Investigator: Dr. A. Jain
Our work on machine translation started in the early eighties, when we proposed using Sanskrit as an interlingua for translation to and from Indian languages (see the paper "Computer processing of Indian languages and scripts - Potentialities and Problems", Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984). This was further elaborated in a CPAL-1 paper presented at Bangkok in 1989.
Later, in 1991, the concept of a Pseudo-Interlingua was developed which exploited the structural commonality of a group of languages. This concept has been used in the development of a machine-aided translation methodology named ANGLABHARTI for translation from English to Indian languages. Anglabharti is a pattern-directed, rule-based system with a context-free-grammar-like structure for English (the source language). It generates a 'pseudo-target' (Pseudo-Interlingua) applicable to a group of Indian (target) languages, such as the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujarati, etc.), the Dravidian family (Tamil, Telugu, Kannada and Malayalam) and others. A set of rules obtained through corpus analysis is used to identify plausible constituents with respect
to which movement rules for the 'pseudo-target' are constructed. Within each group the languages exhibit a high degree of structural homogeneity, and we exploit this similarity to a great extent in our system. A language-specific text generator converts the 'pseudo-target' code into target-language text. A Paninian framework based on Sanskrit grammar, using the Karak (similar to case) relationship, provides a uniform way of designing the Indian-language text generators. We also use an example base to identify noun and verb phrasals and resolve their semantics. An attempt is made to resolve most of the ambiguities using ontology, syntactic and semantic tags and some pragmatic rules; the unresolved ambiguities are left for human post-editing. The major design considerations of Anglabharti have been: a practical aid for translation in which about 90% of the task is done by the machine and 10% is left to human post-editing; a system which can grow incrementally to handle more complex situations; a uniform mechanism for translation from English to a majority of Indian languages through attachment of the appropriate text-generator modules; and a human-engineered man-machine interface to facilitate both usage and augmentation. The translation system has also been interfaced with a text-to-speech module and OCR.
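The pseudo-interlingua pipeline described above can be sketched as follows. This is a minimal illustration, not the actual Anglabharti implementation: the constituent tags, the single movement rule and the two tiny lexicons are all invented for illustration.

```python
# Sketch: one structural pass produces a language-neutral 'pseudo-target',
# and small per-language generators then produce text for each target.

# Movement rule: English S-V-O becomes the S-O-V order shared by the
# target-language group (Indo-Aryan, Dravidian).
def to_pseudo_target(tagged):
    subj = [w for w, t in tagged if t == "SUBJ"]
    obj = [w for w, t in tagged if t == "OBJ"]
    verb = [w for w, t in tagged if t == "VERB"]
    return subj + obj + verb  # pseudo-target constituent order

# Language-specific text generators: only lexical substitution differs,
# because the structural work was done once at the pseudo-target stage.
LEXICON = {
    "hi": {"boy": "ladka", "water": "pani", "drinks": "pita hai"},
    "ta": {"boy": "paiyan", "water": "tanneer", "drinks": "kudikkiraan"},
}

def generate(pseudo, lang):
    return " ".join(LEXICON[lang].get(w, w) for w in pseudo)

tagged = [("boy", "SUBJ"), ("drinks", "VERB"), ("water", "OBJ")]
pseudo = to_pseudo_target(tagged)
print(generate(pseudo, "hi"))  # boy-water-drinks order, Hindi lexicon
print(generate(pseudo, "ta"))  # same pseudo-target, Tamil lexicon
```

The point of the design is visible even at this scale: adding one more target language means adding one lexicon/generator, not another English analysis stage.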
This project also received funding from the TDIL programme of the Govt. of India during 1995-97.
The English-to-Hindi version of the Anglabharti machine-aided translation system, named AnglaHindi, has been web-enabled and is available at URL: http://. The technical know-how of this technology has been transferred on a non-exclusive basis to ER&DCI/CDAC Noida for commercialization.
A system for translating English to Urdu, named
AnglaUrdu, has also been developed using our
AnglaHindi system and Urdu display software of
CDAC, Pune.
In 1995, we developed another, example-based, approach for MT. Here a pre-stored example base forms the basis for translation: the translation is obtained by matching the input sentence with the minimum-'distance' example sentence. In our approach, we do not store the examples in raw form; the examples are abstracted to contain category/class information to a great extent. This makes the example base smaller in size, and further partitioning reduces the search space. The creation and growth of the example base are also done in an interactive way. This methodology, named ANUBHARTI, has been used for Hindi-to-English translation; further details of the approach can be seen in the Ph.D. thesis of Renu Jain.
The Anubharti approach works more efficiently for similar languages, such as among Indian languages. In such cases the word order remains the same and one need not have pointers to establish word correspondences.
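The minimum-'distance' matching over abstracted examples can be sketched as follows. This is only an illustration of the idea, not the Anubharti code: the category names, target templates and distance measure are all assumptions.

```python
# Sketch of example-based matching: examples are stored in abstracted form
# (word classes rather than raw words), and the input is matched to the
# example at minimum distance.
from difflib import SequenceMatcher

# Abstracted example base: category sequences paired with target templates.
EXAMPLE_BASE = [
    (("PERSON", "PLACE", "GO-PAST"), "<PERSON> went to <PLACE>"),
    (("PERSON", "OBJECT", "EAT-PRES"), "<PERSON> eats <OBJECT>"),
]

def distance(a, b):
    # 1 minus the similarity ratio serves as a simple sequence 'distance'.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def best_example(input_categories):
    # Pick the stored example whose abstracted form is closest to the input.
    return min(EXAMPLE_BASE, key=lambda ex: distance(ex[0], input_categories))

cats = ("PERSON", "PLACE", "GO-PAST")
print(best_example(cats)[1])  # template of the minimum-distance example
```

Because examples hold classes rather than raw words, one stored example covers many input sentences, which is why the abstraction keeps the example base small.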
Currently, we are working towards developing an integrated machine-aided translation system (with funding from the TDIL programme of the Govt. of India, 2003 onwards), hybridizing the rule-based approach of Anglabharti, the example-based approach of Anubharti and corpus/statistical approaches to get the best out of each. This hybrid is also being explored for use as the translation engine of a speech-to-speech translation system.
In parallel, we are also developing a MAT system for Hindi-to-English translation, HindiAngla, based on our Anubharti approach, with funding from the COILNET project of the Govt. of India (2001 onwards). AnglaHindi and HindiAngla have been used to demonstrate two-way reverse translation for simple sentences.
2. Speech to Speech Translation
Speech-to-speech (S2S) translation requires a tight coupling of the automatic speech recognition (ASR) module, the MT module and the target-language text-to-speech (TTS) module. A mere interfacing of ASR, MT and TTS modules does not yield acceptable S2S translation. S2S requires an integration of these modules such that the hypotheses are cross-verified and appropriate parameters are generated. In our environment, it has to cater to bilingual (Hindi mixed with English) speech with commonly encountered Indian accent variations. The MT also needs to be a chunk translator with multiple translation engines. Our investigations are directed at domain-specific applications in the Indian environment.
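One simple way to picture the difference between mere interfacing and integration is hypothesis rescoring: instead of handing only the ASR 1-best to MT, the ASR n-best list is rescored with an MT-side confidence, so the modules cross-verify each other. The sketch below is hypothetical (the scores, weight and function are invented), not the system described above.

```python
# Sketch: jointly score ASR hypotheses with an MT-side confidence, so a
# lower-ranked ASR hypothesis that translates well can win.

def pick_hypothesis(nbest, mt_confidence, weight=0.5):
    # Combine the ASR score and the MT confidence for each hypothesis.
    return max(nbest, key=lambda h: (1 - weight) * h[1] + weight * mt_confidence(h[0]))

# Hypothetical ASR n-best list: (transcript, ASR score).
nbest = [("write their", 0.60), ("right there", 0.55)]

# Hypothetical MT confidence: how well the chunk translator handles each text.
def mt_confidence(text):
    return {"write their": 0.2, "right there": 0.9}[text]

best = pick_hypothesis(nbest, mt_confidence)
print(best[0])  # the ASR 2-best wins once MT confidence is taken into account
```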
Some Relevant Publications
• R.M.K. Sinha, 'Towards Speech to Speech Translation', keynote presentation at the Symposium on Translation Support Systems (STRANS2002), March 15-17, 2002, Kanpur, India.
3. Lexical Knowledge-Base Development
The lexical knowledge base is the fuel of the translation engine. It contains various details for each word in the source language, such as its syntactic categories, possible senses, keys to disambiguate those senses, corresponding words in the target languages, ontology and WordNet information/linkages. We are also working towards the development of an Indian-language WordNet, named ShabdKalpTaru, in association with Dr. Om Vikas and Dr. Pushpak Bhattacharya.
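The kind of per-word record described above might be sketched as a simple data structure. The field names and sample values below are invented for illustration and do not reflect the actual IITK lexicon format.

```python
# Sketch of a lexical knowledge-base entry: one record per source word,
# holding categories, senses, disambiguation keys, target equivalents,
# ontology and wordnet linkages.
from dataclasses import dataclass, field

@dataclass
class LexEntry:
    word: str                                    # source-language word
    categories: list                             # syntactic categories
    senses: dict = field(default_factory=dict)   # sense -> disambiguation keys
    targets: dict = field(default_factory=dict)  # language -> sense -> word
    ontology: str = ""                           # ontological class
    wordnet: list = field(default_factory=list)  # linkages to a wordnet

bank = LexEntry(
    word="bank",
    categories=["noun"],
    senses={"finance": ["money", "account"], "river": ["water", "shore"]},
    targets={"hi": {"finance": "bank", "river": "kinara"}},
    ontology="PLACE",
)
# A disambiguation key seen in context selects the sense, and the sense
# selects the target-language word.
print(bank.targets["hi"]["river"])
```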
Some Relevant Publications
• Renu Jain and R.M.K. Sinha, 'On Multi-lingual Dictionary Design', Symposium on Machine Aids for Translation and Communication (SMATAC96), New Delhi, 1996.
• R.M.K. Sinha, K. Sivaraman, Aditi Agrawal, T. Suresh and C. Sanyal, 'On logical design of multi-lingual lexicon for machine translation', Technical Report TRCS-93-174, Department of Computer Science and Engineering, IIT Kanpur, 1993.
4. Optical Character Recognition
Work on Devanagari OCR was carried out under a TDIL, Govt. of India, sponsored project named DEVDRISHTI, on recognition of hand-printed Devanagari script. The investigations were carried out on developing new features and on integrating decision-making, taking into account large variations in shape. Further, an automated training strategy for the construction of prototypes and confusion matrices from true ISCII files was developed. This had to be very distinct from its Roman counterpart, because of the script composition involved in the case of Devanagari. This work was further expanded, incorporating a blackboard model for knowledge integration, in the Ph.D. thesis of Veena Bansal titled "Integrating Knowledge Sources in Devanagari Text Recognition".
Some work has also been carried out on on-line character recognition for Roman script using handwriting modeling. Investigations on on-line isolated Devanagari characters have also been carried out, and further investigations on the subject are in progress.
(Editorial Comment: This technology was declined for test & evaluation by STQC.)
5. Transliteration
Transliteration among Indian scripts is easily achieved using ISCII (Indian Script Code for Information Interchange). ISCII has been designed using the phonetic property of Indian scripts and caters to the superset of all Indian scripts. By attaching an appropriate script-rendering mechanism to ISCII, transliteration from one Indian script to another is achieved in a natural way. However, transliteration from the non-phonetic Roman script to an Indian script requires the use of heuristics to convert the text to its probable intended spoken form before it can be transliterated. Similarly, transliteration from an Indian script to Roman requires a standardized mapping table if the result is to be easily readable. In our work on transliteration, we have suggested heuristics and tables; several other workers have come up with their own suggestions. Recently, TDIL has come up with a standardization of this table, called INSROT, which uses only lower-case letters to facilitate standard search.
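The table-driven idea can be sketched as a common phonetic code sequence plus one rendering table per script. The tiny tables below are invented for illustration and cover only a few letters; they are not the official ISCII or INSROT tables.

```python
# Sketch: transliteration via a shared phonetic intermediate code.
# One mapping table per script suffices, because the intermediate code
# is common to all scripts.

TABLES = {
    "devanagari": {"ka": "क", "ma": "म", "la": "ल"},
    "roman": {"ka": "ka", "ma": "ma", "la": "la"},  # lower-case, INSROT-style
}

def render(codes, script):
    # Attach a script-rendering table to the phonetic code sequence.
    return "".join(TABLES[script][c] for c in codes)

codes = ["ka", "ma", "la"]          # common phonetic representation
print(render(codes, "devanagari"))  # कमल
print(render(codes, "roman"))       # kamala
```

Going in the other direction (Roman text to the phonetic codes) is where the heuristics mentioned above come in, since plain Roman spelling does not encode the phonetics unambiguously.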
Some Relevant Publications
• R.M.K. Sinha, 'Computer processing of Indian languages and scripts - potentialities and problems', Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 133-49.
• R.M.K. Sinha and B. Srinivasan, 'Machine transliteration from Roman to Devanagari and Devanagari to Roman', Jour. of Inst. Electron. & Telecom. Engrs., vol. 30, no. 6, 1984, pp. 243-45.
6. Spell-Checker Design
For Indian scripts, there is only a very loose concept of spelling. Writing in Indian scripts is a direct mapping of the inherent phonetics: you write as you speak. There are geographical variations in the spoken form, and so the spellings vary. Our approach to the design of a spell checker is to develop a user error model for each class of user, where the source of error may be incorrect phonetics, inaccurate inputting or other influences. The spell checker uses this error model in making suggestions for the error.
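A minimal sketch of the error-model idea follows. The confusion costs, phoneme segmentation and word list are invented for illustration; this is not the actual IITK spell checker. Substitutions between phonetically confusable units cost less, so suggestions favour the probable intended phonetic form for a given class of user.

```python
# Sketch: edit distance over phoneme sequences, with substitutions
# weighted by a per-user-class error model.

# Hypothetical error model: s/sh are often interchanged by some users.
CONFUSION_COST = {("s", "sh"): 0.2, ("sh", "s"): 0.2}

def sub_cost(a, b):
    return 0.0 if a == b else CONFUSION_COST.get((a, b), 1.0)

def distance(p, q):
    # Standard edit-distance DP, but substitution cost comes from the model.
    d = [[0.0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        d[i][0] = float(i)
    for j in range(len(q) + 1):
        d[0][j] = float(j)
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,    # deletion
                          d[i][j - 1] + 1.0,    # insertion
                          d[i - 1][j - 1] + sub_cost(p[i - 1], q[j - 1]))
    return d[len(p)][len(q)]

# Toy lexicon, stored as phoneme sequences.
WORDS = {"shakti": ["sh", "a", "k", "t", "i"],
         "bhakti": ["bh", "a", "k", "t", "i"]}

def suggest(phonemes):
    # The lexicon word at minimum error-model distance is the suggestion.
    return min(WORDS, key=lambda w: distance(phonemes, WORDS[w]))

print(suggest(["s", "a", "k", "t", "i"]))  # the cheap s/sh substitution points to 'shakti'
```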
7. Knowledge Resources
The seeds of this work were planted in the pre-Internet days with a project undertaken by Dr. T.V. Prabhakar, Indian Institute of Technology (IIT) Kanpur, funded by the Chinmaya International Foundation (1989). A DOS version of Swami Chinmayananda's book The Holy Geeta was hyperised and published as Geeta Vaatika (1992), perhaps the first electronic book in India. After the emergence of Internet standards, Geeta Vaatika was redone in HTML (1996).
As the World Wide Web grew, the Government of India (Department of Electronics) funded a project that continued this work. Work began on the Gita Supersite, which included multiple commentaries and translations of the Bhagavadgita. A website was designed and built, with the programming (business logic) initially all on the client side.
Resource Centres for Indian Language Technology Solutions were established throughout the country. Under one such Resource Centre, established at IIT Kanpur, work on the Gita Supersite has continued. The technology was extensively reworked and the content was converted into a database, with all the business logic on the server side. Currently, work is going on to convert the data into a font-independent ISCII database, streamline the programs, improve the audio content, add many more commentaries on the Bhagavadgita and provide additional features on the site.
Meanwhile, the idea of building Heritage Websites related to Indian philosophical texts emerged. A series of websites was planned, including the Upanishads (to include 12 major Upanishads with Sankara's commentaries and translations in English & Hindi), the Brahma Sutra, the Complete Works of Sankara, Ramcharitmanas and the Yoga Sutra.
The experience of building websites in Indian languages was shared with others, and a bilingual site was designed and built for the Uttar Pradesh Trade Tax Corporation, Government of India. A site on the life and works of the contemporary sage Paramhans Rammangaldasji was also built. Moving in another direction, an all-Hindi site on disease information and health, Bimari-Jankari, was created.
The following sites were developed under the TDIL programme:
Gitasupersite : http://
Brahmasutra : http://
Yogasutra : http://
Complete Works of Adi Sankara : http://
Ramcharitmanas : http://
Upanishads : http://
Minor Gitas : http:// minigita/index.html
Kavi Sammelan : http://
Munshi Premchand : http://
Bimari-Jankari : http://
U P Trade Tax : http:// tradetax
Paramhans Ram Mangal Das Ji : http://
Short write-ups on some of the above sites are given below:
7.1 Gitasupersite
On the Gita Supersite, one can view the entire Bhagavadgita in its original language (Sanskrit) in any of ten Indian-language scripts (Assamese, Bengali, Devanagari, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu) or in Roman transliteration. The Supersite also contains classical and contemporary commentaries on the Bhagavadgita, with translations in Hindi and English.
The Gita Supersite has been designed to open multiple windows, so that one can view multiple translations and/or commentaries on the Bhagavadgita simultaneously. A Two-Book option for comparative study is also available. The search facility on this Supersite enables a search for the occurrence of any word in the original text of the Bhagavadgita.
The Gita Supersite is available for Windows, Unix/Linux and Mac platforms, with web browsers that support frames, JavaScript and Java (such as Netscape Navigator 4.0/Internet Explorer 4.0 or higher versions). Users do not need to download fonts, because dynamic fonts have been used on this website.
Audio of the chanting of the Bhagavadgita shlokas by Swami Brahmananda of Chinmaya Mission, Bangalore, is also available on this website.
The texts included in the Gitasupersite are:
Mool Slokas [Sanskrit Verses] of the Bhagavadgita
in all major Indian Language Scripts
Hindi translation - Swami Ramsukhdas
Hindi translation - Swami Tejomayananda
English translation - Swami Gambhirananda
English translation - Dr. S Sankaranarayan
English translation - Swami Sivananda
Sanskrit Commentary - Sri Abhinavagupta
English translation of Sri Abhinavagupta’s Sanskrit
Commentary - Dr. S Sankaranarayan
Sanskrit Commentary - Sri Ramanuja
English t ranslat ion of Sri Ramanuja’s Sanskrit
Commentary - Swami Adidevananda
Sanskrit Commentary - Sri Sankaracharya
Hindi translation of Sri Sankaracharya’s Sanskrit
Commentary - Sri Harikrishandas Goenka
English translation of Sri Sankaracharya’s Sanskrit
Commentary - Swami Gambhirananda
Hindi Commentary - Swami Chinmayananda
Hindi Commentary - Swami Ramsukhdas
English Commentary - Swami Sivananda
Sanskrit Commentary - Sri Anandgiri
Sanskrit Commentary - Sri Jayatirtha
Sanskrit Commentary - Sri Madhvacharya
Sanskrit Commentary - Sri Vallabhacharya
English translation - Swami Adidevananda
Sanskrit Commentary - Sri Madhusudan Saraswati
Sanskrit Commentary - Sridhara Swami
Sanskrit Commentary - Sri Vedantadeshikacharya
Sanskrit Commentary - Sri Purushottamji
Sanskrit Commentary - Sri Neelkanth
Sanskrit Commentary - Sri Dhanpati
A sample page from the site is shown below
7.2 Brahma Sutra
The Brahma Sutra of Badrayana is one of the Prasthana Trayi, the three authoritative primary sources of Vedanta philosophy. No study of Vedanta is considered complete without a close examination of the Brahma Sutra.
It is in this text that the teachings of Vedanta are set forth in a systematic and logical order. The Brahma Sutra consists of 555 aphorisms, or sutras, in 4 chapters, each chapter divided into 4 sections. The first chapter (Samanvaya: harmony) explains that all the Vedantic texts talk of Brahman, the ultimate reality, which is the goal of life. The second chapter (Avirodha: non-conflict) discusses and refutes the possible objections against Vedanta philosophy. The third chapter (Sadhana: the means) describes the process by which ultimate emancipation can be achieved. The fourth chapter (Phala: the fruit) talks of the state that is achieved in final emancipation.
Indian tradition identifies Badrayana, the author of the Brahma Sutra, with Vyasa, the compiler of the Vedas. Many commentaries have been written on this text, the most authoritative being the one by Adi Sankara, which is considered an exemplary model of how a commentary should be written.
7.3 Complete Works of Adi Sankara
Adi Sankara, the 9th-century philosophical giant of India, was both an intellectual genius and a prolific writer. In his brief life-span of 32 years, he composed over 30 original works on Vedanta, wrote authoritative commentaries on 11 Upanishads, the Brahma Sutra, the Bhagavadgita and other major texts, and also created inspiring devotional hymns to various gods and goddesses.
This website is perhaps the first online repository of the Complete Works of Adi Sankaracharya. The texts can be read in the original Sanskrit in any one of 11 Indian-language scripts. The texts can also be downloaded for printing, making this vast, invaluable resource easily accessible to users all over the world.
7.4 Ramcharitmanas
The Ramcharitmanas, the 16th-century masterpiece written by Goswami Tulsidas, is the story of Lord Rama. The text is an unparalleled combination of devotion and pure non-dualistic philosophy. This website is an attempt to use contemporary technology to facilitate and enhance the study of this ancient scripture. Some of the features available on the website are:
Read the book : Read Ramcharit manas wit h a
unique, user-friendly interface. Navigation through
the book can be linear, using the ‘next’ and ‘previous’
buttons. Or, you can use the Navigation Bar to go
directly to the verse — doha (or sortha), chaupai,
sloka or chhanda — of your choice.
2-Book View: Open two copies of Ramcharitmanas
simultaneously, for a comparative study of different
kaandas of the text.
Word Search: Alphabetic Search for the occurrence
of any word in Ramcharitmanas
Verse Search: Search for verses in Ramcharitmanas,
using the first few words of the verse
Power browse: This option is for Power Users of
Ramcharitmanas who wish to get an overview or
quickly browse through the dohas and chaupais of
any kaanda of the text
Download: Get printer-friendly chapters of Ramcharitmanas
Tulsidas: Read about Tulsidas, the author of Ramcharitmanas
Related Links: Annotated Links to related sites
7.5 Upanishads
The Upanishad site covers all the major Upanishads:
Isavasya, Kena, Katha, Prasna, Mundaka, Mandukya
(with the Karika), Taittiriya, Aitereya, Svetashvatra,
Brihadaranyaka and Chandogya.
For each Upanishad we have several commentaries
and translations as given below. For a detailed list of
available translations and commentaries for each
Upanishad please see the appendix.
7.6 Kavi Sammelan
This is the first Virtual Kavi Sammelan on the Web.
On this site, you can “create” a Hindi kavi sammelan,
choosing from a database of around 100 poems.
Selections for your kavi sammelan can be made
based on the mood [rasa] or metre [vidha] of the
poems and the poet(s) whose poems you wish to
include. Video and audio recordings of the poems
are available, so you can ‘see’ and ‘hear’ the poets
recite their own poems.
Poems in five moods and five metres are available
here. The five moods are: vira rasa, hasya/vyanga,
shringar rasa, shant rasa and vividha rasa. The five
metres are: geet, ghazal, doha, chhanda mukta and
muktak. Thirteen poets have been featured on this
website: Gopaldas Neeraj, Govind Vyas, Dharmpal
Awasthi, Madhup Pandey, Buddhinath Mishra,
Urmilesh, Kailash Gautam, Shiv Om Ambar, Surya
Kumar Pandey, Surendra Dube, Vineet Chauhan,
Kamal Musaddi and Suresh Awasthi.
Audio recordings of the poets talking about what
poetry means to them are also available. A detailed
biography and a photo-gallery of each poet have been
put up. Other features include book-reviews,
interviews and articles.
An introductory article on the History of Hindi Kavi
Sammelans has been specially written by Dr.
Upendra for this website. Listen to Dr. Upendra
summarise the development of kavi sammelans over
the years.
7.7 Bimari-Jankari
Bimari-Jankari is a medical website created for the
benefit of Hindi-speaking people. Special efforts
have been made to simplify both the language and
the concepts that are explained on the site. The major
idea behind this website is to supplement a doctor’s
function. By reading about medical conditions and
diseases, a patient (or patient’s family and friends)
would understand their own situation, and therefore
be in a better position to cooperate with the doctors’
advice and prescription.
Navigation within the website has been simplified
to make it easy to search for information on any
disease. Under each disease, information has been
provided in an easy-to-understand, question-answer
format. To enable the user to understand the disease
process better, the related functional anatomy and
physiology has been briefly explained.
Images and illustrations have been used profusely
throughout the site. Most of these images have been
prepared specially for this website; some of the
images have been obtained with permission.
7.8 Paramhans Ram Mangal Das Ji
Sri Paramhansa Ram Mangal Das ji (1893-1984)
was an extraordinary sage of our times. He was
blessed with a divine vision by virtue of which he
could communicate with saints from the past.
Over 2000 saints and gods belonging to all
religions of the world visited him and gave him
their messages, which he transcribed. These
transcribed messages (over 3500 in number) are
in the Avadhi dialect and run into four volumes
called the Divya Granths.
This website contains these Divya Granths,
together with other works of Sri Ram Mangal Das ji.
The messages have been uniquely indexed and can
be read chronologically or alphabetically. A topic
index is also to be included very soon. A photo
gallery, as well as some audio and video recordings
of Sri Ram Mangal Das ji, are in the pipeline. A
simple Avadhi-Hindi dictionary is also under
preparation.
Creation of this website has been an exercise in
making a format using which the works of the
innumerable great sages of our times can be
preserved as part of our rightful heritage.
7.9 Nepali Texts
The Nepali site has the following original texts:
Basai - Leelabhadhur Shatri
Bhanubhaktko Ramayana - Bhanubhakt Acharya
Kunjani - Laxmi Prasad Devkota
Langda ko Sathi - Lain Singh Vadadail
Maaitghar - Lain Singh Vadadail
Moona Madan - Laxmi Prasad Devkota
Ritu Vichar - Lekhnath Paudwal
Tarun Tapsi - Lekhnath Paudwal
Vinod Pushpanjali - Rajeshwar Devkota
Abstract Chintan–Pyaaz - Shankar Lamichhane
8. Technical Issues
Some of the major features of these sites are:
• The server side is written in PHP.
• The content is stored in MySQL in ISCII (as against ISFOC).
• On-the-fly transliteration into ISFOC for any Indian language.
• Search in all Indian languages.
• On-the-fly PDF generation in all Indian languages.
• Chanting of the Gita Shlokas.
• Continuous play of the Gita Shlokas.
Architectural Diagram of the Gitasupersite and its
sister sites
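The storage-versus-rendering split described above (ISCII in the MySQL database, ISFOC glyph codes generated at page-render time) can be sketched as a table-driven byte remapping. This is only an illustration: the two table entries below are hypothetical placeholders, since real ISFOC glyph codes are font-specific and come from the font vendor's tables.

```python
# Sketch of on-the-fly ISCII -> ISFOC transliteration: stored ISCII bytes are
# remapped to font glyph codes at render time via a lookup table. The values
# below are HYPOTHETICAL placeholders, not real ISFOC glyph codes.

ISCII_TO_ISFOC = {
    0xA4: 0x41,  # hypothetical glyph code for a vowel letter
    0xB3: 0x6B,  # hypothetical glyph code for a consonant letter
}

def transliterate(iscii_bytes: bytes) -> bytes:
    """Map each ISCII byte to an ISFOC glyph code; pass ASCII through."""
    out = bytearray()
    for b in iscii_bytes:
        if b < 0x80:              # ASCII range: unchanged
            out.append(b)
        else:                     # Indic range: table lookup
            out.append(ISCII_TO_ISFOC.get(b, ord('?')))
    return bytes(out)

print(transliterate(bytes([0x41, 0xA4, 0xB3])))
```

A real deployment would hold one such table per target font, chosen by the requested language.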
9. Appendix - Details of Texts/Commentaries for
each of the Upanishads
For eight of the Upanishads, the following texts and
commentaries are available:
• Mool Mantra [Sanskrit Verses] of the Upanishad
• English Translation - Swami Sivananda
• English Commentary - Swami Sivananda
• English Translation - Swami Gambhirananda
• Hindi Translation - Gita Press Gorakhpur
• Hindi Commentary - Harikrishandas Goenka
• Sanskrit Commentary - Sri Shankaracharya
• Hindi Translation of Sri Shankaracharya's Sanskrit
Commentary - Gita Press Gorakhpur
• English Translation of Sri Shankaracharya's Sanskrit
Commentary - Swami Gambhirananda
For one Upanishad:
• Mool Mantra [Sanskrit Verses] of the Upanishad
• English Translation - Swami Sivananda
• English Commentary - Swami Sivananda
• Hindi Commentary - Harikrishandas Goenka
For another:
• Mool Mantra [Sanskrit Verses] of the Upanishad
• English Translation - Swami Madhavananda
• Sanskrit Commentary - Sri Shankaracharya
• English Translation of Sri Shankaracharya's Sanskrit
Commentary - Swami Madhavananda
10. The Team Members
R.M.K. Sinha
T.V. Prabhakar
Harish C. Karnick
T. Archna
Murat Dhwaj Singh
Rajni Moona
Madhu Kumar
Amit Mishra
Md. Masroor
Rajeev Bhatia
Others Who Helped
Dr. Vineet Chaitanya (VC) was the driving force
behind the Geeta Vaatika, as well as the
inspiration for the Gita Supersite. Now he watches
our activities from IIIT Hyderabad, and is still one
of the few who truly understand the spirit behind
this work.
Nagaraju Pappu, the first one, wrote 100,000 lines
of C code for the initial versions of Geeta Vaatika.
His DOS version had more features than the
current HTML one!
Apart from the current team, those who have
contributed to the growth of these websites in a major
way include:
K. Anil Kumar, Anvita Bajpai, Ashutosh Sharma,
Gita Pathak, K. Ravi Kiran, Rohit Patwardhan,
Samudra Gupta, Shrikant Trivedi and
Tripti Singh.
Courtesy: Prof Sanjay Dhande
Indian Institute of Technology
Department of Computer Science & Engineering
Kanpur - 208 016
(RCILTS for Hindi & Nepali)
Tel : 00-91-0512-2597174, 2598254
E-mail :
Editorial Comment : Because of the very large number of publications
related to the above article by the Resource Centre and the constraint
of space, we could not include the publication details here; these
have already been listed in the April 2003 issue of VishwaBharat.
For the publications please contact Prof. Sanjay Dhande /
Prof. R.M.K. Sinha, IIT(K).
Department of Gujarati, Faculty of Arts
The Maharaja Sayajirao University of Baroda, Vadodara-390002, India
Tel. : 00-91-265-2792959 E-mail :
Website : http:/ / rciltg/
Resource Centre For
Indian Language Technology Solutions – Gujarati
The Maharaja Sayajirao University of Baroda, Vadodara
Achievements
Maharaja Sayajirao University, Baroda
1. Knowledge Resources
(Corpora, Parallel Corpora, Multi Lingual Digital
1.1 Gujarati WordNet (Lexical Resource)
Gujarati WordNet, a Lexical Resource, is in the
process of development. The RC at MSU is
working in close collaboration with, and under the
able guidance of, Professor Pushpak
Bhattacharya of the RC at IIT, Mumbai. The Data
Entry Interface for entering the synsets has been
built, and over 500 synsets have already been entered.
Further work on preparing synsets and entering them
into the database is going on. Basic semantic
relationships such as Hypernymy and Hyponymy
have also been established between the synsets formed
and entered in the database, and work is going
on to establish other semantic relationships
between the synsets. A Web Interface is
under construction to make an on-line lexical
database of Gujarati available
over the Internet.
2. Knowledge Tools
(Portals, Language Processing Tools, fonts,
morph analyzer, spell checker, text editor,
basic word processor, code conversion utility, etc.)
2.1 Portal
The RC for Gujarati has a web site in English hosted at
http://…/rciltg/ . The Gujarati version
of the same will be available soon. The web site
contains information about the software
developments at RCILTG. The full text of poems
and prose by well-known Gujarati authors is planned
to be displayed on the web site. A manuscript of
the 19th-Century classic "Saraswatichandra" by
Govardhanram Madhavram Tripathi, in the author's
own handwriting, has been located and digitized and
will soon be put on the web for interested researchers.
2.2 Multilingual Text Editor
Unicode based Multilingual Text Editor which was
required since long for Indian Languages, has been
developed at the RC for Gujarati.
The important feature of this Text Editor is that it
fulfils t he necessit y of Universal st orage format
(Unicode) for Indian Languages. T he Indian
languages that are given support by this Text Editor
are Bangla, Gujarati, Gurmukhi, Hindi, Kannada,
Malayalam, Marathi, Oriya, Sanskrit, Tamil and
Telugu along with English as International language.
To make typing ease, assistance is given to the user
by providing an onscreen INSCRIPT keyboard
layout for all languages supported. The Text Editor
supports the basic file features like creating new file,
opening existing file and saving file. It is efficient
enough to open file of around 64KB. Also the basic
text editing features are provided like cut, copy, paste,
undo, redo, find, find next, replace and select all.
The user can use any of the system fonts for writing
text. The styles supported by the Text Editor
are font name, font size, bold, italic and underline.
A sorting facility is also available, based on word
sort or line sort. Three different Java Look and
Feels, viz. Motif, Windows and Metal,
are provided to the user. The Gujarati Spell Checker
is also bundled with this Text Editor.
Current Focus
• To increase the efficiency of Text Editor.
Further Plans
• E-mail support for Indian Languages.
• ISCII compatible Import/ Export facilities.
• More formatting features to be provided.
2.3 Code Converter
We have programs which can convert text
between Unicode and ISCII.
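A minimal sketch of such a converter, assuming a table-driven design: two Devanagari sample mappings from the ISCII-91 chart are shown (to the best of our reading of the chart), and a real converter needs the complete table plus handling for ISCII's special control and attribute codes.

```python
# Sketch of a table-driven Unicode <-> ISCII converter. Only two sample
# mappings are included; a production converter needs the full ISCII-91 table.

ISCII_TO_UNICODE = {
    0xA4: "\u0905",  # DEVANAGARI LETTER A
    0xB3: "\u0915",  # DEVANAGARI LETTER KA
}
UNICODE_TO_ISCII = {v: k for k, v in ISCII_TO_UNICODE.items()}

def iscii_to_unicode(data: bytes) -> str:
    """Translate mapped ISCII bytes; pass other bytes through as-is."""
    return "".join(ISCII_TO_UNICODE.get(b, chr(b)) for b in data)

def unicode_to_iscii(text: str) -> bytes:
    """Inverse direction; unmapped ASCII passes through unchanged."""
    return bytes(UNICODE_TO_ISCII.get(ch, ord(ch)) for ch in text)

roundtrip = unicode_to_iscii(iscii_to_unicode(bytes([0xA4, 0xB3])))
print(roundtrip)
```

The round trip through both tables is lossless for every character present in the mapping, which is the property the RC's converter needs for "as and when required" ISCII output.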
2.4 Gujarati Spell Checker
The Gujarati Spell Checker is in the last stage of
development. It is based on a Morphological
Analysis of the Gujarati language, in order to increase
the intelligence of the Gujarati Spell Checker.
The Morphological Analyzer covers the analysis of
nouns and verbs of the language.
The suggestion generation is very effective; the
number of suggestions generated will
not be greater than ten. All non-Gujarati words,
along with other symbols, are ignored and
not checked. The Gujarati Spell Checker is
integrated with the Multi-Lingual Text Editor for
the user's convenience.
Currently the software is under testing, and the
percentage accuracy of the Morphological Analysis
will be available very soon.
Current Focus
• Rigorous testing, using a large corpus, to find all
the faults in the spell checker and morphological
analyzer.
• Improving the algorithms for the correctness and
efficiency of the spell checker.
Further Plans
• Increasing the size of the root dictionary to cover
the maximum possible words of the language.
• Making an independent spell checker which can
be used on Unicode- or ISCII-compatible text.
• More testing to increase the correctness
of the spell checker, and improving the
suggestion generation.
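The bounded suggestion generation described above (at most ten candidates) can be sketched as a plain edit-distance ranking over a dictionary. The English word list is illustrative only; the real spell checker ranks morphologically analyzed Gujarati forms.

```python
# Sketch of bounded suggestion generation: rank dictionary words by edit
# distance to the misspelt word and return at most ten candidates.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, limit=10):
    """Closest dictionary words first, capped at `limit` suggestions."""
    ranked = sorted(dictionary, key=lambda w: edit_distance(word, w))
    return ranked[:limit]

print(suggest("hte", ["the", "hat", "he", "tea", "cat"]))
```

Running the analyzer first shrinks the candidate set to valid roots plus affixes, which is what makes the ranking step affordable.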
3. Translation Support Systems
(Machine Translation, Multi Lingual Information
Access, Cross Language Information Retrieval)
3.1 Machine Translation
The work on this front has already started; initially
we have planned to be domain-specific and have
selected weather-related text. A group of two is
working in this area. The basic Universal Word
Lexicon format is being studied.
Out of all the approaches, we have selected the
Interlingua (UNL) based approach for this purpose.
The team has started exploring the various resources
available on the UNDL Foundation website. The
specifications for the enconverter and deconverter
are being studied. The UNL proxy has been
downloaded and tested with the sample documents.
4. Human Machine Interface Systems
(Optical Character Recognition Systems, Voice
Recognition Systems, Text to Speech Systems)
4.1 OCR for Gujarati
A working prototype of the Gujarati OCR, capable
of the end-to-end task of converting an image into
equivalent text, has been produced. The OCR
employs the template matching technique (the same
as the Telugu OCR developed at the RC for Telugu
uses) for recognition, and nearest neighbour for
classification. The Gujarati OCR differs in many
ways from the Telugu one due to special features
like the necessity for zone analysis.
The program, which can be invoked from the
command line, takes a gray image, scanned in color
or grayscale, in any of the popular image formats
like JPEG, GIF, PNG and BMP, and after processing
produces the output as a plain text file in Unicode
format. The input image is assumed to be
skew-corrected, without any images and tables, and
having only one column. We have a converter which
converts Unicode to equivalent ISCII and vice versa,
so we can produce ISCII output as and when required.
Rigorous testing of the OCR for recognition rate
etc. has been initiated and the initial results are
quite good. Work is in progress on fine-tuning
various modules in order to improve the
recognition accuracy.
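The recognition scheme described (template matching with nearest-neighbour classification) can be sketched as follows. The 3x3 bitmaps are toy stand-ins for real normalized glyph images, and the two labels are illustrative.

```python
# Toy template library: label -> normalized binary glyph bitmap (illustrative).
templates = {
    "ka": [[1, 0, 1], [1, 1, 1], [1, 0, 1]],
    "ga": [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
}

def mismatch(a, b):
    """Pixel-wise Hamming distance between two equal-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(glyph):
    """Nearest neighbour: the template with the smallest mismatch wins."""
    return min(templates, key=lambda label: mismatch(glyph, templates[label]))

noisy_ka = [[1, 0, 1], [1, 1, 1], [1, 1, 1]]  # "ka" with one flipped pixel
print(classify(noisy_ka))
```

Swapping the raw-pixel comparison for a DCT feature vector, as the Current Focus below proposes, changes only the `mismatch` step; the nearest-neighbour decision stays the same.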
Current Focus
• In the tests we have seen some misrecognition,
so we are currently evaluating the performance
of the Discrete Cosine Transform as a feature
extractor.
• The programs are being reorganized in an
object-oriented way.
Future Plans
• Developing/ Adapting skew-correction algorithm.
• Developing the post-processing routines.
• Thorough testing of the OCR and necessary fine
tuning to take it to the required accuracy of 97%
or above.
• Integrating with the OCRized spell-checker.
• Developing a GUI for the system with an inbuilt Text Editor.
4.2 Text To Speech (TTS)
The work on TTS has already been started at RC-
Gujarati. A team of three students has explored the
available TTSs, such as FreeTTS and IBMJS in
English, and Dhvani and Naarad in Hindi. All of
these are implementations of the Java Speech API
specifications. We have also made significant progress
in this direction.
While starting this work, we divided our task into
the following two major subtasks:
1. Developing a program which generates output
that can be input to an existing speech engine.
E.g., IBMJS takes a phonetic transcription of the
text using IPA codes to produce the sound.
2. Developing a core speech engine with native
accent. We are adopting the concatenation-based
approach.
Task Completed So Far
• A program is now available with RC-Gujarati
which allows the user to type Gujarati text using
the INSCRIPT keyboard layout; it then
converts the string into its equivalent IPA string
and feeds it to the IBM speech engine, which
in turn produces the sound represented by the
given IPA string.
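The pipeline in the bullet above can be sketched as a grapheme-to-IPA lookup whose output is handed to the engine. The three mappings are a tiny illustrative fragment of a Gujarati grapheme-to-phoneme table, and `speak_ipa` is a hypothetical stand-in for the actual engine interface.

```python
# Illustrative fragment of a Gujarati grapheme -> IPA table.
GUJARATI_TO_IPA = {
    "\u0A95": "k",        # GUJARATI LETTER KA
    "\u0AA4": "t\u032a",  # GUJARATI LETTER TA (dental t)
    "\u0ABE": "a\u02d0",  # GUJARATI VOWEL SIGN AA (long a)
}

def to_ipa(text: str) -> str:
    """Transcribe known letters; pass anything unmapped through unchanged."""
    return "".join(GUJARATI_TO_IPA.get(ch, ch) for ch in text)

def speak(text: str, engine) -> None:
    # speak_ipa is a HYPOTHETICAL method name standing in for the real call
    engine.speak_ipa(to_ipa(text))

print(to_ipa("\u0A95\u0ABE"))  # ka followed by the aa vowel sign
```

A full transcriber must also handle the inherent schwa of consonants and conjunct forms, which a character-by-character lookup alone cannot.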
Current Focus
• Exploring the phonological properties of the
language, extracting the basic phonemes from
continuous speech, and the design of the basic speech
Future Plans
• Development of a preprocessor to handle
language-specific details.
• Developing the speech engine and integrating it with
5. Localization
(Adapting IT tools and solutions in Indian
language(s) - IT localisation clinic, interaction with
state government for possible solutions in ILs)
5.1 Language Technology Human Resource
(Manpower Development in Natural Language
Processing - specialized training programs)
The RC for Gujarati is trying to develop manpower
by providing resources to final-year students
from relevant backgrounds like Mathematics,
Computer Science, Electrical Engineering,
Electronics Engineering, Computer Applications
etc. for doing their projects with the RC. Two (one
M.E. (Electrical) and one M.Sc. (Statistics))
students have already completed their projects on
the study and development of various techniques for
Gujarati OCR.
At present five students are working on their
final-year projects: three on TTS technologies
and two on the development of the UNL-based
machine translation support system.
Apart from this, RC-Gujarati has trained three
physically challenged persons in Gujarati data entry
using the INSCRIPT keyboard; they now do the
data entry for the RC, for which they are paid.
6. Standardization
(Contribution towards UNICODE,
Transliteration Table(s), XML, lexware format, etc.)
UNICODE for Gujarati has been studied and some
additions and corrections have been suggested. They
are as follows:
• Correction in the shape of AVAGRAH.
• Addition of the Gujarati Currency sign and symbols
for ¼, ½, ¾.
• Addition of two rendering rules for the combination
of Ja with the dependent vowel signs E and EE.
The collation sequence for the Gujarati alphabet is not
the same as the sequence of the characters in the
UNICODE chart; hence the correct collation
sequence has been finalized.
All the above modifications and suggestions were
published in the April 2002 issue of the Vishwabharat
magazine of TDIL.
In addition to the above, the RC for Gujarati has worked
closely with Mr. Abhijit Dutta of IBM, Delhi, and
helped him in standardizing the collation
sequence for Gujarati and the set of glyphs for
developing the fonts.
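Sorting by a finalized collation sequence rather than code-chart order can be sketched with a rank table. The five-letter "alphabet" and its order below are invented purely for illustration, standing in for the finalized Gujarati sequence.

```python
# Sketch of collation by a language-specific order table instead of raw
# code-point order. The alphabet below is a hypothetical illustration.

COLLATION_ORDER = "aiueo"   # invented letter order, NOT the Gujarati one
RANK = {ch: i for i, ch in enumerate(COLLATION_ORDER)}

def collation_key(word: str):
    """Rank each letter by the collation table; unknowns sort last."""
    return [RANK.get(ch, len(COLLATION_ORDER)) for ch in word]

words = ["eau", "aie", "uio"]
print(sorted(words, key=collation_key))
```

The same key-function approach lets a dictionary or word processor sort Gujarati text correctly even though the Unicode chart order differs from the linguistic order.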
Apart from the development of basic technology, the RC
for Gujarati has, as per the original proposal
accepted by the Ministry, developed a strong
knowledge base in Gujarati, available on
multimedia CDs and on our website. This
knowledge base includes the following:
1. Select Gujarati classics in print and
manuscripts, including a rare manuscript of the
19th-century classic Sarasvatichandra in the
author's handwriting, with many changes he
made incorporated in the digitized form
on CD.
2. Cultural History and Geography of Gujarat
on CD, useful to researchers in one set and
to tourists in a second set.
3. Multimedia CDs on Rural Health Problems
and Preventive Care.
4. Language self-learning CDs, useful to students
(learning Sanskrit, English etc.), second-generation
NRIs and new researchers.
7. Products which can be launched and the
services which the Resource Centre can provide to
the State Government and the Industry in the region
7.1 Technical Skills
The RC for Gujarati has built technical skills
related to Image Processing and Pattern
Recognition, audio signal processing, speech
synthesis, and Unicode-based text processing and
rendering, with implementations of these in
Java. These can readily be shared, with proper
agreement, with researchers and industry in the
region. This knowledge can even be transferred
to the new generation by teaching courses
in these areas at nearby colleges.
RC-Gujarati has developed a Multilingual Text
Editor which can be used to type in Unicode.
With its help, we could propagate the
use of standard encoded storage of data in
Unicode or ISCII. In turn, this would lead to
a situation where data can be transferred
from one machine to another without
transferring fonts. This could be of much use
in Government offices, where such data transfer
is a daily routine.
7.2 Products Which can be Launched
At present the RC has a multilingual text editor ready
to launch; its details are given above.
7.3 IT Services
RC for Gujarati has developed the following services
which it can provide to the State Government and
the Industry in the region:
• Multimedia CDs on Language Self-learning
("Learn Gujarati through English", "Learn
Sanskrit through Gujarati", "Learn English
through Gujarati" etc.) could be of use to
learners in Gujarat and overseas, including the
NRIs. In our interactions with him, these CDs
have been found to be of interest to Dr. M.N.
Cooper of Infosis Ltd., Pune.
• Multimedia CDs on Rural Health and Adolescent
Health, developed in collaboration with the
Tribhuvandas Foundation, Karamsad, and its
Director, Dr. Nikhil Kharod, M.D., could be of
use to the Departments of Health and Education,
Gujarat Government, and to the Panchayats.
• Services on the Library Knowledge-base, giving
detailed information on books in print, 19th-century
journals and medieval manuscripts
available at some libraries in Gujarat and Mumbai,
on CDs and our Website, could be of much use
to Universities, Colleges and other institutions
of research and teaching in India and abroad. The
libraries with which we have entered into service
agreements have undertaken to provide
information through email, fax and surface mail
to users who would have access to this knowledge
base through our Website.
8. IT Services - Multimedia CDs
8.1 Learn Gujarati Through English
A multimedia tutorial has been designed that helps an
enthusiast learn the Gujarati language through
English. The tutorial is based upon the concept of
Dr. Jagadish Dave, who teaches Gujarati in Europe.
He has used this methodology to teach Gujarati to
persons who have no connection to Gujarat or, for
that matter, India.
The main features of the CD are:
• Novel approach for teaching Gujarati by
exploiting similarity of shapes for remembering
various characters.
• Starts with a familiar graphic resembling a
Gujarati letter and shows how related letters are
formed from it.
• For example, starting with the symbol 'S', which
is a character in English as well as Gujarati, the
Gujarati characters with related shapes DA, KA,
PHA, HA and THA are derived as morphological
modifications of the basic symbol.
• The related audio commentary figuratively
explains the modifications of the basic symbol.
• The pronunciation of Gujarati words in which
the character is embedded at various positions also
accompanies the presentation.
• After introducing a set of alphabets, the learner is
led to meaningful word formation involving
those alphabets.
• Introduction of grammar through simple
sentences instead of complicated rules.
Future Plans
• A practical conversation involving those words,
suitable for handling a real-life situation, is to be
provided at regular intervals.
• At the end of the presentation, a list of common
words, their meanings and their pronunciation
is to be added.
• The less common grammar rules are to be added.
• A video of some practical situations is to be added.
8.2 Adolescent Health
RCILTG organized a two-day workshop on
Adolescent Health with the Tribhuvandas Foundation
of Anand. About 40 teachers
participated in the workshop. The workshop was
aimed at adolescent health and how teachers can
scientifically disseminate knowledge about the
changes occurring in the male and female body during
this age.
This CD is a gist of the ideas and suggestions that came
out of a few long brainstorming sessions. The
information presented here is the work of Dr. Nikhil
Kharod (M.D., Pediatrician).
The features of this CD are:
• Introduction of sensitive issues from a scientific
view point.
• Accompanying pictures, sketches, diagrams and
tables giving medically proved information about
the topic.
• A detailed description of nutrients, their sources
and the diseases caused by their deficiency.
• Description of the body cycle of male and female.
• A quiz with key answers which gives the user
an idea about his/her knowledge in this area.
Our future plan includes the production of a series
of learning language CDs. Work on the CD “Learn
Sanskrit through Gujarati” has already begun. Next
to follow would be a set of CDs linking Gujarati
with other Indian languages and with English. There
is a good marketing possibility for this series.
Initial work has been completed on production of a
set of CDs with multimedia presentation on Cultural
History and Cultural Geography of Gujarat. This
series would be useful for Primary and Secondary
Schools and also for tourism.
One CD on “Rural Health” has been produced,
focusing on Adolescent Health. Similar workshops
on other areas on Rural Health would be conducted
to produce more CDs on other aspects of health in
rural Gujarat.
8.3 Bibliography of Books
In close collaboration with some important research
libraries of Gujarat and Mumbai, the RC for
Gujarati has worked on preparing an on-line
bibliography of books and journals available at these
centres. On-line lists of critical and creative items
(essays, fiction, poetry, drama, culture-related
information) published in 19th and early 20th-century
Gujarati journals have also been digitized. The libraries
have agreed to provide photo-copies of the
knowledge-base items, at no-profit level, to the users
of the RC's website.
8.4 Sarasvatichandra
The original manuscript of the 19th-century classic
Sarasvatichandra, by Govardhanram Tripathi, which
had a great impact on the cultural life of Gujarat, has
been located by our RC. The manuscript, in the author's
handwriting, has now been digitized and made into
a CD with appropriate navigation.
Note: RC for Gujarati has trained three physically
challenged former students, Kum. Rina Phadia, Shri
Ganpat Solanki and Arun Paladgaonkar, in data
entry and has used their paid services for its data
entry work for Wordnet and CD production.
9. The Team Members
Prof. Sitanshu Mehta
Prof. S. Rama Mohan
Mr. Jignesh Dholakia
Mr. Mihir Trivedi
Mr. Irshad Shaikh
Courtesy: Shri Sitanshu Y. Mehta
M.S.University of Baroda
Department of Gujarati
Faculty of Arts
Baroda – 390 002
(RCILTS for Gujarati)
Tel : 00-91-265-2792959
E-mail :
Computer Science and Engineering Department
Indian Institute of Technology, Mumbai-400076 India
Tel. : 00-91-22-25722545 Extn. 5879 E-mail :
Website : http:/ /
Resource Centre For
Indian Language Technology Solutions – Marathi & Konkani
I.I.T., Mumbai
Achievements
RCILTS-Marathi & Konkani
Indian Institute of Technology, Mumbai
The Resource Center for Technology Solutions in
Indian Languages was set up at IIT Bombay in the
year 2000. The center concentrated on developing
niche technologies for Indian languages with special
focus on t he languages of West ern India, in
particular Marathi and Konkani. The main theme
of the research has been the development of (i) lexical
resources like wordnets, ontologies and semantics-rich
machine readable dictionaries (MRD), (ii) interlingua-
based machine translation software for Hindi and
Marathi and (iii) a Marathi search engine and spell
checker. Over the years the center has built its
strength and reputation in its chosen areas and has
obtained national and international visibility. In
the paragraphs that follow, we describe the
technologies already developed and in the process
of being developed, and also the impact the center
has made in the area of information technology.
1. Lexicon and Ontology
1.1 Lexicon
The lexicon is being prepared in the context of
interlingua-based machine translation. The
interlingua is the Universal Networking Language
(UNL) (http://…), a recently proposed interlingua.
The heart of the UNL-based MT is the
universal word (UW) dictionary.
UWs are essentially disambiguated English words
and stand for unique concepts. Our lexicon contains
about 70,000 UWs and approximately 200
morphological, grammatical and semantic attributes.
These entries are linked with Hindi headwords. The
lexicon is available on the web (http://
www.cfilt.iit…/dictionary/…). The UW-
Hindi Lexicon is required for the enconversion
(analysis) and deconversion (generation) processes.
The lexicon can also be used as a language reference
for English and Hindi. We are now concentrating
on completing the language coverage and
standardizing the lexicon (standard restrictions and
semantic attributes).
1.2 Ontology
An ontology is a hierarchical organization of concepts. Domain specific ontologies are critical for NLP systems. We have classified nouns as animate and inanimate. Further sub-classification divides animate nouns into flora and fauna to cover plants and animals. The inanimate category is sub-classified as object, place, event, abstract entity, etc. Concept nouns indicating time, emotion, state, action, etc. have also been incorporated. Verbs have been divided into do, be and occur categories, with further sub-classification into verbs of action, verbs of state, temporal verbs, verbs of volition, etc. Adjectives have been classified as descriptive (indicating weight, colour, quality, quantity, etc.) and also as demonstrative, interrogative, relational and so on. Adverbs are categorized according to time, place, manner, quantity, reason, etc.
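A hierarchy of this kind can be held as nested dictionaries and counted by a simple tree walk. The sketch below is illustrative only: the category names are taken from the description above, but the tree is a small hypothetical slice, not the centre's actual inventory.

```python
# Illustrative slice of an ontological category hierarchy: each key is
# one category node (names follow the text above; tree is a toy sample).
ontology = {
    "noun": {
        "animate": {"fauna": {}, "flora": {}},
        "inanimate": {"object": {}, "place": {}, "event": {}, "abstract_entity": {}},
    },
    "verb": {"do": {}, "be": {}, "occur": {}},
    "adjective": {"descriptive": {}, "demonstrative": {}, "interrogative": {}, "relational": {}},
    "adverb": {"time": {}, "place": {}, "manner": {}, "quantity": {}, "reason": {}},
}

def count_categories(tree):
    """Count every category node in the hierarchy, at all depths."""
    return sum(1 + count_categories(sub) for sub in tree.values())

print(count_categories(ontology))  # → 24 for this toy tree
```

The statistics reported for the real ontology are obtained by exactly this kind of walk over the full category tree.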
Statistics of Ontological Categories
We have 22 broad categories for Noun, Verb, Adjective and Adverb. Under these categories, we have 58 sub-categories for all word classes. The noun class has 55 sub-sub and 29 sub-sub-sub categories (vide the table below):
1. No. of categories of Noun, Verb, Adjective and Adverb : 22
2. No. of sub-categories of Noun, Verb, Adjective and Adverb : 58
3. No. of sub-sub categories of Noun : 55
4. No. of sub-sub-sub categories of Noun : 29
Total no. of categories : 164
2. The Hindi WordNet
The Hindi wordnet is an on-line lexical database. The design closely follows the English wordnet. The synonym sets {·(·, ·( r} and {·(·, +i··((·}, for example, can serve as an unambiguous differentiator of the two meanings of ·(·. Such synsets are the basic entities in the wordnet. Each
sense of a word is mapped to a separate synset in the wordnet. All word sense nodes are linked by a variety of semantic relationships, such as is-a (hypernymy/hyponymy) and part-of (meronymy/holonymy).
The lexical data entered by linguists and lexicographers are stored in a MySQL database. The web interface ( gives the facility to query the data entered and browse the semantic relations for the search word. The interface also provides links to extract the semantic relations that exist in the database for the search word. So far approximately 10,000 synsets have been entered. This corresponds to about 30,000 words in Hindi. There is a morphological processing module as a front end to the system.
Figure 1: Snapshot of Data entry interface for entering
synset data
Figure 2: Snapshot of Data Entry interface for entering
semantic relations for noun category
Figure 3: Online web interface of the Hindi WordNet
Figure 4: Snapshot of web interface for the query result
for the word “(--(:(-(”
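The synset organization described above can be sketched in a few lines. The entries below are toy romanized examples, not actual Hindi WordNet data; the real system keys this structure into MySQL tables.

```python
# Toy synset store: sense id -> synonym words and gloss (romanized
# illustrative entries, not actual Hindi WordNet content).
synsets = {
    1: {"words": ["jal", "paani"], "gloss": "water"},
    2: {"words": ["varshaa", "paani"], "gloss": "rain"},
}

def senses(word):
    """Map a surface word to every synset (sense) that contains it."""
    return [sid for sid, s in synsets.items() if word in s["words"]]

print(senses("paani"))  # → [1, 2]: the ambiguous word belongs to both synsets
```

Because each sense lives in its own synset, an ambiguous word simply participates in several synsets, and the semantic relations (is-a, part-of) are attached to synsets rather than to words.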
3. The Marathi WordNet
Since Marathi and Hindi share many common features, it has been possible to adopt the Hindi wordnet as the basis for Marathi wordnet creation. Hindi and Marathi originate from Sanskrit and, therefore, the tatsama words (taken directly from the mother language, e.g. ·(i-, = i-) and the tadbhava words (evolved organically from the mother language, e.g. ·( r ! ·(·, =-( ! =(-() in both languages often have the same meaning. Also, both languages use the same script, Devanagari.
In the construction of the Marathi wordnet, a Hindi synset is chosen and mapped to the
corresponding Marathi synset. For instance, for the Hindi synset {=(·(:, =(·((}, standing for paper, the Marathi synset is {=(·(:}. A gloss, which explains the meaning of the word, and an explanatory sentence in Marathi are added for each synset. So far, 3,400 synsets have been entered. This corresponds to about 10,000 words in Marathi. Our aim is to cover all the common Marathi words. As the Marathi wordnet is based on the Hindi wordnet, all the semantic relations are automatically inherited. An additional benefit to accrue is the creation of a parallel corpus.
4. Automatic Generation of Concept Dictionary
and Word Sense Disambiguation
A Concept (UW) Dictionary is a repository of language independent representations of concepts using special disambiguation constructs. A system has been developed for automatically generating document specific concept dictionaries in the context of language analysis and generation using the Universal Networking Language (UNL). The dictionary entries consist of mappings between head words of a natural language and the universal words of the UNL. The manual effort in constructing such dictionaries is enormous, because the lexicographer has to identify the disambiguation constructs and the semantic attributes of the entries. Our system, which constructs the UW dictionary (semi-)automatically, makes use of the English wordnet and processes part-of-speech tagged and partially sense tagged documents. The sense tagging step is a crucial one and uses verb association information for disambiguating the nouns. The accuracy of the word sense disambiguation system is 70-75% and that of the DDG (document specific dictionary generator) is 90-95%.
The DDG makes use of a rule base for producing the dictionary. It is based on the principle of expert systems, doing inferencing and providing explanations for the choices made with respect to the restrictions and semantic attributes.
Figure 5: stages in automatic dictionary generation
5. Hindi Analysis and Generation
5.1 Hindi Analysis (Enconversion)
The analysis engine uses the UW-Hindi dictionary
and the analysis rule base. The dictionary contains
headword, universal word and its grammatical and
semantic attributes. The process of analysis depends
on these semantic and grammatical attributes. There
are 41 relations defined in the UNL specification.
The system is capable of handling all of them. The analysis system can deal with almost all the morphological phenomena in Hindi. At present, there are 5,500 rules dealing with morphosyntactic and semantic phenomena. The system has been tested on corpora from the UN, the MICT and the agricultural domain.
5.2 Hindi Generation (Deconversion)
The generation process also uses the UW-Hindi dictionary and the generation rule base. The dictionary is the same for both analysis and generation; only the rule base is different. The generation rules are formed from the grammatical and semantic attributes as well as the syntactic relations. Some of the features of the system are:
• Matrix-based priority of relations has been designed to get the correct syntax.
• Extensive work has been done on morphology.
• The system has been run on the UNL of corpora from various domains with satisfactory results.
• There are almost 6,000 rules for syntax planning and morphology of all parts of speech.
6. Marathi Analysis
Marathi is a natural language of the Indo-Aryan family. Textual communication uses the Devanagari script. The analysis and generation of Marathi make use of the Marathi-UW lexicon and the rules of Marathi grammar. There are about 2,100 rules to handle the Marathi verb phenomena listed below:
Sr. No.  Phenomenon          Description
1        Present tense       Time w.r.t. writer
2        Past tense          Time w.r.t. writer
3        Future tense        Time w.r.t. writer
4        Event in progress   Writer's view of aspect
5        Complement          Writer's view of complement
6        Passive voice       Writer's view of topic
7        Imperative mood     Writer's attitude
8        Ability of doing    Writer's viewpoint
9        Intention           Intention about something or to do
10       Should              To do something as a matter of course
11       Unreality           Unreality that something is true or not
7. Speech Synthesis for Marathi Language
The aim of this project is to design a speech synthesis system to speak out Marathi text written in the Devanagari script. A concatenative speech synthesis model is used to achieve this goal. Ours is an unlimited vocabulary system. It employs basic units, i.e., vowels and consonants, as the basis for constructing words and sentences. The database of basic speech units required in this approach is a few hundred kilobytes, as against other approaches where the size is almost tens of megabytes. We are able to get intelligible speech with this approach. A graphical user interface for the system has been developed. It provides features for drawing waveforms and for amplifying the output speech signal. It also provides help for displaying Marathi text through English. Currently, work on rendering newspapers as speech is taking shape. On-line reading of the Marathi newspaper Maharashtra Times is being carried out as an illustration of the technology.
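The concatenative model described above can be sketched in a few lines. The unit inventory and "waveforms" below are stand-in sample lists, not real recordings; an actual system stores digitized audio of each vowel and consonant unit.

```python
# Stand-in unit database: each basic unit maps to a short sample list
# (a real system stores recorded waveforms of vowels and consonants).
units = {"k": [0.1, 0.2], "a": [0.3, 0.4, 0.3], "m": [0.2, 0.1]}

def synthesize(unit_sequence):
    """Concatenate the stored unit waveforms into one output signal."""
    signal = []
    for u in unit_sequence:
        signal.extend(units[u])
    return signal

def amplify(signal, gain):
    """Scale the output signal, as the GUI's amplify feature does."""
    return [s * gain for s in signal]
```

Because only the small inventory of basic units is stored, rather than whole words or phrases, the database stays in the hundreds-of-kilobytes range while the vocabulary remains unlimited.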
8. Project Tukaram
Tukaram Project provides the entire collection of
Saint Tukaram’s Abhangs in browsable and searchable
format. Tukaram’s abhangs are a household name in
Maharashtra. It is possible to convert the existing
version t o a st and-alone syst em, which can be
inst alled on a single machine. The abhangs are
avai lable for pu bli c vi ewi n g on h t t p:/ /
www.cfilt .iit (also see t he pict ure below).
Browsability is provided at chapter and verse levels.
Users can move from one chapter or verse to another
by using links at the bottom of the webpage.
The search engine for Tukaram Project is made
compat ible wit h bot h Lin ux an d Win dows.
Tukaram’s Abhangas were keyed in Akruti font and
were later converted to ISFOC font. ISCII encoding
is used by the Tukaram search engine.
To enable users to use the Tukaram search engine from both Windows and Linux, we chose the XDVNG font for the input technology while keeping ISFOC for display purposes only. Clients can key in their queries through the phonetic English scheme designed by Sibal (1996). Queries typed in phonetic English are displayed at the client's terminal in Marathi using the XDVNG font. The phonetic English to XDVNG mapping is available as JavaScript (jtrans). Since JavaScript is used for the mapping, it is easy to integrate it in a page.
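A longest-match-first mapping of the kind jtrans performs can be sketched as below. The mapping table is a tiny illustrative fragment, and Unicode Devanagari stands in here for the XDVNG glyph codes the real mapping emits.

```python
# Tiny illustrative fragment of a phonetic-English mapping; Unicode
# Devanagari code points stand in for XDVNG glyph codes.
MAPPING = {"aa": "\u0906", "a": "\u0905", "k": "\u0915"}
KEYS = sorted(MAPPING, key=len, reverse=True)  # try longest keys first

def transliterate(text):
    """Greedy left-to-right replacement; unknown characters pass through."""
    out, i = [], 0
    while i < len(text):
        for key in KEYS:
            if text.startswith(key, i):
                out.append(MAPPING[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Trying the longest keys first is what lets "aa" take precedence over two separate "a" matches, mirroring how phonetic schemes distinguish short and long vowels.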
A query in XDVNG is sent to the server at IIT Bombay and is converted into ISCII at the server. For this conversion, an XDVNG to ISCII converter provided by IIIT Hyderabad is used after modifications. The features of Project Tukaram are summarized in the table below:
Font and package used for original script : Akruti in MS Word
Number of converted HTML : 165
Number of converted HTML : 4614
Font used in HTML display pages : ISFOC
Total number of words excluding HTML tags : 2,12,442
Number of distinct words used for the indexing : 34,773
Keyboard mapping at client : Phonetic English
Query display technology at client : XDVNG
Result display technology at client : ISFOC
Encoding used for indexing : ISCII
Database server : MySQL
Languages used for converters : Lex and C
Language used in search engine : Java
Language used on client side : JavaScript
9. Automatic Language Identification of Documents
using Devanagari Script
The problem of language identification has been
addressed in the past. There are existing methods
like unique letter combinations, common words, and
N-grams technique. A subtle issue here is the length
of the text available for classification. Many methods
require longer texts for language identification. Also,
some methods require some a priori linguistic
knowledge. Another issue arises when the test set
for the classification program contains some semantic
errors. The classification technique should be robust
in such a way that these errors do not affect the
accuracy of the classifier.
The N-grams method has been developed based on a rank order of short letter sequences. A rank ordered list of the most common short character sequences is derived from the training document set. A rank ordered list is prepared for every language under consideration. Each such list is called a profile. Language identification is done by measuring the closeness of the test document to each of these profiles. The profile which is closest to the test document gives the language of the test document. Figure 1 shows the data flow diagram of the classifier. We have extended this approach for the Devanagari script, as the conventional approach was not giving the desired accuracy.
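The rank-ordered profile method just described can be sketched as follows. This is a simplified rendering of the classic N-gram profile technique (after Cavnar and Trenkle); the profile size and the toy training strings are illustrative, not the actual training setup.

```python
from collections import Counter

def profile(text, n=2, size=300):
    """Rank-ordered list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank displacements; unseen n-grams get a maximum penalty."""
    return sum(
        abs(rank - lang_profile.index(g)) if g in lang_profile else len(lang_profile)
        for rank, g in enumerate(doc_profile)
    )

def identify(text, lang_profiles):
    """The language whose profile is closest to the document wins."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

The Letter and Conjunct extensions below change only what counts as one "character" when the n-grams are cut; the rank-order comparison itself stays the same.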
Letter Approach: In this approach, each letter in Devanagari is treated as a character in an N-gram. That means an N-gram may contain more than "N" characters. On the other hand, an N-gram will be exactly "N" (possibly half) letters in Devanagari.
Conjunct Approach: This is a variant of the previous approach. We draw motivation for this approach from conjuncts. Indian scripts contain numerous conjuncts, which essentially are clusters of up to four consonants without the intervening implicit vowels. The shape of these conjuncts can differ from those of the constituting consonants. For example, .·| has the conjunct ·|. It comprises two consonants, | and ·. The consonant | is considered a half letter. In the Conjunct approach, a conjunct is considered a single letter.
Results: From experimental results, it could be observed that although the common words technique is computationally cheaper, the N-gram techniques are more accurate. It was also clear that our extensions to N-grams work much better than the conventional N-grams approach. The N-gram approach was suitable for another reason: the documents used in the experiments had textual errors, and as the N-grams method is quite robust to such errors, the results were largely satisfactory. In our approach, although developing the method itself required some linguistic knowledge, the method is completely automatic and does not require any fine-tuning as required in the case of the common words technique.
10. Object Oriented Parallel and Distributed Web Crawler
The vastness and dynamic nature of the WWW has
led to the need for efficient information retrieval. A
crawler’s task is to fetch pages from the web. The
crawler starts with an initial set of pages, retrieves
them, extracts URLs in them and adds them to a
queue. It then retrieves URLs from the queue in
some specific order and repeats the process. We have
designed and implemented an object oriented web
crawler. Our crawler has 4 component s: Graph
component , UrlManager, Buffer and Document
Processor. The crawler has a 4-tier architecture with
the storage file system at the lowest layer and crawl-
buffers at level 4. The architecture is shown in figure
below. The Graph component is at level 2 and it
interacts with the file system. The UrlManager and
Document Processor are at level 3. This layer forms
the business logic layer. The crawler architecture is
par allel an d di st r i bu t ed. I t u ses a CO RBA
environment and runs on a Linux Cluster.
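The fetch-extract-enqueue cycle described above can be sketched as a single loop. The stub `fetch` and `extract_urls` callables are hypothetical stand-ins for real HTTP retrieval and HTML link extraction; the real crawler distributes this loop over CORBA components.

```python
from collections import deque

def crawl(seeds, fetch, extract_urls, limit=100):
    """Breadth-first crawl: fetch a page, enqueue its outlinks, repeat."""
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        page = fetch(url)                  # stand-in for an HTTP GET
        pages[url] = page
        for link in extract_urls(page):    # stand-in for HTML link extraction
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `seen` set is what keeps the crawl from revisiting pages; in the distributed version this duplicate-detection role falls to the UrlManager component.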
User Level Storage System
A search engine requires a large amount of storage space for storing the crawled pages. Existing databases and file systems are not specialized to handle storage of structures such as web graphs. We are building storage systems for efficient storage and retrieval to meet this requirement. In the storage system, webpages correspond to the nodes of the graph and the hyperlinks correspond to the edges. Graph traversal methods can then be applied for crawling.
In the current version of the storage system, Berkeley Database (BDB) over Linux has been used as the backend for storing webpages. An advantage of BDB is that the library runs in the same address space and no inter-process communication is required for database operations. Secondly, because BDB uses a simple function-call interface for all operations, there is no query language to parse.
The storage system provides a set of APIs for application programs such as crawlers. The APIs keep users' applications independent of the implementation of the storage system. The storage system recognizes primitives such as nodes, which represent web pages, and graphs, which represent a specific WWW graph. The storage API encapsulates read and write functions. It provides functions such as create graph, store and read a webpage, parse the page and get outlinks.
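The storage primitives named above (store and read a webpage, get outlinks) might look as follows. The class and method names are hypothetical, and the in-memory dictionaries are a stand-in for the Berkeley DB backend.

```python
# In-memory stand-in for the BDB-backed store: webpages are graph
# nodes, hyperlinks are edges.
class WebGraphStore:
    def __init__(self):
        self.pages = {}   # url -> page content
        self.edges = {}   # url -> list of outlink urls

    def store_page(self, url, content, outlinks):
        """Write a node and its outgoing edges."""
        self.pages[url] = content
        self.edges[url] = list(outlinks)

    def read_page(self, url):
        """Read a stored node back."""
        return self.pages[url]

    def get_outlinks(self, url):
        """Edges out of a node; empty for unknown or leaf pages."""
        return self.edges.get(url, [])
```

An application such as the crawler programs only against this node/edge API, so the backend can change without touching the application code.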
11. Designing Devanagari Fonts
The following three types of 'Arambh' family fonts have been created:
• 8-bit TrueType font (.ttf)
• 16-bit Unicode bitmap font (.bdf)
• 16-bit Unicode TrueType font (.ttf)
Though there are many Devanagari fonts available, no single font contains all the glyphs required for displaying documents such as Damale's grammar and Saint Tukaram's abhangas. Hence a new font is being created to handle the additional requirements. The font, called 8-bit Arambh, is a TrueType font which covers all the required glyphs. In this font we get the scalability feature of TrueType fonts. The two types of 16-bit Arambh Unicode fonts developed are the ArambhU (bdf) font and the ArambhU (ttf) font. Each glyph in BDF is represented as a matrix of dots; these fonts are rendered faster than TrueType fonts. Each glyph in the TrueType font is represented as an outline curve, thus facilitating scalability.
12. Low Level Auto Corrector
We have built a low level syntactic checker and autocorrector based on a classification of low-level syntactic errors. The tool uses a set of rules to detect and correct the errors. It was used in the Tukaram project, where over 300 errors were detected and automatically corrected. The kinds of errors handled are visible and invisible errors due to multiplicity, ordering, mutual exclusion and association properties.
Types of Low Level Syntactic Errors
Below, a classification of low-level syntactic errors in a document written in an Indian language is provided, with a few examples. These errors may be attributed to violations of the constraints on multiplicity, ordering and composition of ligatures, especially of the matra signs.
Typically, ligatures such as matra, halant and anuswar are used only once in a single letter composition. While typing, it may happen that one gets typed twice and the result is still not visible in the text in an Indian language font. Such errors are classified as invisible errors, while others remain visible in the display.
Invisible Errors
For example, in the case of an extra anuswar, the source in ITRANS is given by aM, whereas aMM also produces the same display. The same Devanagari character is thus produced by two different encodings. While the result of this error is not visible in the display, it can be a cause of concern for an application such as a pattern matching algorithm, an index or a converter.
Invisible errors may also occur in the case of multiple ukar and matra characters. For example, in ·| an extra ukar is present but it is invisible.
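A rule of the kind the autocorrector applies can be sketched for the aM/aMM case above, working on an ITRANS-style romanization. The single rule shown is an illustration of how the rule base operates, not an entry taken from it.

```python
import re

# One illustrative rule: collapse a repeated anuswar marker ("M" in
# ITRANS) to a single occurrence, fixing the invisible aMM error.
DOUBLED_ANUSWAR = re.compile(r"M{2,}")

def autocorrect(text):
    """Apply the rule everywhere in the text."""
    return DOUBLED_ANUSWAR.sub("M", text)

print(autocorrect("aMM"))  # → "aM"
```

Since the doubled mark renders identically to the single one, only a rule-based pass of this sort (not visual inspection) can catch and repair such errors before they mislead indexers, converters or pattern matchers.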
Visible errors
While some ligatures overlap in display and the errors remain undetected by mere human inspection of the display, other errors may be detectable by display inspection. This is possible due to positioning adjustments done on ligature placements.
For example, an extra kana in Marathi, such as in the letter ·||, is visibly detectable if it occurs more than once, since the additional ligature is right shifted and either displayed as a kana or taken as a separate character. An example of the latter case is the encoding paaa in ITRANS, which is displayed as ·||~.
In some cases, more than one vowel can occur in a single letter. For example, an anuswar and a kana in Marathi, as in the letter ·||, may be typed in Akruti_Priya_Expanded in two ways, as hebe (where h stands for ·, e for | and b for anuswar), or as heeb. While the first choice results in an unaesthetic placement of the anuswar on top of the first vertical line, the second choice gives the desired result. With a smaller font size, the differences may get overlooked, but with an enlarged font size, they are easily detected.
Mutual Exclusion
In Indian languages, composition rules are also specified for mutually exclusive vowels. For example, an ukar and a kana cannot occur in one character. Similarly, a velanti and an ukar or a kana cannot occur in one letter, whereas it is possible for a kana and a matra to occur together in one letter.
Certain consonants do not accept rafar or specific conjuncts. For example, a combination of « and rafar is invalid. Similarly, it is not possible to combine certain consonants such as -| with other consonants, or r with ·|.
13. Font Converters
The following font converters have been developed by IIT Bombay:
• Akruti_Priya_Expanded to ISCII
• DV-TTYogesh to ISCII
• ISCII to DV-TTYogesh
14. Marathi Spell Checker
Under this ongoing project, a stand-alone spellchecker is being built for Marathi. The spellchecker will be available to spell check documents in a given encoding.
From the CIIL (Central Institute of Indian Languages) corpus, 12,886 distinct words have been listed. Similarly, other Marathi texts at the center are being used to build a basic dictionary. A morphological analysis is being carried out on the collection of words. For example, an automatic grouping algorithm identified 3,975 groups out of the 12,886 distinct words. The first word in a group is usually the root word. Thus, there are approximately 4,000 root words from the Marathi corpus. A manual proofreading will be done on these results, and the morphology will be enriched.
A motivation behind the stand-alone spellchecker is
that it can be used without an editor through a
packaged interface, or it can be integrated with other
compatible applications such as OCR.
Content creation tools
The Marathi content placed on the CFILT website, viz. Saint Tukaram's Abhangas, Soundarya Meemansa and Damle's Marathi grammar, has been typed using the Akruti software along with MS Word. Akruti provides three keyboard layouts: Typewriter, English Phonetic and Inscript. MS Word has the facility to convert Word files into web pages, and we use this convert option. The web pages created by MS Word have a lot of Internet Explorer specific style sheet tags and attributes. Due to this, the appearance of the Devanagari text on the web page differs between Internet Explorer on Windows and Netscape Navigator on Linux. Hence we remove those style sheet tags from the web pages using a Java program. The files are then uploaded to the web.
For the Marathi search engine, Marathi webpages were crawled from the web. The crawled webpages were in the DV-TTYogesh and DV-TTSurekh fonts. These font codings of the webpage text were converted into ISCII coding. An inverted index is used to store the words and their attributes, such as whether the word appears in bold or italic, whether it appears in the title of the webpage, and other attributes. MySQL is used for storing the inverted index. The user interface uses J-Trans, which is JavaScript code. The user types the query in phonetic English. When the user clicks the "search" button, the query words are looked up in the inverted index. The documents which contain all the query words are noted. The relevancy of a document is calculated according to the attributes of the query words in that document; for example, bold or italic words are considered more important than non-bold or non-italic words. The documents are then sorted in descending order of their relevancy and the results are displayed to the user. Thus the user gets the most relevant Marathi webpages.
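The lookup-and-rank flow just described can be sketched as follows. The index contents and the attribute weights are illustrative, not the engine's actual data or scoring.

```python
# Toy inverted index: word -> {document -> display attributes}.
index = {
    "pune":  {"doc1": {"bold": True},  "doc2": {"bold": False}},
    "batmi": {"doc1": {"bold": False}, "doc2": {"bold": False}},
}

def search(query_words):
    """Docs containing all query words, ranked by attribute score."""
    docs = None
    for w in query_words:
        found = set(index.get(w, {}))
        docs = found if docs is None else docs & found

    def score(doc):
        # a bold occurrence counts double, per the weighting above
        return sum(2 if index[w][doc]["bold"] else 1 for w in query_words)

    return sorted(docs or (), key=score, reverse=True)
```

Intersecting the posting sets enforces the "all query words" requirement, and the attribute score then orders the surviving documents.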
15. IT Localization
IT awareness
There was a meet in October 2001 with the media to apprise them of the technology development efforts on Indian languages going on in India and at the Resource Center at IIT Bombay. Leading newspapers like the Times of India, Maharashtra Times, Loksatta, Lokmat and Sakaal sent their representatives to this meet. The systems developed at the IITB center were demonstrated and presented, and the newspapers carried reports on this.
Technology solutions
IIT Bombay is providing its expertise for developing wordnets for Indian languages (please see details below). We recently taught the fundamentals of wordnet building at the Indo-WordNet workshop at CIIL Mysore. A licensing agreement for transferring the lexical data, the user interface and the API for the Hindi wordnet is being worked out with advice from the Ministry.
Interaction with State Government
The interaction is mainly with the Marathi Rajyabhasha Parishad, which is the authorized body of the Maharashtra state government for developments related to the Marathi language. Similarly, interaction has been going on with the Government of Goa for the Konkani language. We conducted a very successful language technology conference in Goa, the International Conference on Universal Knowledge and Language (ICUKL), with generous support from the MICT-TDIL; the state of Goa participated very actively and supported the event wholeheartedly.
16. Publications
Dipali B. Choudhary, Sagar A. Tamhane, Rushikesh
K. Joshi, A Survey Of Fonts And Encodings For Indian
Language Scripts, Int ernat ional Conference On
Multimedia And Design (ICMD 2002), Mumbai,
September 2002.
Shachi Dave, Jignashu Parikh and Pushpak Bhattacharyya, Interlingua Based English Hindi Machine Translation and Language Divergence, to appear in Journal of Machine Translation, vol. 17.
P. Bhattacharyya, Knowledge Extraction from Texts in Multilingual Contexts, International Conference on Knowledge Engineering (IKE02), Las Vegas, USA, June 2002.
Hrishikesh Bokil and P. Bhattacharyya, Language Independent Natural Language Generation from Universal Networking Language, Second International Symposium on Translation Support Systems, IIT Kanpur, India, March 2002.
Dipak Narayan, Debasri Chakrabarti, Prabhakar Pande and P. Bhattacharyya, An Experience in Building the Indo WordNet - a WordNet for Hindi, First International Conference on Global WordNet, Mysore, India, January 2002.
Shachi Dave and P. Bhattacharyya, Knowledge Extraction from Hindi Texts, Journal of the Institution of Electronics and Telecommunication Engineers, vol. 18, no. 4, July 2001.
P. Bhattacharyya, Multilingual Information Processing Using Universal Networking Language, Indo-UK Workshop on Language Engineering for South Asian Languages (LESAL), Mumbai, India, April.
17. The Team Members
Arti Sharma
Ashish F. Almeida
Deepak Jagtap
Gajanan Krishna Ranegk
Lata G Popale
Jaya Saraswati
Laxmi Kashyap
Madhura Bapat
Manish Sinha
Prabhakar Pandey
Roopali Nikam
Shraddha Kalele
Shushant Devlekar
Sunil Kumar Dubey
Satish A Dethe
Vasant Zende
Veena Dixit
M.Tech students
Dipali B. Choudhary
Nitin Verma
Sagar A. Tamhane
Ph.D. student
Debasri Chakrabarti
Courtesy: Prof. Pushpak Bhattacharya
Indian Institute of Technology
Department of Computer Science & Engineering
Mumbai- 400 076
(RCILTS for Marathi & Konkani)
Tel: 00-91-22-25767718, 25722545
Extn. 5479, 25721955
E-mail :
School of Mathematics and Computer Applications (SMCA)
Thapar Institute of Engineering & Technology
(Deemed University) Patiala-147001
Tel. : 00-91-175-2393382 E-mail :
Website : http:/ /
Resource Centre For
Indian Language Technology Solutions – Punjabi
Thapar Institute of Engineering & Technology, Patiala
Achievements
Thapar Institute of Engineering &
Technology, Patiala
The Resource Centre for Indian Language Technology Solutions - Punjabi was established in April 2000 at Thapar Institute of Engineering & Technology, working under the TDIL Programme of the Department of Information Technology. The Resource Centre is supported by the Ministry of Communications and Information Technology (MCIT) with the aim of working for technology development of Punjabi and also to provide access to millions of people across the globe. The main task before the resource centre has been the promotion of the Punjabi language, so as to extend the benefit of knowledge and awareness across the globe. It has since evolved into a multi-faceted research centre. Before the Punjabi Resource Centre was set up, very little work had been done on computerization of Punjabi, even though a large number of Punjabis settled in the USA, UK and Canada have been using computers for long. The only work done was the development of Punjabi fonts and some Punjabi websites. There was no Punjabi spell checker, Punjabi sorting utility, electronic dictionaries or Gurmukhi OCR. In the span of three years, we have developed Punjabi lexical resources, content creation tools and Gurmukhi OCR, and uploaded Punjabi text on the web. The details of the work done during the last three years are as follows:
1. Products Developed
1.1 Spell Checker : A spell checker is a basic necessity for composing text in any language. Till now, no spell checker was available for Punjabi; as a result, people had to waste a lot of time proofreading Punjabi text. This problem was more severe for publishers and writers residing abroad who used computers to type Punjabi material, as even Punjabi dictionaries are not easily available to them. To make matters worse, there has been no standardization of Punjabi spellings. In fact there are numerous words which are spelled in more than one way, some being written in 3-4 different ways. After discussions with Punjabi scholars, it was decided to retain multiple spellings for the commonly used words. A spell checker for Punjabi has been developed taking Harkirat Singh's Shabad Jor Kosh as the base, with additional words taken from the dictionaries published by Punjabi University and the State Language Department and from the corpus developed by CIIL, Mysore.
For generating the suggestion list, a study was conducted to discover the most common errors made by Punjabi typists. A list of similar sounding words and consonants was also compiled, and a suggestion list using this knowledge is generated based on reverse edit distance. It was found that the right suggestion is presented as the default suggestion in the majority of cases. In such a case, the user needs only to confirm the default suggestion and proceed to the next error. Otherwise, the user needs to scroll through a list of suggestions and pick one as the right one.
Fig. 1 : Spell Checker
The spell checker supports text typed in any of the popular Punjabi fonts as well as ISCII encoded files. The Punjabi spell checker is now complete, and an MOU is soon going to be signed with M/S Modular Systems for transfer of technology of the spell checker. Output of the spell checker is shown in Fig. 1.
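Suggestion ranking by distance to the misspelt word can be sketched as below. The candidate lexicon is a toy list, not the Shabad Jor Kosh data, and plain Levenshtein distance stands in here for the product's reverse-edit-distance heuristics.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def suggest(word, lexicon, k=5):
    """Closest entries first; the top one plays the default suggestion."""
    return sorted(lexicon, key=lambda w: edit_distance(word, w))[:k]
```

Presenting the closest candidate as the default is what lets the user usually just confirm and move on, as described above.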
1.2 Font Converter : The major development in computerization of Punjabi before the setting up of the Punjabi Resource Centre was the creation of Punjabi fonts. As a lot of Punjabis reside in the USA, Canada and the UK, a large variety of fonts was developed. In the absence of any standardization, each font had its own keyboard layout. The net result is that there is now chaos as far as the Punjabi language in electronic form is concerned. One can neither exchange notes in Punjabi as conveniently as in English, nor perform searches on Punjabi texts. This is because the texts are stored in font-dependent glyph codes, and the glyph coding scheme typically differs from font to font. To alleviate this problem, a font conversion utility has been developed at the Resource Centre. The utility supports more than sixty different Punjabi fonts and thirty-two keyboard layouts. A user may type text in any font and, using this utility, later convert it to any other font. This utility will be a great help to publishers, writers and people exchanging text in different fonts.
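The utility's internals are not described; the usual approach is to pivot through a font-independent code such as ISCII, so that sixty fonts need only sixty mapping tables rather than a converter for every font pair. A sketch follows; the font names and byte values are invented for illustration.

```python
# Invented glyph tables: each proprietary font assigns its own byte values
# to Gurmukhi glyphs, so one table per font maps those bytes to ISCII.
FONT_A_TO_ISCII = {0x41: 0xB3, 0x42: 0xB4}   # hypothetical source font
ISCII_TO_FONT_B = {0xB3: 0x61, 0xB4: 0x62}   # hypothetical target font

def convert(data: bytes, src_map: dict, dst_map: dict) -> bytes:
    """Convert font-dependent glyph codes via the ISCII pivot."""
    return bytes(dst_map[src_map[b]] for b in data)

print(convert(bytes([0x41, 0x42]), FONT_A_TO_ISCII, ISCII_TO_FONT_B).hex())
```

Real Punjabi fonts also use ligature glyphs and reordered vowel signs, so actual tables map variable-length glyph sequences rather than single bytes.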
1.3 Sorting Utility : Sorting and indexing is one of the basic necessities of database management systems, for tasks such as maintenance of students' records or arranging a dictionary in alphabetic order. But unfortunately no software exists for automatic sorting of Gurmukhi words, and all such work has to be done manually. The collating sequence provided by Unicode or ISCII is not adequate, as it is not compatible with the traditional sorting of Gurmukhi words. Gurmukhi, like other Indian languages, has a unique sorting mechanism. Unlike English, consonants and vowels have different priorities in sorting. Words are sorted by taking the consonant's order as the first consideration and then the associated vowel's order as the second consideration. In addition, there is another complication. Properly sorting "characters" in Gurmukhi often requires treating multiple (two or three) code points as a single sorting element. Thus we cannot depend on the character encoding order to get correct sorting; instead, using the sorting rules of Gurmukhi, we have to develop a collation function which converts the word into some intermediate form for sorting. After discussions with linguists and a detailed study of the alphabetic order in Punjabi, we have developed a general sorting algorithm which works on text encoded in any popular Gurmukhi font or ISCII.
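The collation function described above can be sketched as follows. The weights and the Latin-transliterated units are invented for illustration; the real table orders the full Gurmukhi alphabet and groups two or three code points (for example a consonant plus nukta) into one sorting element.

```python
import re

# Invented weights: consonant order dominates, vowel-sign order breaks ties.
CONSONANT_RANK = {"k": 1, "kh": 2, "g": 3}
VOWEL_RANK = {"": 0, "a": 1, "i": 2}

def collation_key(word):
    """Convert a word into its intermediate sortable form: a list of
    (consonant weight, vowel weight) pairs. Note that the two-letter
    unit 'kh' is treated as ONE sorting element, mirroring the
    multi-code-point elements of real Gurmukhi."""
    units = re.findall(r"(kh|[kg])([ai]?)", word)
    return [(CONSONANT_RANK[c], VOWEL_RANK[v]) for c, v in units]

print(sorted(["khi", "ka", "gi", "k"], key=collation_key))
# ['k', 'ka', 'khi', 'gi']
```

Sorting by such keys is exactly the "intermediate form" strategy the text describes: encoding order never enters the comparison.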
1.4 Bilingual Punjabi/English Word Processor: A bilingual Punjabi/English word processor has been developed keeping in view the difficulties faced by users while working on the currently available word processors. It was found that, as far as Punjabi is concerned, no word processor had been developed which addresses the word processing requirements typical of the Punjabi language, such as a Punjabi spell checker, Punjabi thesaurus and dictionary, virtual Punjabi keyboards, and Punjabi sorting and font conversion utilities. In this direction, efforts were made and a Punjabi word processor, Likhari, was developed. Likhari supports word processing under the Windows environment and allows typing and processing in the Punjabi language through the common typewriter keyboard layout. It has MS-Word compatible features and commands. It provides a number of features that make the use of the Punjabi language on a computer easy, and a number of tools to increase the efficiency of the user. These tools include a bilingual spell checker with suggestion list, on-screen keyboard layouts with composition reference for Punjabi typing, bilingual search and replace, sorting as per the language's alphabetical order, technical glossaries and on-screen bilingual dictionaries. The main features of Likhari are :
• Very simple user interface
• Online active keyboard for users who do not know how to type in Punjabi.
• Choice of Phonetic, Remington and Alphabetic keyboard layouts with composition reference.
• Bilingual spell checker for Punjabi and English.
• Bilingual search and replace.
• Support for sorting the text in English or Punjabi as per the language's alphabetical order.
• Support for more than 60 commonly used Punjabi fonts.
• Support for features like tables, numbering, bullets, character & paragraph formatting, page setup, print preview, header and footer etc.
• Online Technical English-Punjabi glossary.
• Support for ISCII, .TXT, .DOC, .RTF and .HTML file formats.
• Extensive help at various levels to make it easy for the user to learn.
Fig 2 : A Screen Shot of Punjabi Word Processor Likhari
1.5 Gurmukhi OCR : Optical Character Recognition (OCR) is the process whereby typed or printed pages can be scanned into computer systems, and their contents recognized and converted into machine-readable code. Text which the machine can read has great advantages over text merely displayed as an image, since it can be edited, exported to other programs and indexed for retrieval on any word. If one needs to electronically store and manipulate large amounts of printed matter such as newspapers, contracts, letters, faxes and price lists, or to develop a corpus, one will find that OCR programs can save a lot of effort. For the first time, a complete OCR package for the Gurmukhi script has been developed. The OCR has a recognition accuracy of more than 97%. It can automatically detect and correct skewed pages, detect page orientation, and recognize multiple Punjabi fonts and sizes.
The main features of the Gurmukhi OCR are:
Recognition Accuracy
Recognition accuracy is around 97% for books, photocopies and medium degraded documents, and around 99.5% for laser printouts and good quality documents.
Import Image
B/W images, 24-bit colour images, 256 grayscale images.
Output File Format
ISCII / text file encoded in any one of the popular Punjabi fonts.
Fonts Supported
All non-decorative Gurmukhi fonts.
On-screen verifier
Additional Features
Inbuilt spell checking facility. Automatic skew detection and correction (skew range –5 to +5 degrees). Upside-down image auto detection and correction.
Fig. 3 : A Screen shot of Gurmukhi OCR
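The article does not say how skew is detected. A common projection-profile technique that fits the stated –5 to +5 degree range is sketched below: try candidate angles and keep the one whose horizontal projection has the sharpest peaks, i.e. where the text rows line up. This is a generic method, not necessarily the one used in the Gurmukhi OCR.

```python
import math

def estimate_skew(ink, angles=None):
    """Return the angle (degrees) with the sharpest horizontal projection.

    `ink` is a list of (x, y) coordinates of black pixels. Rotation is
    approximated by a vertical shear, which is accurate for small angles.
    """
    if angles is None:
        angles = [a / 4 for a in range(-20, 21)]   # -5.0 to +5.0 in 0.25 steps
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        t = math.tan(math.radians(a))
        profile = {}
        for x, y in ink:
            row = int(y + x * t)                   # sheared row index
            profile[row] = profile.get(row, 0) + 1
        # Tall, narrow peaks (aligned text rows) maximize the sum of squares.
        score = sum(n * n for n in profile.values())
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```

Once the angle is known, the page is rotated back by that amount before segmentation; upside-down detection can then be done by comparing profiles of the page and its 180-degree rotation.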
1.6 Gurmukhi to Shahmukhi Transliteration: The Punjabi language is used in both parts of Punjab, in India and Pakistan. In East Punjab (India), Punjabi is written in the Gurmukhi script. This script was invented by Guru Angad Dev ji and is written from left to right. In West Punjab (Pakistan), Punjabi is written in the Persian script, also known as the Shahmukhi script, which is written from right to left like Urdu and Persian. In East Punjab many Sanskrit words have been absorbed into the Punjabi language, and similarly in West Punjab many Persian and Urdu words are used, but the essence of the Punjabi language remains the same. Both Shahmukhi and Gurmukhi have been in use simultaneously. After the Partition, Punjabi on the Indian side of the border was restricted officially to the Gurmukhi script, and on the Pakistani side to the Shahmukhi script. The result is that a script-wall has come up between the two sides of the Punjab, which prevents cultural and literary exchanges; there is ignorance on both sides about developments in contemporary prose and poetry on the other side. To break this wall, it is necessary to develop transliteration programs which can automatically convert Gurmukhi text to Shahmukhi and vice versa. In this direction, the Punjabi Resource Centre has collaborated with the Urdu Resource Centre established at CDAC, Pune to develop a computer program which automatically converts Gurmukhi text into Shahmukhi. Using this software, a collection of short stories penned by Mr. K. S. Duggal in Gurmukhi has been converted into Shahmukhi (Figs 4-5).
Fig 4 : A Short Story in Gurmukhi
Fig 5 : The Short Story of Fig. 4 converted to Shahmukhi
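At its core such a transliterator is a character-mapping pass; the sketch below shows a few genuine Gurmukhi-to-Perso-Arabic letter correspondences, but it does not reproduce the rule set of the actual program. A usable system needs contextual rules on top, since short vowels are usually dropped in Shahmukhi and some Gurmukhi letters map to more than one Perso-Arabic letter.

```python
# A few illustrative Gurmukhi -> Shahmukhi letter mappings.
G2S = {
    "\u0A15": "\u06A9",  # GURMUKHI KA  -> ARABIC KEHEH
    "\u0A2A": "\u067E",  # GURMUKHI PA  -> ARABIC PEH
    "\u0A28": "\u0646",  # GURMUKHI NA  -> ARABIC NOON
    "\u0A3E": "\u0627",  # GURMUKHI VOWEL SIGN AA -> ARABIC ALEF
}

def gurmukhi_to_shahmukhi(text):
    """Map each Gurmukhi character; pass anything unmapped through."""
    return "".join(G2S.get(ch, ch) for ch in text)

# 'paan' (pa + aa + na) in Gurmukhi becomes peh + alef + noon.
print(gurmukhi_to_shahmukhi("\u0A2A\u0A3E\u0A28"))
```

The right-to-left rendering of the output is handled by the display layer, not by the transliterator itself.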
2. Contents Uploaded on Internet
For the benefit of the common man, Punjabi literature, dictionaries and products for free download have been uploaded on the Resource Centre's website (http://punjabirc.tiet.ac.in). The contents uploaded on the internet are:
2.1 Literature : Punjabi classics such as Bullehshah Dian Kafian, Farid De Salok, Heer, Luna, Chandi di War and Japji Sahib have been uploaded in Punjabi using Punjabi dynamic fonts. The website contains detailed descriptions of these classics. Tool tips for difficult words have also been provided in Punjabi. Audio clips of these classics have also been included.
2.2 Bilingual Dictionaries and Glossary : The following bilingual dictionaries and glossary have been uploaded at our site.
• Punjabi English On-line Dictionary : A Punjabi-English dictionary has been made available on the website. The dictionary has about 40,000 Punjabi words. It has sample sentences for common words, and audio clips of the pronunciation of Punjabi words have also been provided. The dictionary is accessible in both typewriter layout and phonetic layout. One can search for a complete match or a pattern in the dictionary.
• English Punjabi On-line Dictionary: An English-Punjabi dictionary containing about 45,000 words has been made available on the website.
• Hindi Punjabi On-line Dictionary: A Hindi-Punjabi dictionary containing about 45,000 words has also been made available on the website.
• Glossary of English-Punjabi administrative terms : A technical glossary of around 17,000 English-Punjabi administrative terms has also been uploaded. A CD has also been developed, and the glossary has been installed on computers in many offices of the Punjab State Government.
2.3 On Line Teaching of Punjabi : For the benefit of Punjabis settled abroad and others interested in learning Punjabi, a website for on-line teaching of Punjabi has also been developed. Work is complete on Gurmukhi orthography, Punjabi pronunciation rules and a limited vocabulary, provided in both text and pictorial format along with audio effects. The letters of the alphabet can also be learnt through animations which trace each letter as it is drawn by hand.
2.4 On Line Font Conversion Utility : An online font conversion utility has been provided. The user can paste Punjabi text encoded in any of the supported sixty Punjabi fonts and convert it to the desired Punjabi font.
• Punjabi Spell Checker – A Punjabi-English spell checker has been uploaded on the home page, and any user can download and install the spell checker on his system.
• Punjabi Fonts – Two Punjabi fonts (LIKHARI_P and LIKHARI_R) for phonetic and keyboard layouts have been developed and made available for free download. The fonts are available in both True Type and Dynamic formats.
3. Interaction with Punjab State Government
A five day training programme was organised for the staff of the State Language Department. They were given training in word processing, email and the internet. We have been in constant touch with the Secretary IT, Punjab, and a demonstration of the products developed at our RC was given to him. The Punjabi word processor and Gurmukhi OCR were installed on the systems in the office of the Secretary IT, Punjab. We have been providing technical support to the State Language Department. Besides providing training to their staff, we have also set up their website in Punjabi. A CD of English-Punjabi administrative terms has been developed for the State Language Department, and the CD has been installed in the Patiala DC office and the State Revenue office.
4. Publications
1. G. S. Lehal, "State of Computerization of Punjabi", Proceedings Second World Punjabi Conference, Prince George, Canada (2003). (Accepted for publication)
2. G. S. Lehal, "A Gurmukhi Collation Algorithm", Journal of CSI. (Accepted for publication)
3. G. S. Lehal, Chandan Singh and Renu Dhir, "Structural feature-based approach for script identification of Gurmukhi and Roman characters and words", Document Recognition and Retrieval X, Proceedings SPIE, USA, Vol. 5010 (2003). (Accepted for publication)
4. G. S. Lehal and Chandan Singh, "A complete OCR system for Gurmukhi script", Structural, Syntactic and Statistical Pattern Recognition, T. Caelli, A. Amin, R.P.W. Duin, M. Kamel and D. de Ridder (Eds.), Lecture Notes in Computer Science, Vol. 2396, Springer-Verlag, Germany, pp. 344-352, (2002).
5. G. S. Lehal, Chandan Singh and Renu Dhir, "Automatic separation of Gurmukhi and Roman script words", Proceedings Indo-European Conference on Multilingual Communication Technologies, Pune, R. K. Arora, M. Kulkarni and H. Darbari (Editors), Tata McGraw-Hill, pp. 32-38, (2002).
6. G. S. Lehal and Chandan Singh, "A Post Processor for Gurmukhi OCR", SADHANA Academy Proceedings in Engineering Sciences, Vol. 27, Part 1, pp. 99-112, (2002).
7. G. S. Lehal and Chandan Singh, "Text segmentation of machine printed Gurmukhi script", Document Recognition and Retrieval VIII, Paul B. Kantor, Daniel P. Lopresti, Jiangying Zhou, Editors, Proceedings SPIE, USA, Vol. 4307, pp. 223-231, (2001).
8. G. S. Lehal and Chandan Singh, "A technique for segmentation of Gurmukhi text", Computer Analysis of Images and Patterns, W. Skarbek (Ed.), Lecture Notes in Computer Science, Vol. 2124, Springer-Verlag, Germany, pp. 191-200, (2001).
9. G. S. Lehal, Chandan Singh and Ritu Lehal, "A shape based post processor for Gurmukhi OCR", Proceedings of 6th International Conference on Document Analysis and Recognition, Seattle, USA, IEEE Computer Society Press, USA, pp. 1105-1109, (2001).
10. G. S. Lehal and Nivedan Bhatt, "A recognition system for Devnagri and English handwritten numerals", Advances in Multimodal Interfaces – ICMI 2001, T. Tan, Y. Shi and W. Gao (Editors), Lecture Notes in Computer Science, Vol. 1948, Springer-Verlag, Germany, pp. 442-449, (2000).
11. G. S. Lehal and Chandan Singh, "A Gurmukhi script recognition system", Proceedings 15th International Conference on Pattern Recognition, Barcelona, Spain, IEEE Computer Society Press, California, USA, Vol. 2, pp. 557-560, (2000).
5. The Team Members
Dr. R. K. Sharma
Dr. G. S. Lehal
Dr. Rajesh Kumar
Rajeev Kumar
Ramneet Mavi
Deepshikha Goyal
Nivedan Bhatt
Sukhwant Kaur
Ramanpreet Singh
Parneet Cheema NA
Pallavi Dixit NA
Rupsi Arora NA
Yoginder Sharma NA
Pooja Dhamija NA
Zameerpal kaur NA
Dr. Kuljeet Kapoor NA
Dr. Devinder Singh NA
Karamjeet Kaur NA
Jaspal Singh NA
Manpreet Singh NA
Rakesh K. Dawra NA
Shallu Kalra NA
Kuldeep Kumar NA
Baljit Singh NA
Aarti Gupta NA
Sunita Sharma NA
Surjit Singh NA
Courtesy: Prof. R.K. Sharma
Thapar Institute of Engineering & Technology
Department of Computer Science & Engineering
Patiala 147 001
(RCILTS for Gurmukhi)
Tel: 00-91-175-2393137, 393374, 2283502
E-mail :
Resource Centre For
Indian Language Technology Solutions – Bengali
Indian Statistical Institute, Kolkata
Computer Vision and Pattern Recognition Unit
203, Barrackpore Trunk Road
Indian Statistical Institute, Kolkata-700108
Tel. : 00-91-33-25778086 Extn. 2852
E-mail :
Website : http:// ~rc_bangla
Achievements
The main objective of this project was resource and technology development for all aspects of language processing for Eastern Indian languages, particularly Bangla. It included corpus development, font generation, website generation, OCR development, information retrieval system development etc., as well as training of people in Indian language technology through courses and workshops.
Outcome in Physical Terms
• Corpus of Bangla Document Images, Bangla Text Corpus in Electronic Form and Corpus of Bangla Speech Data
• Website on Eastern Indian Language Technologies, including a Bangla Language Technology design guide
• Bangla Font Package
• OCR System for Oriya and Assamese
• Information Retrieval System for Bangla Electronic Documents
• Multi-lingual Script Line Separation system
• Automatic Processing system for Handprinted Table-Form Documents
• Neural Network based tools for printed documents (in eastern regional scripts)
Production Agencies with which Memorandum of Understanding/link up has been established
MoU has been signed with (i) Webel Mediatronics Ltd, Kolkata (speech synthesis technology), (ii) Centre for Development of Advanced Computing, Pune (Devanagari and Bangla OCR technology), (iii) Electronics Research & Development Centre India, Noida (Devanagari OCR technology), (iv) Orissa Computer Application Centre, Bhubaneswar (Oriya OCR technology), and (v) Indian Institute of Technology, Guwahati (Assamese OCR technology). Link up for data source has been established with Bangla Academy.
1. Core Activities
1.1 Web Site Development and The Language Design Guide
• Creation of a web site on Eastern Indian Languages and Language Technologies for information of people interested in TDIL
The name of this site is Resource Centre for Indian Language Technology Solutions – Bangla. The URL of this site is:
At present, this site contains details about this MIT project (Resource Centre for Indian Language Technology Solutions – Bangla) and a brief description of the products developed at Indian Statistical Institute, Kolkata. A Bangla design guide is also provided on the site so that new technology for Bangla can maintain a common framework. It covers details of the origin and development of the script, the alphabet, character statistics, information related to fonts, presentation and storage considerations, information related to the Unicode encoding of Bangla, etc. The local language academy (Bangla Academy) has been contacted. A prototype of a Web-based front-end to our spell-checker and phonetic dictionary has been developed using an evaluation copy of CDAC's GIST Software Development Kit (SDK). An initiative has been taken so that this front-end can be put on the Internet for public use with the GIST SDK and iPlugin (or Modular Software's Shree Lipi). Issues related to hosting this web site on another server/Web hosting service are currently being explored.
1.2 Training Programmes
• Education through Training Programmes and Workshops
Two international workshops under the banner of "International Workshop on Technology Development in Indian Languages (IWTDIL)" were held during March 26-30, 2001 and January 22-24, 2003, respectively. Some main topics covered in these workshops were: (a) Machine translation from English to Indian languages (particularly to Bangla),
(b) Text corpus generation, design and annotation, (c) Speech synthesis, processing and recognition, (d) OCR technology for Indian scripts, (e) Handwritten character recognition, (f) Document processing and analysis, (g) Some general features of Indian languages (e.g., phonology and acoustics of Bangla phonemes and diphthongs), and (h) Anaphora resolution and ellipsis in Hindi and other
Indian Languages. In IWTDIL’01, the first three
days had introductory talks and tutorials on various
areas of language technology and the last two days
featured lectures by international experts on these
subjects. A total of twenty-nine speakers presented
talks at the workshop and there were seventy-five
participants from India and abroad. In IWTDIL’03,
four distinguished scientists from abroad and six
from India delivered lectures to forty-five
participants. A cultural programme consisting of a
play by a renowned theatre group of Kolkata was
also arranged.
One national workshop entitled "Indian Language Spell-checker Design" was organized during July 18-19, 2002. The main theme of the workshop was to present the work on spell-checkers done by various Resource Centres and groups for different Indian languages. The participants demonstrated the spell-checker software developed by them. It was anticipated that a benchmarking method for spell checkers would evolve out of the workshop. Twenty-five participants attended and presented lectures in the workshop.
2. Services
2.1 Corpus Development
• Printed Bangla Document Images & Ground Truth for OCR and Related Research
Some famous Bangla novels and books have been selected to prepare a Bangla document image corpus along with ground truth. A brief description of the selected novels and books is given below:
Maitreya Jatak – This is a famous Bangla mythological novel written by Bani Basu, a popular Bangla writer. As far as the linguistic aspect goes, the overall language of the book is old Bangla. The polished and chaste form (Sadhu bhasa) of narration is used throughout the book, and some archaic forms (terms mostly derived from Sanskrit and Prakrit sources) are also used, mainly to create an atmosphere of the period of Goutama Buddha, which the book depicts. Ananda Publishers, one of the largest publishers of Kolkata, has published this book. The quality of the printing is excellent. A computer generated font developed by Ananda Publishers has been used in printing.
Pratham Alo – This is also a famous Bangla novel, written by the famous Bangla writer Sunil Gangopadhyaya. The incidents of the novel are nearly one hundred and fifty years old, and the theme is related to the Bangla renaissance. Standard Bangla language has been used for this book. The percentage of archaic words is much less than that in Maitreya Jatak. This book is also published by Ananda Publishers, but using offset printing technology. The overall printing quality is good but worse than that of Maitreya Jatak.
Bhasa Desh Kal – The book is written by Dr. Pabitra Sarkar, a distinguished author on Bangla linguistics. The language of the book is standard Bangla colloquial. The publisher of the book is Mitra and Ghosh, Kolkata. The printing quality is moderate, and worse than that of Pratham Alo.
Upendra Kishore Rachana Samagra – This book is written by the famous author of juvenile literature Shri Upendra Kishore Roychowdhuri. Old formal and polished Bangla language is used in this book. Here, an offset font is used, and the book has been published by Ananda Publishers of Kolkata.
Amar Jibanananda Abiskar o Anyanya – Sunil Gangopadhyaya is the author of this book. The language of the book is standard Bangla colloquial. Ananda Publishers has published this book. The printing quality is good, and offset printing technology is used.
Rajarshi – This is one of the classic novels written by Rabindranath Tagore. Old formal and polished Bangla language is used in this novel. Ashoke Book Agency of Kolkata is the publisher of the book. The printing quality is not as good as that of the Ananda Publishers books.
Bangla Choto Galpa – The book is a collection of Bangla short stories written by different writers of old Bangla literature and edited by Rabindranath Tagore. Old formal language (Sadhu bhasha) is used in most of the stories. The book is published by Model Publishing House. The printing quality is good.
Punarujjiban – It is the Bangla version of a novel written by the famous Russian novelist Lev Tolstoy. The book is published by Raduga Prakashan, Moscow. The printing quality is good, and modern colloquial language is used.
Bhraman Amnibus – The book is written by Sri Uma Prasad Mukhopadhyaya. It contains detailed descriptions of different places in the Himalayas. Modern Bangla language is used in this book. It is published by Mitra & Ghosh, Kolkata. Offset printing technology is used.
Parakiya – This book is a collection of Bangla short stories written by both old and modern writers of Bangla literature and edited by Sunil Gangapadhyaya. The publisher of the book is Punashcha. Both old formal and modern colloquial language are used in different stories of the book. The printing quality is not so good.
Amar Debottar Sampatty – The book is an autobiography of Nirod Chandra Chaudhuri, published by Ananda Publishers. Offset printing technology and old formal language are used.
Prabasi Pakhi – It is a Bangla story book written by Sunil Gangapadhaya and published by Ananda Publishers. Offset printing technology and modern colloquial language are used.
Mahabharat Katha – The book is published by Udbodhan Publication of the Ramkrishna Mission. Modern Bangla language has been used in this book. (Table 1 lists the books scanned.)
OCR – An HP ScanJet flatbed scanner of high resolution (300 dpi ordinary and gray-level spectral resolution) has been used to get the image documents. The images are then saved in uncompressed TIFF format, and are generally scanned using the same settings. Minor adjustment of brightness and contrast may be done using Corel Photopaint software.
Table 1
Name of Books             No. of Pages   Total No. of Words
Maitreya Jatak            419            2,43,020
Pratham Alo               37             18,833
Bhasa Desh Kal            15             5,760
Upendra Kishore           330            1,38,270
Amar Jibanananda          21             7,812
Rajarshi                  100            39,800
Bangla Chotogalpa         260            1,11,800
Punarujjiban              68             25,704
Bhraman Amnibus           31             14,198
Parakiya                  163            85,412
Amar Debottar Sampatty    29             10,730
Prabasi Pakhi             10             4,020
Mahabharat Katha          4              1,304
Total                     1,487          7,06,663
The page-by-page scanned images are then fed as input to the Bangla module of the existing bilingual (Hindi and Bangla) Optical Character Recognition (OCR) system developed in our department. Each character of these input images is recognized by the OCR system, and the results are stored as text documents in 8-bit ISCII format.
The present OCR system gives an accuracy of 96% to 98% at the character level, depending on the paper quality and font style of the books. The remaining 2% to 4% error is corrected manually to prepare the text documents, which serve as ground truth.
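Character-level accuracy against the corrected ground truth can be computed in the usual way, as matched characters over the ground-truth length; this is a generic definition, not necessarily the exact metric behind the reported 96-98%.

```python
from difflib import SequenceMatcher

def char_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Character-level accuracy: matched characters / ground-truth length."""
    blocks = SequenceMatcher(None, ocr_text, ground_truth).get_matching_blocks()
    matched = sum(b.size for b in blocks)   # total length of aligned runs
    return matched / len(ground_truth)

# One wrong character out of five -> 80% accuracy.
print(char_accuracy("abcdX", "abcde"))  # 0.8
```

Using an alignment (rather than position-by-position comparison) keeps the measure fair when the OCR inserts or drops characters.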
Benchmark software has also been developed to give a specified format to all ground truths generated using Indian OCR technologies, for automatic evaluation of the different OCR technologies. The database formed in this way is shown in Figure 1.
Figure 1: The image is at the left hand side and the corresponding ground truth is at the right hand side
• Development of Bangla Text Corpus in Electronic Form including a Bangla dictionary and several Bangla classics
Several novels have been entered into the computer in ISCII format. The description of the novels, the corresponding authors and the total number of words are given in tabular form (Table 2). More than 34,000 words of a bilingual (Bangla-English) dictionary have also been entered. A comprehensive corpus for an electronic Bangla-Bangla dictionary (with 65,000 words) has been constructed and checked. Using the guidelines of the above two dictionaries, the creation of a trilingual (Bangla-English-Hindi) dictionary has started. Till today, 17,000 words with meaning, parts of speech and other information have been entered. However, the lack of a standard size Bangla-Hindi dictionary in printed form has created some problems (the Bangla-Hindi dictionaries available in the market are small, containing only about 7,000 words). We have also designed a prototype of an electronic thesaurus for Bangla (based on WordNet, a well-known electronic resource for the English language).
Papers Published
• Dash, N.S. and Chaudhuri, B.B. (2001) “A
corpus based st udy of t he Bangla language”.
Indian Journal of Linguistics. 20: 19-40.
• Dash, N.S. and Chaudhuri, B.B. (2002) "Corpus generation and text processing". International Journal of Dravidian Linguistics. 31(1): 25-44.
Table 2
1. Author's Name
Iswarchandra Vidyasagar is known as one of the great social reformers and philanthropists of Bengal, and the father of the Bengali prose style. He played an important role in primary education, widow remarriage and women's education.
Name of the Classic & Publisher
Vidyasagar Rachanabali (Tuli – Kalam)
Name of Articles (Story, Novel, drama, etc.)
Betal Panchabingshati
Niskritilabh Prayash
Provaboti Sambhashan
Sanskrit Bhasa O Sankriti
Sitar Banobas
Balyabibaher Dosh
Ramer Rajjyabhishek
Banglar Itihas
No of Words (per Classic)
Sub. Total : 315783
2. Author’s Name
Bankim Ch. Chattopadhyay is considered the first and one of the greatest novelists in the Bangla language. He wrote 14 novels besides a large number of essays on various literary and social issues. He wrote the 'Vande Mataram' song, which played a major role in the Indian independence struggle.
Name of the Classic & Publisher
Bankim Rachanabali (Patrajo Publication)
Name of Articles (Story, Novel, drama, etc.)
Muchiram Gur
Bibidha Prabandha (Vol. 1)
Bibidha Prabandha (Vol. 2)
Krishnacharita (Vol 1)
Krishnacharita (Vol 2)
No of Words (per Classic)
Sub. Total : 320232
3. Author’s Name
Sarat Chandra Chattopadhyay is probably the most popular novelist of Bengal. His novels sympathetically depicted the weaker sections of society, including women.
Name of the Classic & Publisher
Sarat Rachanabali (Sarat Samiti)
Name of Articles (Story, Novel, drama, etc.)
No of Words (per Classic)
Sub. Total : 246032
4. Author’s Name
Michael Madhushudan Dutt was a great poet and playwright of Bengal. He wrote a few English sonnets and poems. He introduced the "Amitrakshar" metre (blank verse) in Bangla poetry. His famous work is the epic named 'Meghnadbadh'.
Name of the Classic & Publisher
Madhushudan Rachanabali (Kallol Prakashani)
Name of Articles (Story, Novel, drama, etc.)
Buroshaliker Ghare Row
Krishnakumerir Natak
Padhyabatir Natak
No of Words (per Classic)
Sub. Total : 54089
5. Author's Name
B. B. Chaudhuri
Name of the Classic & Publisher
Computer Dictionary (Ananda Publishers)
No of Words (per Classic)
Sub. Total : 95435
6. Author's Name
S. Biswas et al.
Name of the Classic & Publisher
Samsad Dictionary (Sahittya Samsad)
No of Words (per Classic)
Sub. Total : 167941
Grand Total : 883729
• Electronic Corpus of Speech Data
The composition of existing speech databases for English has been studied. Speech data has been categorized into several classes based on criteria such as sex, age and region of the speaker, place of data collection, whether the source of the spoken material is a written script, etc. Based on these, the composition of the database has been designed. All India Radio, Calcutta has been designated as a potential source of audio material. However, the speech lab, in whose controlled environment the data could be generated, is yet to be set up due to the delayed sanction of the
2.2 Font Generation and Associated Tools
• Public Domain Bangla Font Generation
The overall font generation and editor development process can be divided into the following modules:
1. Designing the exhaustive glyph set.
2. Converter program from Font file to ISCII file
and vice-versa
3. Designing a Bangla text editor
4. Designing a floating Keyboard
Background Of Designing The Exhaustive Glyph Set : Every language has its own character set, and many of them can be represented within 256 characters using individual 8-bit character sets, including the Indic scripts using the ISCII code. However, for unification of all these languages, the Unicode consortium recently proposed a 16-bit character encoding scheme, in which characters can be assigned to 2¹⁶ code positions. The Government of India is a member of the Unicode Consortium and has been engaged in a dialogue with the UTC about additional characters in the Indic blocks and improvements to the textual descriptions and annotations. Unicode is designed to be a multilingual encoding that requires no escape sequences or switching between scripts. For any given Indic script, the consonant and vowel letter codes of Unicode are based on ISCII, so they correspond directly. For Bangla, Unicode 3.0 provides places 0980 to 09FF. One Bangla True Type font (called ISI.ttf) has been developed in this centre; it is compliant with ISCII and can be upgraded to support Unicode. The font was designed and generated with the help of ALTSYS Fontographer 3.0. The following Bangla orthographic characters can be written with this font :
1. Vowels (11)
2. Consonants (38)
3. Vowel Matras (10)
4. Halant
5. Conjuncts: Bangla contains numerous conjuncts
(250+), which essentially are clusters of up to four
consonants without the intervening implicit
vowels. The shape of these conjuncts can differ
from those of the constituent consonants.
6. Punctuation: (12)
7. Numerals (10)
Converter Program from Font File to ISCII File
and Vice-versa : The converter program has two
submodules. The first one takes a font-encoded string
as input and delivers an ISCII-encoded string as
output. The second one takes an ISCII-encoded
string as input and gives a font-encoded string as
output.
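The table-driven conversion described above can be sketched as follows. The glyph codes and ISCII byte values below are purely illustrative stand-ins, not the actual ISI.ttf or ISCII assignments; the point is the longest-match substitution shared by both submodules.

```python
# Sketch of the two converter submodules: a glyph-code <-> ISCII mapping table
# applied by greedy longest-match substitution in either direction.
# All byte values here are invented for illustration.

FONT_TO_ISCII = {
    "\x41": "\xa4",                  # e.g. a vowel glyph -> one ISCII byte
    "\x42\x43": "\xb3\xe8\xb3",      # a conjunct glyph -> consonant+halant+consonant
}
ISCII_TO_FONT = {v: k for k, v in FONT_TO_ISCII.items()}

def convert(text, table):
    """Greedy longest-match substitution using the given mapping table."""
    keys = sorted(table, key=len, reverse=True)  # try longer clusters first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:                                    # unmapped code passes through
            out.append(text[i])
            i += 1
    return "".join(out)
```

Running the two tables in sequence round-trips a string, which is how the pair of submodules stays mutually consistent.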
Bangla Text Editor : Along with the font, a Bangla
editor has been developed. It supports the ISI.ttf font.
A web version is generated using Bitstream's web-font
wizard to display text in Bangla within a web browser.
The editor provides standard editing features such as
cut, copy, paste, select, select all, file operations, bold,
italic and underlined text, superscript and subscript,
left, right and centre alignment, find and replace, etc.
This editor can save the content in plain text, RTF
and ISCII formats. A floating keyboard can be
invoked on demand for the new user to get the keying
information for writing Bangla in this editor.
Designing a Floating Keyboard : The floating
keyboard is illustrated below.
Figure 2: Standard Bangla DTP typewriter layout
• Bangla Spell-Checker
The Bangla spell-checker is a tool to detect errors in
Bangla words and correct them by providing a set
of correct alternatives that includes the intended
word. An erroneous word can belong to one of
two distinct categories, namely, non-word error and
real-word error. Let a string of characters separated
by spaces or punctuation marks be called a candidate
string. A candidate string is a valid word if it carries
a meaning. A meaningless string is a non-word. A
real-word error means a valid word that is not the
intended one in the sentence; it makes the sentence
syntactically or semantically ill-formed or incorrect.
In both cases, the problem is to detect the erroneous
word and either suggest correct alternatives or
automatically replace it by the appropriate word. In
this spell-checker, only non-word errors are handled.
Word errors can be classified into four major types,
namely, substitution, deletion, insertion and
transposition errors. In Bangla, wrong use of
characters that are phonetically similar to the correct
ones is observed. A great deal of confusion occurs
in the use of long and short vowels, aspirated and
unaspirated consonants, and dental and cerebral
nasal consonants due to phonetic similarity. Another
type of error is the typographic error, caused by an
accidental slip of the finger onto keys that are
neighbours of the intended key.
In this spell-checker, the main technique of error
detection is based on matching the candidate string
in the normal as well as in the reversed order
(the following publications may be referred). To make
the system more powerful, this approach is combined
with a phonetic-similarity-key based approach, where
phonetically similar characters are mapped into a
single symbol and a nearly-phonetic dictionary of
words is formed. Using this dictionary, phonetic
errors can be easily detected and corrected. Here a
candidate string first passes through the phonetic
dictionary. If the word is not found in the dictionary
and the system also fails to give a suggestion, it tries
to divide the word into a root part and a suffix part
by verifying both separately. If an error is found, the
spell-checker attempts to provide suggestions. If it
fails, it checks whether the string is a conjunct word
generated by appending two noun words and a suffix.
An option for adding new words permanently or
temporarily is provided in the spell-checker.
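The phonetic-similarity-key step can be illustrated with a minimal sketch. Latin letters stand in for the phonetically confusable Bangla characters (long/short vowels, aspirated/unaspirated consonants); the mapping table and word list below are invented for the illustration, not the actual dictionary.

```python
# Sketch of the nearly-phonetic dictionary idea: confusable characters are
# collapsed onto a single canonical symbol, and valid words are indexed by
# the resulting phonetic key. All mappings and words here are illustrative.

PHONETIC_MAP = {"I": "i", "U": "u", "T": "t", "D": "d"}  # confusable -> canonical

def phonetic_key(word):
    return "".join(PHONETIC_MAP.get(ch, ch) for ch in word)

LEXICON = ["dIn", "din", "pat"]          # toy valid-word list
PHONETIC_DICT = {}
for w in LEXICON:
    PHONETIC_DICT.setdefault(phonetic_key(w), []).append(w)

def suggest(candidate):
    """If the candidate is not a valid word, return the valid words that
    share its phonetic key (the phonetically similar alternatives)."""
    if candidate in LEXICON:
        return []                        # valid word, nothing to correct
    return PHONETIC_DICT.get(phonetic_key(candidate), [])
```

A non-word whose only fault is a confusable character thus lands on the same key as the intended word and is recovered directly, without edit-distance search.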
• B. B. Chaudhuri and T. Pal, "Detection of word
error position and correction using reverse word
dictionary", Intl. Conf. on Computational
Linguistics, Speech and Document Processing
(ICCLSDP'98), February 18-20, 1998, pp. C41-
• B. B. Chaudhuri, "A Novel Spell-checker for
Bangla Text Based on Reversed-Word Dictionary",
Vivek, Vol. 14(4), pp. 3-12, October 2002.
Figure 3: Standard Bangla DTP typewriter layout
with a support of a spell checker
For the spell-checker, several files containing root
words and suffix words are maintained. The main
dictionary contains about 60,000 root words and
100,000 inflected words. Noun and verb suffix files
are also used. The spell-checker works fast, and the
non-word errors are all correctly detected, but it raises
about 5% false alarms. This is mainly due to conjunct
words formed by euphony and assimilation, as well
as proper nouns in the corpus.
3. Products
Previously, Bangla and Devnagari OCR systems were
developed at the CVPR Unit of the Indian Statistical
Institute, and these core technologies were transferred
to industry for commercialization. The latest
developments are Oriya OCR and Assamese OCR.
These are described below.
3.1 OCR System for Oriya
The purpose of this system is to recognize printed
Oriya script automatically.
Summary of the System
In this recognition system, the document image is
first captured using a flatbed scanner. The image is
then passed through different preprocessing modules
such as skew correction, line segmentation, zone
detection, and word and character segmentation.
Next, individual characters are recognized using a
combination of stroke- and run-number-based
features, along with features obtained from the
concept of water overflow from a reservoir. These
techniques are discussed in greater detail below.
System Description
Text Digitization and Noise Cleaning: Text
digitization is done using a flatbed scanner (Model:
HP Scanjet 660C) at a resolution varying from
200 to 300 dots per inch (dpi). The digitized images
are in gray tone, and a histogram-based thresholding
approach is used to convert them into two-tone
images. For a clear document, the histogram shows
two reasonably prominent peaks corresponding to
the white and black regions. The threshold value is
chosen as the midpoint between the two peaks of
the histogram. The two-tone image is converted into
0-1 labels, where 1 and 0 represent object and
background, respectively. The digitized image shows
protrusions and dents in the characters, as well as
isolated black pixels over the background; these are
cleaned by a morphological smoothing approach.
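The midpoint-between-peaks rule can be sketched as follows. The sketch assumes a clear document with one peak in the dark half and one in the bright half of the histogram, as the text describes; real documents may need a more robust peak search.

```python
import numpy as np

def histogram_threshold(gray):
    """Midpoint-between-peaks binarization (a sketch of the rule above).
    gray: 2-D array of 0..255 gray values. Returns 0/1 labels,
    where 1 = object (black) and 0 = background (white)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    # Assumption: one prominent peak in each half of the intensity range.
    dark_peak = int(np.argmax(hist[:128]))
    bright_peak = 128 + int(np.argmax(hist[128:]))
    threshold = (dark_peak + bright_peak) // 2
    return (gray < threshold).astype(np.uint8)  # dark pixels become object (1)
```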
Figure 4: Uppermost and lowermost point of
components in a skewed text line
Skew Detection and Correction: When a document
is fed to the scanner, either mechanically or by a
human operator, a few degrees of skew (tilt) is
unavoidable. The skew angle is the angle that the
text lines in the digital image make with the
horizontal direction. Skew detection and correction
are important preprocessing steps of document
layout analysis and OCR approaches. Skew
correction can be achieved in two steps, namely (i)
estimation of the skew angle, and (ii) rotation of the
image by the skew angle in the opposite direction.
Here a Hough-transform based technique is used
for estimating the skew angle of Oriya documents.
It is observed that the uppermost and lowermost
points of most of the characters in an Oriya text
line lie on the mean line and base line, respectively.
The lowermost and uppermost points of characters
in a skewed Oriya text are shown in Figure 4. To
reduce the amount of data to be processed by the
Hough transform, only the uppermost and
lowermost pixels of each component are considered.
First, the connected components in a given image
are identified. For each component, its bounding
box (the minimum upright rectangle containing the
component) is defined. The mean width of the
bounding boxes, bm, is also computed. Next,
components having a bounding-box width greater
than or equal to bm are retained. By thresholding at
bm, small components like dots, punctuation marks,
small modified characters, etc., are mostly filtered
out. Because of this filtering process, the irrelevant
components cannot create errors in skew estimation.
Now, the usual Hough transform technique is used
on these points to get the skew angle of the
document. The image is then rotated according to
the detected skew angle. Font style and size variation
do not affect the proposed skew estimation method.
Also, the approach is not limited to any range of
skew angles.
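The estimation step can be sketched in the spirit of the Hough transform's (rho, theta) accumulation. This is a simplified stand-in for the method above, restricted here to a small angle range for brevity (the actual system is not so limited): for each candidate angle, the retained uppermost/lowermost points are projected onto rho bins; points that are collinear along a text line pile up in one bin, and the angle with the sharpest peak is taken as the skew.

```python
import numpy as np

def estimate_skew(points, angles_deg=None):
    """Hough-style skew estimate (a sketch). `points` are the (x, y)
    uppermost/lowermost pixels of the retained components. For each
    candidate angle a, accumulate rho = -x*sin(a) + y*cos(a) into bins;
    the angle whose histogram has the highest peak wins."""
    if angles_deg is None:
        angles_deg = np.arange(-10.0, 10.5, 0.5)   # illustrative search range
    pts = np.asarray(points, dtype=float)
    best_angle, best_peak = 0.0, -1
    for a in angles_deg:
        t = np.radians(a)
        rho = -pts[:, 0] * np.sin(t) + pts[:, 1] * np.cos(t)
        counts, _ = np.histogram(rho, bins=64)
        if counts.max() > best_peak:
            best_peak, best_angle = counts.max(), float(a)
    return best_angle
```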
Line, Word and Character Segmentation: For
convenience of recognition, the OCR system should
automatically detect individual text lines, segment
the words from the line, and then segment the
characters in each word accurately. Since Oriya text
lines can be partitioned into three zones (Figure 5),
it is convenient to distinguish these zones. Character
recognition becomes easier if the zones are
distinguished, because the lower zone contains only
modifiers and the halant marker, while the upper
zone contains modifiers and portions of some basic
characters.
Figure 5: Zones in an Oriya text line
Text Line Detection and Zone Separation: The lines
of a text block are segmented by finding the valleys
of the projection profile computed by counting the
number of black pixels in each row. The trough
between two consecutive peaks in this profile
denotes the boundary between two text lines; a text
line can be found between two consecutive
boundary lines. After line segmentation, the zones
in each line are detected. From Figure 6 it can be
seen that the upper zone is separated from the middle
zone of a text line by the mean line, and that the
middle zone is separated from the lower zone by
the base line. The uppermost and lowermost points
of the connected components in a text line are used
to detect the mean line and base line, respectively. A
set of horizontal lines passing through the uppermost
and lowermost points of the components is
considered. The horizontal line that passes through
the maximum number of uppermost points (lower-
most points) is the mean line (base line). It should
be noted that the uppermost and lowermost points
of the components were already detected during
skew detection, so these points do not have to be
recalculated during zone detection.
Figure 6 : Projection profile of rows in Oriya text
lines (dotted lines show line boundaries)
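The valley-finding step on the row profile can be sketched as follows, under the simplifying assumption that inter-line gaps are clean enough for the profile to drop to zero there (real valleys are minima and may need a threshold):

```python
import numpy as np

def segment_lines(binary):
    """Text-line segmentation from the valleys of the row projection
    profile (a sketch). `binary` is a 0/1 image (1 = black pixel).
    Returns (top_row, bottom_row) ranges of the detected lines."""
    profile = binary.sum(axis=1)          # black-pixel count per row
    lines, start = [], None
    for r, count in enumerate(profile):
        if count > 0 and start is None:
            start = r                     # entering a text line
        elif count == 0 and start is not None:
            lines.append((start, r - 1))  # valley reached: line boundary
            start = None
    if start is not None:                 # line touching the image bottom
        lines.append((start, len(profile) - 1))
    return lines
```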
Word and Character Segmentation: After a text
line is segmented, it is scanned vertically, column
by column. If a column contains two or fewer
black pixels, the scan is denoted by 0; otherwise,
the scan is denoted by the number of black pixels
in that column. In this way, a vertical projection
profile is constructed. Now, if there exists in the
profile a run of at least k consecutive 0s, then the
midpoint of that run is considered as the
boundary between two words. The value of k is
taken as 2/3 of the text-line height (the text-line
height is the normal distance between the mean line
and the base line). To segment each word into
individual characters, only the middle zone of the
word is considered. To find the boundary between
characters, the image is scanned in the vertical
direction starting from the mean line of the word.
If, during a scan, the base line is reached without
encountering any black pixel, then this scan marks
the boundary between two characters.
However, the gray-tone to two-tone conversion
of the image gives rise to some touching characters,
which cannot be segmented using this method.
To segment these touching characters, the
principle of water overflow from a reservoir is
used, which is as follows. If water is poured on
top of the character, the positions where water
will accumulate are considered as reservoirs. Figure
7 shows the location of reservoirs in a single
character as well as in a pair of touching
characters. The height of the water level in the
reservoir, the direction of water overflow from
the reservoir, the position of the reservoir with respect
to the character bounding box, etc., are noted. A
reservoir whose height is small and which lies in
the upper part of the middle zone of a line is
considered as a candidate reservoir for touching-
character segmentation. The cusp (lowermost
point) of the candidate reservoir is considered as
the separation point of the touching characters.
In Figure 7, this position is marked by a vertical
line. Because of the round shape of most of the
Oriya characters, it is observed that such a
reservoir is formed in most of the cases when two
characters touch each other. Sometimes, two or
more reservoirs may be formed; in such cases,
the reservoir closest to the middle of the bounding
box is selected for segmentation.
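The word-boundary rule can be sketched as follows. One simplification is assumed: the image height of the line stands in for the mean-line-to-base-line distance when computing k.

```python
import numpy as np

def segment_words(line_img):
    """Word segmentation from the vertical projection profile, as
    described above (a sketch). Columns with two or fewer black pixels
    count as 0; a run of at least k = 2/3 of the line height of such
    columns separates two words. Returns boundary column indices."""
    h = line_img.shape[0]              # assumption: image height ~ line height
    k = max(1, (2 * h) // 3)
    profile = line_img.sum(axis=0)
    profile[profile <= 2] = 0          # near-empty columns count as 0
    boundaries, run, seen_word = [], 0, False
    for c, v in enumerate(profile):
        if v == 0:
            run += 1
        else:
            if seen_word and run >= k:
                boundaries.append(c - run // 2)  # midpoint of the zero run
            seen_word, run = True, 0
    return boundaries
```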
Figure 7: Water reservoirs in a single and touching
Oriya characters
Feature Selection and Detection: Topological
features, stroke-based features, as well as features
obtained from the concept of water overflow, are
considered for character recognition. These
features are known as principal features. The
features are chosen with the following
considerations: (a) robustness, accuracy and
simplicity of detection, (b) speed of
computation, (c) independence of size and
fonts, and (d) the needs of tree-classifier design.
Stroke-based and topological features are
considered for the initial classification of
characters. These features are used to design a
tree classifier, where the decision at each node
of the tree is taken on the basis of the presence/
absence of a particular feature (see Figure 8).
Stroke-based features include the number and
position of vertical lines. The topological
features used include the existence of holes and
their number, the position of holes with respect to
the character bounding box, the ratio of hole
height to character height, etc. In addition, the
concept of water overflow from a reservoir is
also used: the reservoirs in a character are
identified, and the position of the reservoirs
with respect to the character bounding box, the
height of each reservoir, the direction of water
overflow, etc., are used as features in the
recognition scheme.
This system is developed in the C language on the
UNIX platform. A WINDOWS-based version is also
available. The system can run on any UNIX and
WINDOWS platform.
Figure 8: Portion of the tree classifier for Oriya
Line Segmentation: Our system identifies individual
text lines with an accuracy of 97.5%.
Word Segmentation: The overall word segmentation
accuracy of the system is 97.7%. The error rate for
the inferior documents is 4.2%, whereas for the
good-quality documents it is only 1.2% (these
figures were calculated based on correctly segmented
text lines only).
Character Segmentation: The character
segmentation accuracy of the system is 97.2%. The
proposed method for separating touching characters
based on the water reservoir concept is generally
successful.
Character Recognition: On average, the system
recognizes characters with an accuracy of about
96.3%, i.e., the overall error rate is 3.7%.
3.2 Adaptation of Bangla OCR to Assamese
We have already developed an efficient OCR
system for printed documents in Bangla. Since
Assamese and Bangla share the same script, this
OCR system can be successfully used for Assamese
documents after some modifications. The
modification is needed mainly in the post-
processing stage, where language-specific OCR
error correction is needed.
Summary of the System
The segmentation of a document image into lines,
words, and characters, and the recognition of
segmented characters, is dependent on the script
only. Thus, the modules of the existing OCR
system for Bangla can be used for Assamese OCR.
However, certain post-processing steps after basic
recognition are required in order to improve OCR
accuracy. For example, the words in the output
of the OCR system may be looked up in a lexicon;
a correctly recognized word will be present in the
lexicon, whereas an incorrectly recognized word
will usually not be found. The incorrect word
can then be replaced by the lexicon word nearest
to it. This post-processing is obviously language-
dependent. In this project, we have made the
necessary modifications in our OCR system so
that it can be used on Assamese documents.
System Description
The major components of the system are the same
as those of the Bangla system. Details of these
components can be found in the following references.
Modifications are done in the following modules :
1. Update of the Symbol List: The symbol list for
the Bangla script is updated to contain the
Assamese "ra" and "wa" and all the conjuncts
involving these two characters. Other characters
remain unaltered.
2. Formation of the Prototype Library: The
prototype library used in the Bangla OCR is
modified by adding new character shapes
found in the Assamese script. Character shapes
not appearing in the Assamese script are deleted
from the library.
Figure 9. A sample output from the Assamese OCR
Test pages are selected from three Assamese books.
Pages are scanned at 300 dpi. In total, 50 pages are
used in the testing phase. Analysis of the test results
shows a character-level accuracy of about 95%. Since
the fonts used for printing Assamese materials are
somewhat different from the fonts used in Bangla,
generation of a new prototype library considering
the major Assamese fonts would improve the overall
accuracy of the system.
Design of post-processing module: Post-processing
in Bangla OCR is done using a lexicon of the Bangla
language. The same module can be used for post-
processing for Assamese as well; however, a lexicon
of the Assamese language is needed for this purpose.
This activity has been taken up by IIT, Guwahati.
Technology Transfer
The source code of the system, along with the
technical details, has been transferred to IIT, Guwahati.
Technical Reports/Correspondence
An MCA student, Anirban Mukherjee, worked
on this project for his MCA dissertation submitted
to the Indira Gandhi National Open University
(IGNOU), New Delhi. The title of his work is
"Development of an Optical Character
Recognition System for Assamese Script".
1. B. B. Chaudhuri and U. Pal, "A complete
printed Bangla OCR system", Pattern
Recognition, vol. 31, pp. 531-549, 1998.
2. U. Garain and B. B. Chaudhuri,
"Segmentation of Touching Characters in
Printed Devnagari and Bangla Scripts using
Fuzzy Multifactorial Analysis", IEEE
Transactions on Systems, Man and Cybernetics,
Part C, Vol. 32, No. 4, pp. 449-459, 2002.
3. A. Ray Chaudhuri, A. Mandal, and B. B.
Chaudhuri, "Page Layout Analyzer for
Multilingual Indian Documents", Proc.
Language Engineering Conference, IEEE CS
Press, 2002.
3.3 Information Retrieval system for Bangla
Digital information is available in various forms,
such as text, image and speech data, or multimedia
content. Among these, text information is
considerably abundant and can be easily created.
Passage retrieval from text documents has been
gaining momentum over document retrieval for the
last several years. A document ranker returns whole
documents, from which it is often infeasible to search
and extract the necessary information. Passage
retrieval, on the other hand, returns fixed- or
variable-sized text chunks from the document(s)
where the information is likely to reside. This saves
both time and effort when searching a huge text-
document corpus.
Figure 10: A sample output from the prototype
developed. The question "what is Pythagoras
theorem" was submitted in Bangla, and the system
retrieved passages and ranked them as output.
Summary of the System
A prototype n-gram based language identifier for
identifying Indian languages has been developed
[1]. A prototype of a natural-language text indexer
for passage retrieval has been developed for Bangla.
System Description
Note that Indian languages can be grouped into
five categories based on their origins: Indo-
European (Hindi, Bangla, Marathi, etc.),
Dravidian (Tamil, Telugu, etc.), Tibeto-Burmese
(e.g., Khasi), Austro-Asiatic (Santhali, Mundari,
etc.) and Sino-Tibetan (e.g., Bhutanese).
Languages within a group share a number of
common elements. For instance, there is a
significant overlap in the vocabulary of Bangla
and the other Indo-European languages, and their
profiles are mutually closer than the profiles for a
pair of languages from two different groups. We
have tested the character-level n-gram algorithms
for language identification on a multilingual
collection of Indian-language documents. Also, a
prototype of an "English to Bangla" phonetic
transliteration scheme has been designed and
implemented for cross-lingual information retrieval.
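A common character-level n-gram identifier works by comparing rank-ordered n-gram profiles; the sketch below illustrates that idea under illustrative parameter choices (trigram profiles, top-50 truncation, out-of-place distance), which are not necessarily those of the system reported here.

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Rank-ordered character n-gram profile of a text (sizes illustrative)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def profile_distance(p1, p2):
    """Out-of-place distance: rank displacement of each gram of p1 in p2;
    grams missing from p2 incur the maximum penalty."""
    max_d = len(p2)
    return sum(abs(i - p2.index(g)) if g in p2 else max_d
               for i, g in enumerate(p1))

def identify(text, language_profiles):
    """Pick the language whose stored profile is nearest to the text's."""
    p = ngram_profile(text)
    return min(language_profiles,
               key=lambda lang: profile_distance(p, language_profiles[lang]))
```

Because languages within a group share vocabulary, their profiles sit closer to each other than to profiles from another group, which is exactly the pattern visible in the distance matrix.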
Part of the n-gram distance matrix between every
pair of Indian languages is shown in Table 3.
Table 3: The n-gram distance between some major
Indian languages
Profile Bangla Hindi Kannada Kashmiri Malayalam Telugu Urdu
Bangla 0 16.54 19.42 23.66 20.27 19.08 24.01
Hindi 16.54 0 18.40 23.65 19.74 18.58 24.12
Kannada 19.42 18.40 0 23.80 18.11 16.65 24.09
Kashmiri 23.66 23.65 23.80 0 24.02 23.88 19.54
Malayalam 20.27 19.74 18.11 24.02 0 18.07 24.29
Telugu 19.08 18.58 16.65 23.88 18.07 0 24.15
Urdu 24.01 24.12 24.09 19.54 24.29 24.15 0
A passage detection and ranking algorithm for
Bangla text has been designed and implemented.
A stop-word list contains common words that are
ignored by search engines at the time of searching;
these words generally do not carry any information.
For constructing Bangla search engines, about 500
stop words have been identified by combining
statistical and manual methods. The DoE Bangla
corpus is used for this purpose.
The indexer generates an indexed file, which keeps
each record in this fashion:
Document No., Term, Term_weight, Occurrence positions of the term
Document No.: The identity number of the
document in the whole corpus.
Term weight: The ratio of the term frequency f_t for
a particular term in a document and the document
length in bytes, N.
Occurrence positions of the term: The numerical
values of the positions of the term, counted from
the beginning of the document. The position of the
first term of the document is one. So for each term,
f_t occurrence positions are listed.
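The record format above can be sketched as follows. The stop words shown are invented stand-ins for the ~500-word Bangla list, and the tokenizer is deliberately naive.

```python
from collections import defaultdict

# Illustrative stand-ins for the actual Bangla stop-word list.
STOP_WORDS = {"ebong", "o", "kintu"}

def index_document(doc_no, text):
    """Build indexer records of the form described above (a sketch):
    (Document No., Term, Term_weight = f_t / N, occurrence positions),
    with N the document length in bytes and positions counted from 1."""
    n_bytes = len(text.encode("utf-8"))
    positions = defaultdict(list)
    for pos, term in enumerate(text.split(), start=1):
        if term not in STOP_WORDS:          # stop words carry no information
            positions[term].append(pos)
    return [(doc_no, term, len(p) / n_bytes, p) for term, p in positions.items()]
```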
A prototype passage-detection algorithm has been
developed, and a few standard passage-ranking
algorithms have also been tested.
Technical Reports/Correspondence
1. P. Majumder, M. Mitra and B. B. Chaudhuri,
"N-gram: a language independent approach to
IR and NLP", Proc. International Conference on
Universal Knowledge and Language (ICUKL-2002),
Goa, India, 2002.
3.4 Script Identification and Separation From
Indian Multi-Script Documents
India is a multi-lingual, multi-script country, where
a single document page (e.g., a passport application
form, examination question paper, money order
form, or bank account opening application form)
may contain lines in two or more scripts. For this
type of document page, there is a need to separate
the different scripts before feeding them to the
respective OCR systems. The purpose of this system
is to identify the different script regions of the
document and hence separate them.
Summary of the System
The system works in two stages. In the first stage, it
separates each line in the scanned document page;
line segmentation is based on a horizontal-profile
based technique. Secondly, based on the
distinguishing features between different Indian
scripts, it identifies the script in which each line is
actually written. The identification of a particular
script from the other scripts is mainly based on
water-reservoir-principle based features, features
based on contour tracing, profile features, etc. These
features are elaborated below. At present, the system
has an overall accuracy of about 97.52%.
System Description
Some of the major distinguishing features used to
separate the different Indian scripts in a document
page from each other are discussed below.
Horizontal projection profile: From Figure 11, it is
apparent that there is a distinct difference among
some of the scripts in terms of the horizontal
projection profile.
Figure 11: Different Indian script lines (from top
to bottom: Devnagari, Bangla, Gurmukhi,
Malayalam, Kannada, English, Tamil, Telugu,
Urdu, Kashmiri, Gujarati, Oriya) with their row-
wise maximum run (left side) and horizontal profile
(right side).
Water reservoir principle based feature: The top
(bottom) reservoir is defined as the reservoir obtained
when water is poured from the top (bottom) of the
component. (A bottom reservoir of a component is
visualized as a top reservoir when water is poured
from the top after rotating the component by 180°.)
Similarly, if water is poured from the left (right) side
of the component, the cavity regions of the
component where water will be stored are
considered as left (right) reservoirs. This is shown in
Figure 12. Here the top, bottom, left and right
reservoirs are shown for the English character X. The
water flow levels of these reservoirs are also shown
in this figure.
Figure 12: Top, bottom, left and right reservoirs are
shown for the character X. Water flow level of reservoir
is shown by dotted arrow.
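Computing a top reservoir reduces to the classic trapped-water calculation on the component's skyline; the sketch below works on a per-column top profile rather than the full contour used by the actual feature extractor.

```python
def top_reservoir_depths(top_profile):
    """Water poured from the top collects where the skyline dips between
    two higher walls. `top_profile[c]` is the height of the topmost black
    pixel in column c (measured up from the baseline). Returns the water
    depth per column: a sketch of the reservoir feature, not the full
    contour-based version."""
    n = len(top_profile)
    left, right = [0] * n, [0] * n
    m = 0
    for c in range(n):                       # highest wall to the left
        m = max(m, top_profile[c]); left[c] = m
    m = 0
    for c in range(n - 1, -1, -1):           # highest wall to the right
        m = max(m, top_profile[c]); right[c] = m
    return [min(left[c], right[c]) - top_profile[c] for c in range(n)]
```

From the depths one can read off the reservoir height (maximum depth) and its position, the quantities the classifier uses.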
In some scripts, many reservoirs may be obtained
from a particular side of the characters, whereas in
some other scripts many reservoirs may not be
obtained from that side. Thus, this feature is useful
in distinguishing between scripts.
Left and right profile: For this feature, each character
is located within a rectangular boundary, a frame.
The horizontal or vertical distances from any one
side of the frame to the character edge form a group
of parallel lines, known as the profile (Figure 12). If
the left, right and top profiles of the characters in a
text line are computed, it is observed that there are
distinct differences among some of the scripts
according to these profiles. The left and right profiles
of a Malayalam character are shown in Figure 13.
Figure 13: Left and right profile of a character is shown.
Head-line feature: If the longest horizontal run of
black pixels over the rows of a text line is computed,
this run length (known as the head-line) can be used
to distinguish between scripts with a head-line (like
Bangla) and those without one (like English).
Feature based on jump discontinuity: Here the jump
discontinuity (a relatively small white run between
two black runs) of a component, seen from a
particular side, is considered. The characters of some
scripts (like Telugu, Kannada, etc.) have prominent
jump discontinuities, and the occurrence frequency
of this particular feature is successfully used for script
identification. The jump discontinuity for a Gujarati
character is shown in Figure 14.
Figure 14: Example of jump discontinuity feature.
Based on these major features, the system identifies
the script of a particular line.
Script Identification Technique:
• The scanned gray-tone document is converted to a
two-tone image using a histogram-based automatic
thresholding approach.
• Noise removal and skew correction are
performed on this two-tone image.
• Line segmentation is performed on the refined
image.
• For each line of the image, the particular script in
which the line is actually written is identified based
on a binary tree classifier.
• Once the script of the line is identified its region
is marked by the name of the particular script.
• The marked lines are shown in the output image.
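The per-line decision can be illustrated with a toy slice of a binary tree classifier. The real tree, features and thresholds are not published here; the feature names and cut-offs below are invented solely to show the decision structure.

```python
def identify_script(line_features):
    """Toy slice of a binary tree classifier over per-line features
    (a sketch with hypothetical features and thresholds; the actual
    tree described in the text differs)."""
    if line_features["has_headline"]:
        # Head-line present: the Devnagari/Bangla group (illustrative split).
        if line_features["n_left_reservoirs"] > line_features["n_right_reservoirs"]:
            return "Bangla"
        return "Devnagari"
    if line_features["jump_discontinuity_freq"] > 0.2:
        # Prominent jump discontinuities: the Telugu/Kannada group.
        return "Telugu/Kannada"
    return "Other"
```

Each internal node tests the presence or strength of one feature, exactly as the flow diagram in Figure 15 does for the full set of eleven scripts.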
A part of the binary tree classifier is shown in
Figure 15.
Figure 15: Flow diagram of the script
identification scheme.
(Here B =Bangla, D=Devnagari, E=English,
Gu=Gurumukhi, M=Malayalam, Ta=Tamil,
Te=Telugu, G=Gujarati, Ka=Kannada,
U=Urdu and O=Oriya).
This system is developed in the C language on the
UNIX platform. A user interface is built in
VC++ 6.0 on the WINDOWS 2000 platform.
The C version and the VC++ 6.0 version of the
system can run on any UNIX and WINDOWS
platform, respectively.
The system has been tested on data taken from
various sources, such as journals, newspapers and
synthetic documents. Currently the system has an
overall accuracy of 97.52%. The output generated
by the system when run on a document page is
shown in Figure 16.
Figure 16: The output generated by the system
Papers Published
• U. Pal and B. B. Chaudhuri, "Identification of
different script lines from multi-script
documents", Image and Vision Computing, vol.
20, no. 13-14, pp. 945-954, 2002.
• U. Pal and B. B. Chaudhuri, "Script line
separation from Indian multi-script documents",
IETE Journal of Research, vol. 49, no. 1, 2003.
• U. Pal, S. Sinha and B. B. Chaudhuri, "Multiscript
Line Identification from Indian Documents", 7th
International Conference on Document Analysis
& Recognition, ICDAR 2003 (in press).
4. Research & Development
4.1 Automatic Processing of Hand-printed Table-
Form Documents
In an office environment, thousands of documents
containing tables may be handled while processing
application forms. In the Indian context, in most
cases these table-form documents contain hand-
printed text (like the customer's name, date, item
details/quantity/price, etc.) mixed with printed text
(invoice no., challan no., etc.). Once these forms
are collected from the stockists, customers, or
other business centres, computer operators
manually enter the data into the computer to
maintain an electronic version of the same. This
manual approach makes the processing time-
consuming, tedious and inefficient; hence, an
automatic approach is called for. Moreover, the
work may form the basis of handwriting recognition
for Indian languages.
Summary of the Project
This project deals with the automatic processing of
hand-printed table-form documents. It extracts the
different blocks of handwritten information from a
filled-in form. Each such block is segmented into
lines, words and characters. Identification of each
block is followed by tagging (numbering) them
according to the order of their physical placement.
In the final stage of this information-extraction
procedure, images of individual hand-printed
characters are obtained, which should be passed as
inputs to a hand-printed character recognition
system.
Figure 17: The binarized image (left) and the result
after the form segmentation (right).
Description of the Work Done
In our approach, extraction of handwritten information from a filled-in form does not use the grey values of the pixels but works on the binarized image of the input form. In the first step, horizontal profiles of object pixels help to segment different horizontal lines of information. In each such line, one or more blocks are identified by exploiting the vertical profiles within each horizontal strip. Within a block, words are identified by considering the gap between two consecutive words. Characters in a word are segmented again by considering the vertical profiles restricted to the image of the word. This information, in a block-word-character hierarchy, is stored in a three-dimensional list.
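The profile-based steps above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the 0/1 pixel convention and the minimum gap width are assumptions made for the sketch.

```python
# Sketch of profile-based form segmentation (illustrative). A binarized
# page is a 2-D list of 0/1 pixels, where 1 marks an object pixel.

def horizontal_strips(img):
    """Segment the page into horizontal lines using row profiles."""
    profile = [sum(row) for row in img]          # object pixels per row
    strips, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                            # a strip begins
        elif count == 0 and start is not None:
            strips.append((start, y - 1))        # the strip ends
            start = None
    if start is not None:
        strips.append((start, len(img) - 1))
    return strips

def vertical_blocks(img, top, bottom, min_gap=3):
    """Within one strip, find blocks separated by wide column gaps."""
    width = len(img[0])
    profile = [sum(img[y][x] for y in range(top, bottom + 1))
               for x in range(width)]
    blocks, start, last = [], None, None
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            last = x                             # rightmost ink so far
        elif start is not None and x - last >= min_gap:
            blocks.append((start, last))         # gap wide enough: close
            start = None
    if start is not None:
        blocks.append((start, last))
    return blocks
```

The same vertical-profile routine, rerun inside a block with a smaller gap threshold, yields word and then character boundaries, giving the block-word-character hierarchy described above.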
Testing of the System
We collected several hand-filled-in application forms for a job. In Figure 17, the result of our form document processing system is shown.
Technical Reports/Correspondence
An MCA student, Mr. Saikat Das, did the project work for his MCA dissertation, submitted to Indira Gandhi National Open University (IGNOU), New Delhi. The title of his work is “Recognition of Hand-written (touching) Bangla characters from Form Type Documents”.
Another student, Prasenjit Mitra, has started his project work, towards completing the DOE “B” level, to develop a system which can automatically process Indian Money Order Forms. He will take approximately another 4 months to complete the assignment.
4.2 Research and Development of Neural Network based tools for printed document (in eastern regional scripts) processing
Artificial neural network (ANN) based methods for solving various pattern recognition problems have several advantages over the conventional/classical approaches. Recently, ANN-based classification approaches have gained tremendous popularity. They are commonly used in high-accuracy systems because they perform satisfactorily in the presence of incomplete or noisy data and can learn from examples. Another advantage is the parallel nature of ANN algorithms. On the other hand, we have already developed an efficient OCR system for printed documents in Bangla. There is obvious justification to explore suitable supervised/unsupervised neural network/hybrid models for developing software tools for shape extraction of individual printed characters with a view to character classification, and this may in turn make the already developed OCR system more efficient.
Summary of the Project
For extraction of shapes of individual printed characters we may consider Self-Organizing neural network or Vector Quantization techniques. Instead of using the input character image for classification purposes, we may obtain its graph representation consisting of a few nodes and links between them. Useful topological and geometrical features may be easily obtained from such a representation, and those in turn should result in better classification accuracy.
Description of the Work Done
We used the Topology Adaptive Self-Organizing Neural Network (TASONN) model to obtain the graph representation of the input character. We considered a few structural features of this graph, describing the topology of the character, along with a hierarchical tree classifier, to classify printed Bangla characters into a few subclasses. To recognize the different characters in each of these resulting subclasses we considered several geometrical features. Final recognition is performed using a look-up table of these features.
Testing of the System
Test pages are selected from different Bangla books. Pages are scanned at 300 dpi. In total, 100 pages are used for training and another 50 pages are used in the testing phase. Analysis of the test results shows a character-level accuracy of about 98%. Error analysis indicates that, since the fonts used for printing the different books are somewhat different from each other, generation of a larger training set would improve the overall accuracy of the system.
5. The Team Members
S. K. Parui
A. K. Datta, M. Mitra
U. Pal
S. Palit
U. Bhattacharyya
U. Garain
T. Pal
N. S. Dash
A. Datta and D. Sengupta
Courtesy: Prof. B. B. Chaudhuri
Indian Statistical Institute
Computer Vision and Pattern Recognition Unit
203, Barrackpore Trunk Road
Kolkata –700035
(RCILTS for Bengali)
Tel: 00-91-33-25778086
Extn. 2852, 25781832, 25311928
Department of Computer Science & Application
Utkal University, Bhubaneswar, Orissa – 751004
Tel. : 00-91-674-2585518 / 0216 E-mail :
Website : http://
Orissa Computer Application Centre
Plot No.-N-1/ 7-D, Acharya Vihar Square
RRL Post Office, Bhubaneswar-751013
Tel. : 00-91-674-2582490/ 2582850 E-mail :
Website : http://
Resource Centre For
Indian Language Technology Solutions – Oriya
Utkal University, & OCAC, Bhubaneswar
Achievements
Utkal University, Bhubaneswar
Resource Centres (RC) for Indian Language Technology Solutions (ILTS), under the Technology Development for Indian Languages (TDIL) programme of the Ministry of Communications and Information Technology, Government of India, are established to provide a platform to disseminate knowledge to the common man through digital units. One such RC is established at Utkal University, Orissa, to handle the issues of Oriya, the official language of Orissa. For the effective implementation of this idea, several technologies, such as Image Processing, Speech Processing and Natural Language Processing, are merged together towards the development of hardware and software under the project for the service of the Oriya people. This enables them to become computer literate and thus knowledgeable. RC-ILTS-Oriya works for the development of tools like
1. Bilingual E-Dictionary (English↔Oriya)
2. Oriya Spell Checker
3. Oriya WordNet
4. Oriya Machine Translation System (English – Oriya)
5. Oriya Optical Character Recognition System
6. Oriya Text-To-Speech System
All these software products are copyrighted.
Besides this, the RC team members are also keen on the development of software like
1. Oriya Speech-To-Text System
2. Oriya Word Processor, trilingual (Oriya, English, Hindi), with spell-checking capability.
3. Sanskrit WordNet.
4. Jagannath Philosophy
1. Intelligent Document Processing (OCR) for Oriya
Document processing is needed for a high-level representation of the contents of a document, so that the content can be analyzed and understood clearly and unambiguously. Different approaches have been made for optical recognition of characters for languages like English, Chinese, Japanese and Korean. Very little effort has been made for the recognition of Indian languages. We have made an attempt at the recognition of alphabetic characters of the Oriya language using a novel technique, which helps in efficient processing of text (documents). Machine intelligence involves several aspects, among which optical recognition is a tool that can be integrated with text recognition and text-to-speech systems. To make these aspects effective, character recognition with better accuracy is needed.
The process of Optical Character Recognition of a
document image mainly involves six phases:
1. Digitization
2. Pre-processing
3. Segmentation
4. Feature Extraction
5. Classification
6. Post Processing
The digitization phase uses a scanner or a digital camera that divides the whole document into a rectangular matrix of dots, taking into consideration the change of light intensity at each dot. The matrix of dots is represented digitally as a two-dimensional array of bits. Each dot can be represented by a single bit for a b/w image (0 = black, 1 = white), while a colour image needs 24 bits per dot. The better the resolution, the better the image.
Preprocessing involves several activities which transform the scanned image into a form suitable for recognition. These activities are noise cleaning, filtering and smoothing, thinning, normalization, and skew correction. For the recognition of printed characters, segmentation plays an important role among the preprocessing activities. After the noise-cleaning phase, individual characters need to be extracted with good approximation, so that no character loses its important features during the process of extraction. So efficient segmentation algorithms should be employed, which will lead to better recognition.
During the segmentation phase, the whole image is analyzed and different logical regions in the image are separated. The logical units consist of text, which has lines, then words, then characters. Errors while isolating characters change their basic shapes, so the characters must be properly extracted to give a better and more accurate representation of the original character. Many algorithms exist for extracting characters from an image, but a problem arises when some characters are connected in the document image: two connected characters may be mistaken as a single character, and the erroneous extraction leads to misrecognition.
In the feature extraction phase, the features of individual characters are analyzed and represented in terms of their specialty or uniqueness. These features help in classification and identification of the characters. The feature set needs to be small, and the values need to be coherent for objects in the same class for better classification.
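A small, coherent feature set of the kind described above can be illustrated with zoning: the character's bounding box is divided into a grid, and the density of object pixels in each zone forms the feature vector. Zoning is a generic OCR feature chosen here for illustration; the centre's actual feature set is not specified in the text.

```python
# Illustrative zoning features for an isolated character glyph (a 2-D
# list of 0/1 pixels). A generic OCR feature set used as an example,
# not the project's own feature definition.

def zone_features(glyph, zones=3):
    """Return per-zone object-pixel densities as a small feature vector."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for zy in range(zones):
        for zx in range(zones):
            y0, y1 = zy * h // zones, (zy + 1) * h // zones
            x0, x1 = zx * w // zones, (zx + 1) * w // zones
            area = max(1, (y1 - y0) * (x1 - x0))
            ink = sum(glyph[y][x] for y in range(y0, y1)
                      for x in range(x0, x1))
            feats.append(ink / area)             # density in [0, 1]
    return feats
```

Two glyphs of the same class yield nearby vectors, which is exactly the coherence property the paragraph asks of a good feature set.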
Technology Involved
1. Literature survey done.
2. Gray tone converted to two-tone image by dynamic thresholding on intensity.
3. Skew correction implemented for easy handling of documents.
4. Lines extracted from a document using histogram analysis.
5. Individual characters extracted using region-growing and histogram analysis methods.
6. Matras extracted by region analysis.
7. Skeletonization for efficient processing, storing (less memory space allocation) and searching.
8. Connected characters handled by backward and forward chaining of an appropriate mask.
9. Features extracted from isolated characters as well as composite characters like jukta.
10. Analysis of multicoloured documents by applying Hue, Saturation and Intensity (HSI) analysis.
11. The system tested on documents from a book (Bigyandiganta) and also on some old Oriya documents.
The system is also integrated with the text-to-speech system for Oriya.
(An input image in bmp format)
(The output text in our editor)
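The gray-to-two-tone conversion by dynamic thresholding, mentioned in the list above, can be sketched with an iterative (isodata-style) threshold on intensity. The specific iteration scheme is an assumption for illustration; the project's exact method is not described in the text.

```python
# Sketch of dynamic (image-dependent) threshold selection for
# converting a gray-tone image to two-tone. Isodata-style iteration is
# an illustrative choice, not the project's documented algorithm.

def dynamic_threshold(pixels):
    """Pick an intensity threshold by iterating on the class means."""
    t = sum(pixels) / len(pixels)                # start at global mean
    while True:
        low = [p for p in pixels if p <= t]      # dark (object) class
        high = [p for p in pixels if p > t]      # light (background)
        if not low or not high:
            return t
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2
        if abs(new_t - t) < 0.5:                 # converged
            return new_t
        t = new_t

def binarize(pixels, t):
    """Two-tone output: 1 = object (dark), 0 = background (light)."""
    return [1 if p <= t else 0 for p in pixels]
```

Because the threshold is recomputed from the image's own intensity distribution, the same code adapts to pages with different ink and paper shades.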
2. Natural Language Processing (Oriya)
Natural Language Processing is the technique of engineering our language through the machine (the computer), by which we can overcome the language barrier and the difference between man and machine. With that sincere motive we have taken several steps to assist the computer system in behaving like a person when exchanging knowledge. Our efforts are converted to product form and are mentioned below, for the interest of researchers in this field and for the public interest.
In short, our products already developed or under development are: Oriya Machine Translation (OMTrans), Oriya Word Processor (OWP) with multilingual support, Oriya Morphological Analyser (OMA), ORI-Spell, the Oriya Spell Checker (OSC), Oriya Grammar Checker (OGC), Oriya Semantic Analyser (OSA), ORI-Dic, the bilingual E-Dictionary (English ↔ Oriya), OriNet (WordNet for Oriya), and SanskritNet (WordNet for Sanskrit), which are in the development phase towards a complete Natural Language Processing System.
2.1 Oriya Machine Translation System (OMTrans)
In OMTrans the source language is English and the target language is Oriya. We have developed a parser, which is an essential part of Machine Translation. Our parser is capable of parsing various types of sentences, including complex sentences such as:
I am going to school everyday for learning.
He said that Ram is a hard working boy.
After the parsing phase the real translation is done. Our translation system is capable of translating various types of simple as well as complex sentences, such as:
I am a good boy.
I am going to school everyday.
I will eat rice.
Who is the prime minister of India?
India is passing through financial crisis.
He told that Ram is a good boy.
Our system translates sentences having all types of tense structure. In addition, our system is also capable of performing the sense disambiguation task based on an N-gram model, for example:
I am going to bank for money.
I am going to bank everyday.
I am going to bank. I will deposit money
In the above examples the meaning of the underlined word (bank) is decided according to the context of the sentence. Presently it gives very good results for this type of ambiguous sentence.
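A minimal version of this context-based sense choice can be sketched as follows. For simplicity the sketch scores single context words rather than full N-grams, and the sense labels and evidence counts are invented for illustration; OMTrans' actual model and training data are not shown in the text.

```python
# Minimal sense-disambiguation sketch for the "bank" examples above.
# The evidence table is invented: in a real system such counts would
# come from an N-gram model trained on sense-tagged text.

SENSE_EVIDENCE = {
    "money":   {"finance": 9, "river": 0},
    "deposit": {"finance": 7, "river": 0},
    "water":   {"finance": 0, "river": 8},
}

def disambiguate(words, target="bank"):
    """Pick the sense of `target` best supported by the context words."""
    scores = {}
    for w in words:
        if w == target:
            continue
        for sense, count in SENSE_EVIDENCE.get(w, {}).items():
            scores[sense] = scores.get(sense, 0) + count
    return max(scores, key=scores.get) if scores else None
```

For "I am going to bank for money", the word "money" pushes the score towards the financial sense, mirroring how the context decides the translation of "bank".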
2.2 Oriya Word Processor (OWP) (Multilingual)
We have developed the OWP, which offers the OSC, OGC and OSA along with multilingual editing. The OWP provides phonetic typing for Oriya characters with a help facility (Fig-1). All other basic features are also available in the OWP, as in other word processors (Fig-2). The OWP edits OCR (Oriya Character Recognition) outputs and also the documents of other editors.
Fig 1 : Help option of OWP.
Fig 2 : Multilingual processing of OWP.
2.3 Oriya Morphological Analyser (OMA)
Indian languages are characterised by a very rich system of inflections (VIBHAKTI), derivation and compound formation, for which a standard OMA is needed to deal with any type of text in the Oriya language. A number of words are derived from a given root word by specific syntactic rules. Our OMA deals with the morphology of pronouns, number (Nominal, Verbal), numberless forms and prefixes (Fig-3). We have developed and implemented decision trees for each type of morphology, by which our OMA runs successfully. It can help all the applications involved in MT, OriNet, OSC, OGC, etc.
Fig-3 : Output of the OMA for the Oriya derived word “baHiguDika” in OriNet.
2.4 Oriya Spell Checker (OSC)
Misspelled words are successfully taken care of by our OSC. We have developed algorithms for the OSC to find more accurate suggestions for a misspelled word. The searching algorithm of the OSC is so fast that it processes 170000 Oriya words for each misspelled word. The words are indexed according to their word length in our word database for effective searching. On the basis of the misspelled word, the OSC matches the number of (i) equal characters, (ii) forward characters and (iii) backward characters to give more accurate suggested words for the misspelled word. Moreover, it also takes help from the Oriya Morphological Analyser for ascertaining mistakes in derived words. The OSC functions successfully in our Word Processor. This S/W supports both Windows-98/2000/NT and Linux. Output of the Spell Checker is shown in Fig-4.
Fig. 4 : Suggested words from the OSC for the misspelled word “bAsitAmAne”.
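The length-indexed lookup and the equal/forward/backward character matching described above can be sketched as follows. This is an illustrative reconstruction from the description, not the centre's actual code; the length spread and scoring weights are assumptions.

```python
# Sketch of OSC-style suggestion ranking: the dictionary is indexed by
# word length, and candidates are scored by counts of (i) equal,
# (ii) forward-matching and (iii) backward-matching characters.

def index_by_length(words):
    """Index the word list by word length for effective searching."""
    index = {}
    for w in words:
        index.setdefault(len(w), []).append(w)
    return index

def score(misspelt, cand):
    equal = sum(a == b for a, b in zip(misspelt, cand))   # same position
    forward = 0                                           # common prefix
    for a, b in zip(misspelt, cand):
        if a != b:
            break
        forward += 1
    backward = 0                                          # common suffix
    for a, b in zip(reversed(misspelt), reversed(cand)):
        if a != b:
            break
        backward += 1
    return equal + forward + backward

def suggest(misspelt, index, spread=1, top=3):
    """Rank candidates whose length is near the misspelt word's length."""
    cands = []
    for n in range(len(misspelt) - spread, len(misspelt) + spread + 1):
        cands.extend(index.get(n, []))
    return sorted(cands, key=lambda c: score(misspelt, c),
                  reverse=True)[:top]
```

The length index keeps each lookup to a few buckets of the word database, which is what makes per-word suggestion fast even over a large dictionary.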
2.5 Oriya Grammar Checker (OGC)
The S/W for the OGC has been developed to detect the grammatical mistakes occurring in a sentence with the help of the OMA and the e-Dictionary. It first parses the sentence according to the rules of Oriya grammar and then checks for grammatical mistakes. Presently, the OGC functions successfully in our Editor (fig.-5, fig.-6). The OGC S/W supports both Windows-98/2000/NT and Linux.
Fig.-5 : Suggestion of word sequence by the OGC in our OWP.
Fig.-6 : Detection of grammar mistakes by the OGC.
2.6 Oriya Semantic Analyser (OSA)
It deals with the understanding procedure of the sentence, for which it takes immense help from the KARAKA theory of NAVYA NYAYA philosophy, one of the most advanced epistemological traditions of India. The OSA determines the semantic position of the Subject (KARTA), Object (KARMA), etc. on the basis of the verb (KRIYA). In other words, it first links to the verb in the Verb Table (VT), and from the verb it links to the subject. Other KARAKAs are determined on the basis of these two linkages (fig.-7). It also takes help from OriNet, the OMA and the OGC for better understanding. We have worked out 100 verbs in the VT, determining their subjective, objective, locative categories, etc. with respect to these verbs. Presently, with these verbs, the OSA functions successfully in our Editor (fig.-8). The OSA S/W supports both Windows-98/2000/NT and Linux.
Fig-7:- Semantic Grammatical Model (SGM) and
Semantic Extract Model (SEM)
(Here downward arrows represent the SGM and
upward arrows along with down arrow from verb
represent the SEM where nominative case acts as
chief qualifier.)
Fig. 8 : Detection of semantics by the OSA.
2.7 E-Dictionary (Oriya↔English)
It provides the word, category, synonyms and corresponding English meaning of Oriya as well as English words. The system functions successfully over 27000 Oriya words and 10000 English words. The help option also provides the keystrokes for each Oriya character in phonetic form. The search engine handles misspelled words and gives accurate suggestions. This S/W supports both Windows-98/2000/NT and Linux. We are also working to include Hindi and Sanskrit words in this system for the benefit of the users.
Fig. 9: Overview of the Oriya Word “AkAsha.”
Fig. 10 : Overview of the English Word “delay”.
Fig. 11 : Search Engine (SE) handles misspelled
word and gives suggestion.
2.8 OriNet (WordNet for Oriya)-Online lexical
dictionary/ thesaurus.
One of the major problems in the implementation of Natural Language Processing (NLP) or Machine Translation (MT) is to develop a complete lexical database containing all types of information about words. There are difficulties in deciding what information should be stored in a lexicon, and even greater difficulties in acquiring this information in proper form. The OriNet system is designed on the basis of multiple lexical databases and tools under one consistent functional interface, in order to facilitate systems requiring syntactic, semantic and lexical information of the Oriya language. The system is divided into two independent modules. One module is developed to write the source files containing the basic lexical data; these files are taken as the input for the OriNet system. A lexicographer takes care of the major work of this module. The second module is a set of programs which accepts the source files, processes them for display to the user and also provides different interfaces for use by other applications. The system has been designed using the Object-Oriented paradigm according to the Oriya language structure, with over 1100 lexical entries, which allows flexibility, reusability and extensibility. It also provides an X-windows interface to access the data from the OriNet database as per the user’s requirement. It can be widely used in different applications like (i) Word Sense Disambiguation (WSD) in Oriya Machine Translation, (ii) the Oriya Grammar Checker (OGC) and (iii) the Oriya Semantic Checker (OSC). Moreover, it serves as a lexical resource for Oriya learners and also for expert scholars involved in NLP research. The system also includes the Oriya Morphological Analyzer (OMA), which takes care of any type of word, root or derived, and also provides syntactic information about the word. This S/W supports both Windows-98/2000/NT and Linux. Presently, we are adding more lexical entries to the source file and developing different application programs for use in a wider range.
Architecture and Output
The architecture of the OriNet system is divided into five parts (Fig.-12): the Lexical Source File (LSF), OriNet Engine (OE), OriNet Database (OD), Morphological Analyser (MA) and User Application (UA). The LSF is the collection of files sorted according to their syntactic categories, which are taken as the inputs to the OriNet system. The OE is a set of programs which compiles the lexical source files into a database format that facilitates machine retrieval of information in a proper manner; it acts as the CPU of the OriNet system. It is also used as a verification tool to ensure the syntactic integrity of the lexical files. All of the lexical files are processed together to build the final OD. The OE also provides sufficient information for a user’s query, or for another application, with the help of the MA.
At the heart of the OriNet system is the OD, which stores the data in simple ASCII format. Several easy-to-use interfaces have been devised to cope with varied user requirements, and the raw data extracted from the database are well formatted for display. The MA takes care of the syntactic analysis of any type of lexicon and also provides the root words with other grammatical information. X-windows interfaces are also designed to access the database with different types of options (Fig.-13).
Fig. 12:- Architecture of the OriNet System
Fig. 13:- Overview of the word bþmþ (good).
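The LSF → OE → OD flow described above can be sketched as follows. The pipe-separated field layout (word|category|synonyms|gloss) is an assumption made for illustration; the real lexicographer source format is not shown in the text.

```python
# Sketch of the OriNet LSF -> OE -> OD pipeline. The source-line field
# layout is hypothetical; the real lexicographer file format differs.

def compile_sources(lines):
    """OE-style compile: verify each entry, then build the database."""
    database = {}
    for n, line in enumerate(lines, 1):
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 4 or not fields[0]:
            # verification step: reject syntactically broken entries
            raise ValueError(f"bad lexical entry at line {n}: {line!r}")
        word, category, synonyms, gloss = fields
        database[word] = {
            "category": category,
            "synonyms": [s for s in synonyms.split(",") if s],
            "gloss": gloss,
        }
    return database

def lookup(database, word):
    """UA-style query against the compiled database."""
    return database.get(word)
```

Keeping verification inside the compile step, as the OE does, means a broken source file is caught before it can corrupt the final database.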
2.9 SanskritNet (Word-Net in Sanskrit)-Online
lexical dictionary/ thesaurus.
The Sanskrit language is the base for most of the Indian languages. It has links with foreign languages too. To study this language and to use it for knowledge enhancement, effective Machine Translation (MT) from one language to another is necessary, for which online lexical resources are needed. Towards making such resources we have tried to develop a WordNet for the Sanskrit language using the Navya-NyAya Philosophy and Paninian grammar. Besides Synonymy, Antonymy, Hypernymy, Hyponymy, Holonymy and Meronymy, we have also introduced Etymology and Analogy separately, as they play important roles in Navya-NyAya Philosophy; this is the specialty of this Sanskrit WordNet for a better classification of words.
Presently, we have defined and analysed 300 Sanskrit words (200 nominal words and 100 verbal words) in the SanskritNet. On the basis of the analysis and definition of these words, we are in the process of designing and implementing the SanskritNet and other allied applications as well. Here the prototype model of the SanskritNet is displayed (Fig.-14).
Fig. 14 : Output of the nominal word “sUrya” (Sun) in SanskritNet
3. Speech Processing System For Indian Languages
In the age of fast technology, where information travels at the speed of light, we still rely on feeding our text inputs typographically. The Speech Processing System is an approach to provide a speech interface between the user and the computer. Our system focuses on the Hindi, Oriya and Nepali languages at the Resource Centre.
Broadly the system is classified into two sections.
• Text To Speech (TTS)
• Speech To Text (STT)
The aim is to design a system/algorithm which works efficiently and naturally and uses as little memory as possible.
We have developed TTS for the Oriya and Hindi languages, and the TTS for the Nepali language is in progress.
3.1 Text to Speech (TTS)
As the name signifies, this system provides an interface through which a user enters a text/document, and the software reads it as naturally as a human. The basic approach followed here is first to analyse the document (language, font, etc.), then to extract words from the text and parse individual words into vowels and consonants. Then, corresponding to these vowels and consonants, existing “.wav” files (previously stored in the database) are concatenated and played.
Technologies Behind :
• Creating the wave file database.
For the creation of such a database we studied many recorded words and sentences and tried to break them into vowels and consonants by minute listening. Then we analysed those cut pieces and stored the appropriate and generalised forms in the database.
• Parsing the exact words extracted from a given sentence.
The same word in different sentences has different stress due to its position in the sentence. Appropriate hidden vowels are detected from the extracted words. For example, the word “jc¯” (SAMAYA) in Oriya is parsed as follows:
jç + @ + cç + @ + ¯çÆ + @
The format of the vowel and consonant break points is shown in Fig. 1.
Again, the same word in Hindi, i.e., “le;” (SAMAY), is parsed as follows:
l~ + v + e~ + v + ;~ + v
• Choosing the appropriate ‘.wav’ file from the database.
Considering the above example, the vowels we get after parsing may all be the same, ‘@’ (for Oriya) or ‘~’ (for Hindi), but it is not exactly the same ‘a.wav’ that we concatenate in every case. Thus, we analyse vowels broadly in three categories of ma:tra: position:
• Beginning
• Middle
• End
It is observed that the durations of the ma:tra:s, say ‘@’ here, vary from each other; i.e., ‘@’ in the middle is not the same as that at the end. Accordingly, we need to get the appropriate “.wav” files from the database.
As observed in the example
i) The durations of ‘@’(Oriya) are as follows:
Starting ma:tra: - 0.065 sec
Middle ma:tra: - 0.105 sec
End ma:tra: - 0.116 sec
ii) The durations of ‘~’(Hindi) are as follows :
Starting ma:tra: - 0.063 sec
Middle ma:tra: - 0.078 sec
End ma:tra: - 0.020 sec
• It is observed that the concatenation of the wave files is not as natural as expected. This is due to certain transitions between the characters in actual pronunciation. Thus we are developing a robust algorithm for the generation of naturalness in the TTS output.
# Helpful for the blind and the illiterate.
# Airways and Railways announcement systems.
# Initial phase of Speech-to-Speech conversion.
3.2 Speech to Text (STT)
Speech processing is a pattern recognition problem. The recognition of speech is defined as an activity whereby a speech sample is attributed to a person on the basis of its phonetic-acoustic or perceptual properties. In our approach we study the nature of words spoken by different speakers. From a continuous sentence the word boundaries are detected, and the nature of utterance of individual consonants and vowels is marked to study their behaviour for a particular speaker. We use word-spotting techniques, and the variations of pitch and intonation are marked. The algorithm makes use of the properties of F0 contours, such as declination tendency, resetting and fall-rise patterns in the utterance. Once the patterns are obtained, each is mapped to a particular character and the proper text output is obtained. The boundary and parameter detection is shown in Fig. 2. After obtaining the parameters of an uttered word, they are standardised to maintain the speech database. We also provide some tolerance in each parameter, so that a little variation in the utterance does not affect the recognition of the word for text conversion. From each word the characters are obtained, and when a speaker utters, the corresponding character is mapped to the text and the output is obtained.
wah kitna sunder hai !
वह कितना सुन्दर है !
The stress on the vowels (rising and falling) is shown in the pitch curve. The spectrograms of the uttered words show the content frequency and intensity of the utterance. The word boundary is detected by formant analysis, though F0 plays a major role. The cepstral component of the continuous speech puts major emphasis on the utterance. Hence, taking the word boundary into consideration, the speech database shows around 73% accuracy for Hindi and 90% for Oriya speech.
# Speech password for security in the banking sector.
# Forensic experts’ needs for speech recognition & identification of criminals.
# For the blind and illiterate, provides an intuitive interface with the machine.
# Final phase of Speech-to-Speech conversion.
jç @ (starting ma:tra:)
cç @ (middle ma:tra:)
¯ç Æ @ (end ma:tra:)
Fig. 1
wa ha ki t na sun da r h ai
Fig. 2
4. The Team Members
Khyamanidhi Sahoo
Hemanta Kumar Behera
Pravat Chandra Satapathy
Suman Bhattacharya
Ajaya Kumar Senapati
Pravat Kumar Santi
Gour Prasad Rout
Kabi Prasad Samal
Krushna Pada Das Adhikary
Courtesy : Prof. (Ms) Sanghamitra Mohanty
Utkal University
Department of Computer Science & Application
Vani Vihar, Bhubaneshwar – 751 004
(RCILTS for Oriya)
Tel: 00-91-674-2585518, 254086
OCAC, Bhubaneswar
1. Products Developed at OCAC and the Target Dates of their Commercialization
1.1 Oriya Spell Checker
The Oriya Spellchecker has been developed on the Linux platform. It consists of 60,000 base words. These base words can be manipulated and stored in the dictionary in a scientific manner. Root words can also be manipulated. This spellchecker is incorporated in our Oriya word processor, Sulekha. This project is developed for the smooth use of novice users.
Here we concentrate on suggestions and the checking of words. The checker checks the pattern string and suggests to the user the more accurate and required words. Unlike English, Indian languages are different, and complex, in their nature; for that we have designed a checker that checks the pattern string. Words are stored inside the dictionary file line by line; each line is terminated by a newline character. All these Oriya words are stored inside a file having the extension .txt.
The spellchecker provides several facilities like Prefix, Suffix, Add, Ignore, Change, Spellcheck, Suggestion and Cancel. Suggestions are accurate and match the closest words to the user’s requirement. It checks many words within a fraction of a second. Searching time is minimal and complexity is low. The spellchecker can be incorporated in any application; for example, it can be incorporated in a word processor, an editor or a browser. It uses ISCII data stored in a file and ISFOC for display: it gets the input data and displays it in ISFOC.
The Windows version has been completed and commercialized, and is incorporated in LEAP Office of CDAC. The Linux version is completed.
1.2 Thesaurus in Oriya
The Oriya Thesaurus has been developed on the Linux platform. It consists of 40,000 base words. These base words can be manipulated and stored in the dictionary in a scientific manner. Root words can also be manipulated. This thesaurus is incorporated in our Oriya word processor, Sulekha. This project is developed for the smooth use of novice users. It is a tool for correct documentation. Words are stored inside the dictionary file line by line; each line is terminated by a newline character. All these Oriya words are stored inside a file having the extension .txt.
The thesaurus provides facilities like Thesaurus and Replace. Suggestions are accurate and match the closest word to the user’s requirement. It checks many words within a fraction of a second. Searching time is minimal and complexity is low. The thesaurus can be incorporated in any application; for example, it can be incorporated in a word processor, an editor or a browser.
1.3 Bilingual Electronic Lexicon
The dictionary for administration and official correspondence has been entered for Oriya-English and English-Oriya. It is proposed to bring out an Electronic Dictionary as a commercial product, which will have features such as a phonetic keyboard and user-friendly functions for “add”, “delete”, “view”, etc. After the availability of text-to-speech technology in Oriya, the same will be integrated into this product to provide help with the correct pronunciation of words. Lexical inputs from other domains such as culture, business and science are being collected.
Words have been entered; as a product with more words, it will be completed in June 2003.
1.4 Corpus in Oriya
A large corpus has been developed, consisting of nearly 8 million words, for analysis and use in the spell checker and thesaurus. Corpus creation has been taken up as a continuing activity as part of this project. The corpus will be used for research and in other products and is not for commercialization. (completed)
1.5 Bilingual Chat Server
Oriya Chat is a standalone chat application in the Oriya language developed in Java. After connecting to the server, one can chat in one's favorite room, send direct messages with emotion icons to other online users, and send public messages.
Features include: user login to the chat server with a unique ID; connection to the server with or without a firewall/proxy; a phonetic keyboard layout for writing in Oriya; sending Oriya text with emoticons (images); private and public chat with friends; and domain-specific predefined public rooms.
System Description
Oriya Chat has a client and a server module. ServerSocket and SocksSocket are used for communicating with the remote machine. SocksSocket is a custom socket that extends Socket for use with a firewall or proxy.
1.6 Net Education
A Net Education System (NES) is being developed using the Bilingual Chat Server and Mail Server. Educational content is being created to provide education to students and the public on various subjects through the net.
The Chat Server with its interface has been completed; a full-fledged system with content will be completed.
1.7 XML Document Creation and Manipulation
XML document-creation technology in English and the Oriya language has been developed. It incorporates user-defined tags and structures the document like a database, giving real-world meaning to the content. This allows the same document to be presented with different style sheets, and the same style sheet to be applied to different documents. Queries can also be processed on the contents of an XML document. Content on tourism has been created in XML, and agricultural content useful for farmers is being created.
A few documents have been completed; more work is in progress, to be completed by August 2003.
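As a sketch of the idea, the fragment below (Python standard library; the tag names and tourism entries are invented for illustration, not the Centre's actual content) shows how user-defined tags let the same content be queried like a database:

```python
import xml.etree.ElementTree as ET

# Hypothetical user-defined tags for tourism content.
doc = ET.fromstring(
    "<tourism>"
    "<site><name>Konark</name><district>Puri</district></site>"
    "<site><name>Chilika</name><district>Khurda</district></site>"
    "</tourism>"
)

# Because the tags give the content a database-like structure,
# it can be queried directly:
names = [site.findtext("name") for site in doc.findall("site")]
puri_sites = [s.findtext("name") for s in doc.findall("site")
              if s.findtext("district") == "Puri"]
```

The same element tree could equally be rendered with different style sheets, which is the flexibility the paragraph above describes.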
1.8 Oriya OCR
OCAC is collaborating with ISI Calcutta to develop and commercialize the Oriya OCR. All the language inputs have been provided to ISI Calcutta, and the software developed there is being tested by OCAC. OCAC will commercialize the Oriya OCR with know-how from ISI Calcutta. Work in collaboration with ISI Calcutta is in progress and will be completed in June 2003.
1.9 Oriya E-Mail
A majority of the Indian population cannot speak, read or write the English language. Oriya-language tools help the common man benefit from e-governance and other applications of Information Technology. In order to bridge the digital divide it is necessary to provide Internet applications such as e-mail and chat in the Oriya language for use by the common man.
Varta is an e-mail solution that enables people to send and receive mail in the Oriya language. It uses a dynamic version of the OR-TTSarala font, created using Microsoft WEFT-III. The dynamic font allows the client to read and compose mail without downloading the Oriya font. The system has been developed using ASP technology, Java, JavaScript and HTML. Varta uses the default SMTP server on Internet Information Server 5.0 of Windows 2000 Server.
Most basic e-mail features are available in Varta 0.2, including: sending, replying to, forwarding and reading Oriya mail using the dynamic font; online registration of a unique user mail account; password security with online help to retrieve a forgotten password; phonetic keyboard typing in Oriya; and online typing help through a keyboard map to dynamically select the appropriate keys.
Phonetic Keyboard Engine
The Oriya keyboard has been designed as per the Indian phonetic standard. It is implemented as a Java applet and compiled with the Java 1.0 JDK so that it runs in MS Internet Explorer 5.0 onwards.
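The essence of a phonetic keyboard is a sound-based mapping from Latin keystrokes to Oriya letters. The sketch below is a minimal illustration in Python; the actual applet emits font glyphs rather than Unicode, and the three mappings shown are merely examples (U+0B15 KA, U+0B16 KHA, U+0B05 A):

```python
# Illustrative phonetic mapping: Latin keystroke sequences -> Oriya letters.
PHONETIC_MAP = {"k": "\u0b15", "kh": "\u0b16", "a": "\u0b05"}

def to_oriya(keys):
    """Convert a keystroke string, preferring the longer two-key match."""
    out, i = [], 0
    while i < len(keys):
        pair = keys[i:i + 2]
        if pair in PHONETIC_MAP and len(pair) == 2:
            out.append(PHONETIC_MAP[pair]); i += 2
        else:
            out.append(PHONETIC_MAP.get(keys[i], keys[i])); i += 1
    return "".join(out)

text = to_oriya("kha")   # "kh" matches before "k"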
1.10 Oriya Word Processor with Spell Checker
under LINUX
For the first time, an Oriya word processor has been developed on the Linux operating system. It runs on any X Window System. Two keyboard engines (an Inscript keyboard engine and a Phonetic keyboard engine) have been incorporated for ease of typing. All text is displayed in glyph code; for display of a stored ISCII file, a converter engine automatically generates the glyph-code equivalent of the file.
Most of the basic editing features available in a standard word processor have been incorporated, including: opening a new file, saving a file and closing a window; cut, copy and paste; single-level undo and redo; keyboard engines for typing in the Inscript and Phonetic layouts; ISCII-to-glyph-code and glyph-code-to-ISCII conversion; storing the content of a file in plain text as well as in ISCII format; and other features such as Find, Find and Replace, Go To Line and File Information. A spell checker is incorporated within the word processor.
Keyboard Engine
Two keyboard engines have been designed: one according to the Indian script (Inscript) standard and another according to the phonetic standard. All the consonants, vowels, conjuncts and special symbols are displayed in glyph-code format. Invalid word combinations according to Oriya grammar are disallowed. Pressing the Scroll Lock key activates the keyboard engine.
Converter Engine
An Export File option is provided to convert 8-bit glyph code to 8-bit ISCII code; the 8-bit ISCII code can also be converted back to 8-bit glyph code.
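A converter of this kind is essentially a table-driven byte-for-byte mapping. The sketch below illustrates the shape of such a converter in Python; the byte values are placeholders, not the real glyph or ISCII code assignments:

```python
# Placeholder mappings between 8-bit glyph codes and 8-bit ISCII codes.
GLYPH_TO_ISCII = {0xA1: 0xB3, 0xA2: 0xB4}
ISCII_TO_GLYPH = {v: k for k, v in GLYPH_TO_ISCII.items()}

def convert(data: bytes, table: dict) -> bytes:
    """Map each byte through the table; unmapped bytes pass through."""
    return bytes(table.get(b, b) for b in data)

iscii = convert(bytes([0xA1, 0xA2, 0x20]), GLYPH_TO_ISCII)
back = convert(iscii, ISCII_TO_GLYPH)
```

Because the two tables are exact inverses, a round trip recovers the original glyph-code bytes, which is what the Export and import paths above rely on.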
1.11 Computer Based Training (in Oriya)
It is well known that multimedia CDs are a highly effective medium of learning in many disciplines. The integration of text with hypertext links, speech, graphics, animation and video in an interactive mode makes lessons easy to visualize and understand for self-learning. Computer-based tutorial CDs for school children have been developed to supplement classroom teaching with an interactive self-learning programme. They provide informative content with animated illustrations and an interactive quiz for evaluation.
This project is aimed at the development of tutorial CDs in the Oriya language based on the school curriculum. The content is based on textbooks prescribed by the Board of Secondary Education, Govt. of Orissa. The lessons are illustrated with text, speech and animation, with the facility to navigate through the chapters using hyperlinks. At the end of a chapter, an online quiz helps the student in self-evaluation. It is based on the Windows 98 operating system; the tools used are Macromedia Authorware 6, Macromedia Flash 5.0, Adobe PageMaker 6.5 and Adobe Illustrator 9.0. The minimum system requirements for installation are a Pentium II or higher, 128 MB or more RAM, and Internet Explorer 5.5 or later.
Two CDs have been completed; a write-up with colour brochure is enclosed. Work on more subjects is in progress.
1.12 Oriya Language Based E-Governance
A number of software modules have been developed to enable common citizens to access government information and services in the Oriya and English languages from kiosks. A citizen can submit a grievance petition, apply for one or more certificates (such as birth, death, caste and income), download application forms, and find the rates of consumer commodities. These modules have been used in a pilot project on improving citizens' access to information in Kalahandi district of Orissa.
Development of a number of applications has been completed and integrated with the project on Citizens' Access to Information.
2. TOT done for various projects so far
1. Spell-checking grammar rules for the Oriya language have been transferred to CDAC.
2. Documents on Oriya language standards and principles have been prepared for UNICODE and sent to MIT.
3. The draft National Language Design Guide for Oriya has been prepared and submitted to MIT.
4. The Oriya script with grapheme details has been provided to ISI, Calcutta for the Oriya OCR.
3. Training programmes run for the officials for
the State Govt.
OCAC had earlier trained 70 persons from various government offices in Oriya-language word processing using LEAP Office, in which the Oriya spell checker has been incorporated. The last formal training course, "Word Processing in Oriya Language using Leap Office and ISM", was conducted during January 17 - February 12, 2002, and was attended by only four officials. Subsequently, a course was organized for 106 staff of the Board of Revenue during 29-07-02 to 3-08-02 (first batch) and 5-08-02 to 10-08-02 (second batch). So far, 180 officers have been trained. We are developing course material on a CD for training in the Oriya language.
4. Core Activities
4.1 Web hosting of Oriya Language Classics
Three well-known Oriya classics:
(1) "Chha Mana Atha Guntha" by Fakir Mohan,
(2) "Chilika" by Radhanath Ray, and
(3) "Tapaswini" by Gangadhar Meher
have been hosted on the Internet at http:/ /
4.2 Hosting of Web Sites of Govt. Colleges in Orissa
Websites have been hosted for Khallikote College, Berhampur (www.khallikotecollege.utkal.ernet.in) and Ravenshaw College, Cuttack (www.ravenshawcollege.utkal.ernet.in). The pages include information on the colleges as well as forms for student feedback. More dynamic information will be supported by these websites in future, and the work will be extended to many more colleges in Orissa.
5. Products proposed to be developed
1. CBT for all classes (std. 4 to 10)
2. OCR
3. Upgradation of spell checker
4. Grammar module development
5. Machine Translation System (Oriya –English)
6. Net Education System in Oriya Language
6. The Team Members
Biswa Ranjan Sahoo
Sambit Kumar Sahu
Girija Shankar Sarangi
Sunil Kumar Panda
Sarita Das
Subhendu Kumar Mohanty
Sanjay Kumar Dey
Courtesy : Shri S.K. Tripathi
Orissa Computer Application Centre
OCAC Building, Plot No. 1/7-D,
Acharya Vihar Square, RP-O,
Bhubaneswar – 751 013
(RCILTS for Oriya)
Tel: 00-91-674-2582484, 2582490,
2585851, 2554230 (R)
Indian Institute of Technology, Guwahati
Department of Computer Science & Engineering, Panbazar
North Guwahati, Assam-781031 India
Tel. : 00-91-361-2691086 E-mail :
Website : http:/ / rcilts
Resource Centre For
Indian Language Technology Solutions – Assamese & Manipuri
Indian Institute of Technology, Guwahati
Achievements
RCILTS-Assamese & Manipuri
Indian Institute of Technology, Guwahati
Introduction
Conceived in the millennium year, the Resource Centre for Indian Language Technology Solutions at the Indian Institute of Technology Guwahati is a project funded by the Ministry of Information Technology of the Government of India. The main objective of the Centre is to make electronic information available in native languages, mainly Assamese and Manipuri, thereby aiding the dissemination of information to the larger masses. This Centre is one among several in the nation and is equipped with modern systems and language-related software. Four investigators from the Indian Institute of Technology Guwahati and one collaborator from Guwahati University are presently involved in research in the areas of Natural Language Processing, Speech Recognition and Optical Character Recognition systems. The Centre is equipped with modern equipment, including several Pentium-III and Pentium-IV machines, a Linux server and a Windows 2000 server, a scanner, a web camera, printers, a digital camera and audio recording systems. Eight project personnel, five technical staff and three linguists, working in joint collaboration with Guwahati University, man the Centre.
The Centre has developed various language-related tools and technologies over the last two-odd years. Details of the same follow:
1. Knowledge Resources
1.1 Corpora : A systematic collection of speech or writing in a language, or in a variety of a language, forms a language corpus. The two corpora created by the Centre are:
• Assamese Corpora : Seven Assamese novels have been transformed into electronic form to form the base corpus. The salient features are:
1. Number of words : 6,00,000
2. Fonts used : AS-TTDurga (a C-DAC font) and Geetanjalilight (a popular font used in DTP)
3. Encoding standard : ISCII
• Manipuri Corpora : The Manipuri corpora were received from the Ministry of Information Technology, Govt. of India. Corpus creation started at Manipur University under the aegis of Prof. M. S. Ningomba and Dr. N. Pramodini. The corpora were in LP1 format, which is compatible with Leap Office. The total word count comes to around 3,60,000. The fonts used in the creation of this corpus are BN-TTDurga, BN-TTBidisha and AS-TTBidisha. Further investigation to make it compatible with existing systems is in progress.
1.2 Dictionaries : Derived from the Latin words dictio (the act of speaking) and dictionarius (a collection of words), a dictionary is a reference book that provides lists of words in order along with their meanings. Dictionaries may also provide information about the grammatical forms, syntactic variations, pronunciation, variations in spelling, etymology, etc. of a word. The Centre has developed e-dictionaries, the details of which are shown in Table 1.
Dictionaries Root Words Total Words
English-Assamese 5000 49,627
Assamese-English 2000 9,015
English-Manipuri 2500 36,778
Manipuri-English 3000 7,228
Table 1. Dictionary details
Each entry in the dictionary contains information on: i) the grammatical form, ii) meanings, iii) synonyms, iv) antonyms, v) pronunciation of the word (in the form of sound files), vi) transliteration in English, vii) Soundex code and viii) semantic category. More words are being added.
a) Provides information about a word.
b) These dictionaries have been structured to support the spell checker and machine translation systems being developed at the Centre.
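One way to picture such a structured entry is as a record carrying the eight fields listed above. The sketch below is illustrative only; the field values (including the Soundex code) are placeholders, not taken from the Centre's dictionaries:

```python
# Hypothetical e-dictionary entry with the eight fields described above.
entry = {
    "word": "ghar",                      # transliterated headword
    "grammatical_form": "noun",
    "meanings": ["house", "home"],
    "synonyms": ["griha"],
    "antonyms": [],
    "pronunciation_file": "ghar.wav",    # sound file reference
    "transliteration": "ghar",
    "soundex": "G600",                   # placeholder code
    "semantic_category": "dwelling",
}

# A spell checker can index entries by Soundex; an MT system by headword.
by_soundex = {entry["soundex"]: [entry["word"]]}
```

Structuring entries this way is what lets one lexicon serve both the spell checker (via the Soundex index) and the machine translation system (via the meanings).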
1.3 Design Guides
A design guide gives a brief overview of a particular language. The main topics covered by the design guide are the consonants, vowels, conjuncts and matras which form the character set of the language, and the numerals, punctuation marks, month names, weekday names, time zones, currency, and weights and measurements used in that particular language. Additionally, some linguistic information such as phonological features, grammatical features, the history of the language and the geographical description of the state where the particular language is spoken is also included. The Resource Centre for Indian Language Technology Solutions, Indian Institute of Technology Guwahati, has produced design guides for Assamese and Manipuri.
1.4 Phonetic Guides
Pronunciation is an important part of language learning, for which a phonetic guide is provided on the website. This guide describes how a particular letter should be pronounced. The International Phonetic Alphabet (IPA) is used in the dictionaries as well as in the descriptions of the languages of North-East India available on the website.
2. Knowledge Tools
2.1 The RCILTS, IIT Guwahati Website (http://www.iitg.ernet.in/rcilts)
As part of its core objectives, the Centre hosts a website that offers a wide variety of information, ranging from the Assamese and Manipuri languages to geographical and cultural issues.
Information hosted may be categorized as:
(i) Linguistic Information : The Northeast is known for its large diversity of languages. Presently the website holds information on sixty-five of the existing languages.
(a) Northeastern Languages : The website contains information on sixty-five North-Eastern languages, such as Assamese, Chokri, Chakesang, Zeliang, Pochuri, Lotha, Sangtam, Deuri, Dimasa, Kokborok, Tintekiya, Koch, Hurso, Miji, Chang, Khiamngamn, Konyak, Nocte, Phom, Tangsa, Wanchoo, Hmar, Karbi, Kuki, Lakher, Manipuri, Mizo, Riang, Khasi, Monpa, Takpa, Tsangla, Sherdukpen, Sulung (Puirot), Adi, Apatani, Bori, Mishmi, Mishing, Nishi, Tagin and Liangmai, along with their classification.
(b) Linguistic Map : Preliminary work commenced with the compilation and design of the Linguistic Map of the Northeast. The map came into being after acquiring published and unpublished material on the languages of the Northeast, together with inputs from the Census Department, surveyors, linguistic fieldwork and the findings of Dr. Dipankar Moral. The website presently has a coloured, updated version of this map. The size of the font connotes the density of speakers of a particular language in that area.
Figure 1. Resource Centre's Home page
(ii) On-line Dictionary : A transliterated Assamese dictionary has been put up on the website. It carries information on the meanings (in Assamese and English), grammatical categories and pronunciation.
(iii) Geographic and Demographic Information : Both geographic and demographic information is made available for each of the seven Northeastern states: Assam, Meghalaya, Manipur, Nagaland, Mizoram, Tripura and Arunachal Pradesh.
(iv) Guides : Guides have been put up on the website to help users understand language-related issues. The two guides currently available are:
(a) Design Guides : Design guides for both Asamiya and Manipuri, giving a general idea of the respective language together with certain frequently used sentences, have been hosted to help a novice get a basic idea of the language.
(b) Phonetic Guides : Pronunciation is an integral part of the language-learning mechanism. A phonetic guide describes how a letter should be pronounced. The International Phonetic Alphabet (IPA) has been used in the dictionaries as well as in the descriptions of the languages of North-East India to convey the manner in which words are pronounced; the phonetic guide allows the user to interpret the IPA symbols correctly. Figure 2 depicts the manner in which IPA symbols can be correlated with normal orthography, especially for the pronunciation of vowels.
IPA  English Word  Phonetic Representation
i    pin           pin
e    pen           pen
ε    pat           pæt (in the production of vowels the upper surface of the tongue is always convex, hence the resulting tongue height; in this context it may be mentioned that Asamiya /ε/ is slightly higher than English /æ/ as in /pæt/)
a    farm          fam
u    too           tu
ʊ    put           pʊt
ɔ    saw           sɔ
ɒ    got           gɒt
Figure 2. Vowel IPA chart
(v) Web-based Dictionary : A transliterated Assamese dictionary has been put up on the website. This is possibly the first contemporary standard on-line Assamese dictionary. Moreover, it is designed so as to be helpful to the non-native speaker. It provides information on a) the English word, b) grammatical category, c) meaning, d) pronunciation and e) the Assamese meaning.
(vi) Bookmarks : Links to various newspapers of Assam have been provided, and a Manipuri newsletter titled "Manika" has been put up. Further links to various tourism-related sites of Assam and Manipur, news portals and the other Resource Centres are also available on the website.
2.2 Fonts : The Centre has developed a True Type font, "Asamiya", using the Fontographer software. This font can be used for typing Assamese text in Microsoft Word. Fine-tuning of the existing font set has been done: the spacing and matras have been adjusted. A Manipuri True Type font has also been created.
Figure 3. Snapshot of the Assamese font "Asamiya"
2.3 Spell Checker
A spell checker is a vital ingredient of a word-processing environment. The basic tasks performed by a spell checker are comparing the words in a document with those in a lexicon of correct entries and suggesting corrections when required. The two commonly used methods for detecting non-words are dictionary look-up and N-gram analysis. Isolated-word error correction is achieved by the minimum edit distance technique, the similarity key technique, rule-based methods, or N-gram, probabilistic and neural-net techniques. Context-dependent error correction methods usually employ Natural Language Processing (NLP) and statistical language processing to correct real-word errors.
Spell checking strategy for Assamese
The development of a spell checker for Assamese has been undertaken at the Resource Centre, and code has been developed in Perl. The spell checker exists both as separate modules for error detection and correction and as a stand-alone system in which all the spell-checking routines are integrated. The strategy used is described below.
Non-Word Detection
Non-words are detected by looking up document words in a dictionary of valid words. The dictionary used is actually a word list: 5,000 distinct Assamese words were extracted from the English-to-Assamese online dictionary developed by the Centre, and the rest were taken from a corpus of about 67,000 words, so the look-up dictionary contains around 72,000 words. A hash table has been used as the lexical look-up data structure; the performance of a dictionary backed by a hash table is quite adequate, even for complex access patterns. Perl's efficient built-in hash implementation is used in the non-word detection module.
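The look-up itself is simple once the word list is in a hash-based structure. The sketch below shows the idea in Python (a set plays the role of Perl's hash; the transliterated words are illustrative, and the real lexicon holds around 72,000 entries):

```python
# Stand-in lexicon; the real word list has ~72,000 Assamese entries.
lexicon = {"ghar", "pani", "mati"}

def non_words(tokens, lexicon):
    """Flag every token absent from the lexicon as a possible non-word."""
    return [t for t in tokens if t not in lexicon]

flagged = non_words(["ghar", "panni", "mati"], lexicon)
```

Each membership test is O(1) on average, so a whole document can be checked in a single linear pass.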
Isolated-Word Error Correction
A three-pronged strategy has been used for generating suggestions, comprising the Soundex, edit-distance and morphological processing methods.
Soundex Method
This method maps every word to a key (code) so that similarly spelled words have similar keys. A Soundex encoding scheme for Assamese has been designed, based on the encoding scheme for English; it comprises a set of rules for encoding words and 14 numerical codes. The Soundex code of the misspelt word is computed, and the dictionary is searched for words with similar codes. An example is given below (Table 2).
Table 2. Assamese words and their Soundex codes (the Assamese-script examples, glossed "affection" and "would have analyzed", could not be reproduced in this text version)
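The mechanics can be illustrated with a toy Soundex-style encoder. The grouping below is invented for illustration over Latin letters; it is not the Centre's 14-code Assamese scheme:

```python
# Toy consonant groups: similarly pronounced letters share a digit.
GROUPS = {"b": "1", "p": "1", "f": "1", "d": "2", "t": "2",
          "k": "3", "g": "3", "s": "4", "x": "4"}

def soundex(word):
    """Keep the first letter, then append one digit per consonant group,
    collapsing adjacent repeats; vowels are skipped."""
    key = word[0]
    for ch in word[1:]:
        code = GROUPS.get(ch, "")
        if code and not key.endswith(code):
            key += code
    return key
```

A misspelling such as "pad" for "pat" receives the same key, so the dictionary word surfaces as a suggestion when keys are compared.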
Edit-Distance Method
Candidate suggestions are obtained by Damerau's error-reversal method. The four well-known error-inducing edit actions are the insertion of a superfluous letter, the deletion of a letter, the transposition of two adjacent letters, and the substitution of one letter for another. In Damerau's error reversal, each edit action is applied to the misspelt string to generate a set of strings; these are then checked in the dictionary to see which of them are valid words, which finally produces the suggestions.
Morphological Processing
Morphological analysis is performed to extract the root word from the misspelt word, together with a list of valid affixes that can be attached to that root. By attaching the affixes that most closely match those of the misspelt word to the root word, a list of suggestions is generated.
A module for ranking the suggestions, based on minimum edit-distance methods, has also been developed.
A run-time snapshot of the Assamese spell checker is shown in Figure 4.
Features of The Stand-Alone Spell Checker
• GUI developed using Perl/Tk, with simple text-editing facilities.
• Currently supports C-DAC's AS-TTDurga font.
• Assamese text files can be loaded into the GUI, edited, and saved.
• Misspellings can be marked by clicking the 'Det' ('Detect') and then the 'Show' button.
• Selecting a misspelt word and clicking the 'Sug' ('Suggest') button generates suggestions.
• A facility to add new words to the dictionary is provided.
• A 'Select All/Unselect All' option is available in the 'Edit' menu on right-clicking the text, for selecting/deselecting the entire text.
• 'Copy'/'Cut' and 'Paste' operations are possible for text within the GUI window.
• Text can be copied/cut from an I-Leap document and pasted into the GUI window.
Figure 4. The Assamese spell checker.
The Soundex encoding scheme for Assamese has since been refined into a more fine-grained one and now comprises 21 numerical codes. Added functionality has been incorporated into the Soundex code generator for handling matras attached to consonants and conjuncts in the first-letter position, and for khandata/chandrabindu/anuswar/bisarga attached to consonants, vowels and conjuncts.
The edit-distance module now accounts for 2,234 distinct letter combinations (matras attached to consonants/conjuncts, and khandata/chandrabindu/anuswar/bisarga attached to consonants/vowels/conjuncts).
Tests conducted against a corpus of about 67,000 words reveal that the edit-distance method gives the best results, followed closely by the Soundex method. A non-word detection module for Manipuri has also been developed, and work has been undertaken to integrate a spell-checking facility for Assamese into the Microsoft Word environment.
2.4 Assamese Language Support for Microsoft Word : Word processors like Leap Office provide the facility of document typing in Indian languages; content creation can be done with these editors using the Inscript keyboard layout they provide. The Centre has developed a macro that allows the user to type Assamese text in Microsoft Word without needing a separate Indian-language editor.
Technology Description
This Microsoft Word macro maps the inputs to the appropriate glyphs. The macro supports the Inscript keyboard layout, and typing can be done using the "Asamiya" font developed by the Centre. The use of the Inscript keyboard layout facilitates smooth migration from C-DAC to Microsoft technologies.
• Inscript keyboard layout
• Template macro
• Supports the "Asamiya" font and C-DAC Assamese fonts (As-ttDurga, As-ttBidisha, As-ttDevashish, As-ttKali, As-ttAbhijit).
• All the features of MS Word (such as justification, different font styles, different font sizes, etc.) can be used.
• Documents are stored in glyph form.
Figure 5. Snapshot of Assamese macro with
different font styles.
2.5 Morphological Analyzers : Given a word or a group of words, a morphological analyzer determines the root and all other inflectional forms. Morphological processing plays a vital role in the development of spell checkers and machine translation systems. The Centre has developed morphological analyzers for both Assamese and Manipuri.
a) Assamese Morphological Analyzer : The analyzer has been developed for use with the spell checker and the machine translation systems.
Technology Description
A stemming technique forms the base of the Assamese morphological analyzer: affixes are added or deleted according to linguistic rules, and the derived words are verified against the existing corpus/dictionary to be treated as valid words.
a) Currently works with eleven linguistic rules.
b) More rules can be added without altering the code.
c) Modules are available in the form of APIs for customization.
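The stemming approach can be sketched as "strip a known suffix, then validate the candidate root". The suffixes and roots below are transliterated placeholders, not the Centre's eleven rules:

```python
# Illustrative suffix list and root dictionary (not the real linguistic rules).
SUFFIXES = ["khon", "bor", "t", "e"]
ROOTS = {"kitap", "ghar"}

def analyze(word):
    """Return (root, suffix) if stripping a suffix yields a known root."""
    for suf in SUFFIXES:
        if word.endswith(suf) and word[:-len(suf)] in ROOTS:
            return word[:-len(suf)], suf
    return (word, "") if word in ROOTS else None

analysis = analyze("kitapkhon")
```

Keeping the rules in a data table, as here, is what lets new rules be added "without altering the code", as noted above.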
b) Manipuri Morphological Analyzer : A design for a Manipuri morphological analyzer is being investigated. Development of a root dictionary, a morpheme dictionary and an affix table has commenced, as has the identification of linguistic rules for nouns and pronouns. Linguistic rules for the other grammatical categories are being studied.
Technology Description
The Manipuri morphological analyzer will be realized using the same techniques used in the development of the Assamese morphological analyzer.
• Five rules are being used currently.
• Can be easily upgraded for more complex patterns.
• The modules developed so far are available as APIs and can be used as needed by other applications, such as the spell checker.
• A graphical user interface has been developed (see Figure 6).
Figure 6. Graphical User Interface of the Manipuri Morphological Analyzer
3. Translation Support Systems
3.1 Machine Translation System : The term Machine Translation (MT) refers to the process of performing or aiding translation tasks involving more than one human language: a system that translates natural language from a source language (SL) to a target language (TL).
Technology Description
The MT system developed is basically rule-based and relies on a bilingual dictionary. It can currently handle translation of simple sentences from English to Assamese. The dictionary contains around 5,000 root words. The system translates source-language text to the corresponding target-language text word for word by means of bilingual dictionary look-up; the resulting target-language words are then re-organized according to the target-language sentence format. To improve the output quality, the system performs morphological analysis before proceeding to the bilingual dictionary look-up.
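The word-for-word look-up followed by reordering can be sketched as below. The three-entry dictionary and the naive subject-verb-object to subject-object-verb swap are illustrative stand-ins for the Centre's 5,000-word dictionary and 22 rules:

```python
# Tiny transliterated English->Assamese dictionary (illustrative only).
BILINGUAL = {"i": "moi", "rice": "bhat", "eat": "khau"}

def translate(sentence):
    """Word-for-word lookup, then a naive SVO -> SOV reordering."""
    words = [BILINGUAL.get(w, w) for w in sentence.lower().split()]
    if len(words) == 3:                  # crude stand-in for reordering rules
        words = [words[0], words[2], words[1]]
    return " ".join(words)

out = translate("I eat rice")
```

Even this toy shows why reordering rules matter: without the final swap, the output would follow English word order rather than the target-language sentence format.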
Currently the system handles general-purpose simple translation; it is being upgraded to handle complex sentences, and its efficiency can be improved by selecting a specific domain. The rule-based machine translation system currently contains 22 rules and has been tested on more than 250 frequently used simple sentences. An example of machine translation from English to Assamese is depicted in Figure 7.
Figure 7. English-Assamese Machine Translation
4. Human Machine Interface Systems
4.1 Optical Character Recognition for Assamese and Manipuri : Optical Character Recognition (OCR) is the process of converting scanned images containing text into a computer-processable format (such as ASCII, ISCII or UNICODE). An MoU to transfer the OCR technology from the Resource Centre at the Indian Statistical Institute, Kolkata was signed in August 2002, and the transfer was effected accordingly in September.
Technology Description
The system takes a grey-level (8-bit) TIFF (Tagged Image File Format) image as input. Images scanned from Assamese (or Manipuri) books or from paper documents can be processed by the OCR software. The current version of the OCR system produces output in the ISCII format, which can be viewed or edited using any editor supporting ISCII.
Features : Table 3 shows the salient features of the OCR system, while Figure 8 shows the associated GUI.
Features Specifications of Assamese & Manipuri OCR
Scanning resolution 300-600 dpi
Input image TIFF (8-bit grey level)
Skew detection +5 to -5 degrees
Skew correction +5 to -5 degrees
Font name Assamese: Gitanjalilight, Luit & AS-TTDurga; Manipuri: font used by the publisher, Manipuri Sahitya Parishad, Imphal
Font size Assamese: 12-28 points; Manipuri: 12-18 points
Test data size Assamese: 600 pages from books & printed documents; Manipuri: 100 pages
Template size Assamese: 2600; Manipuri: 1800
Post-processing Morphological analysis
Output file format ISCII format
Accuracy Assamese: 95% (without post-processing); Manipuri: 90% (without post-processing)
Portability/Expandability Windows 98/2000/XP & LINUX
Table 3. Specifications of the Assamese/Manipuri OCR System
Figure 8. GUI of the OCR System
4.2 Speech Recognition System
Automatic Speech Recognition (ASR) for Assamese, which is concerned with the problem of recognition of human speech by a machine, is the core of a natural man-machine interface. A speaker-dependent continuous speech recognition system with a fixed vocabulary has been developed for the Assamese language. The objectives of the system are:
• Assamese and English Spoken Digit Recognition
• Study of Noise Effects on Assamese Recognition
• Phonetic Alignment of Assamese Digits
A set of acoustic rules has been formulated from the acoustic-phonetic features of Assamese and English to classify a given sound. Rules based on these features show an overall success rate of around 76% on randomly collected test utterances in Assamese. The classification rate is lowest for sounds in the nasal class (65% success rate). Sounds in the vowel, stop and fricative classes show better results (80%, 85% and 85% success rates respectively), while the classification results for diphthongs (70% success rate) were not as good. The unique Assamese sound /x/ is found to be acoustically very similar to /s/. A comparative study of the acoustic properties of Indian spoken English and Assamese vowels has also been performed.
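Rule-based classification of this kind might be sketched as follows; the feature names, cues and rule order here are hypothetical, since the report does not specify the actual acoustic rules:

```python
# Illustrative sketch of rule-based sound classification from
# acoustic-phonetic features. All feature names and the rule order
# below are invented for illustration.

def classify_sound(features):
    """Assign a broad phonetic class from simple acoustic cues."""
    if features["nasal_murmur"]:          # low-frequency nasal resonance
        return "nasal"
    if features["silence_then_burst"]:    # closure followed by a release burst
        return "stop"
    if features["high_freq_noise"]:       # sustained frication energy
        return "fricative"
    if features["formant_glide"]:         # formants move between two targets
        return "diphthong"
    return "vowel"                        # steady formant structure

print(classify_sound({"nasal_murmur": False, "silence_then_burst": False,
                      "high_freq_noise": True, "formant_glide": False}))
```

A real system would derive such cues from the speech signal itself; the dictionary input here simply stands in for that front end.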
4.3 Interface for e-Dictionary: An interface for the English-Assamese and Assamese-English dictionaries has been developed that allows users to choose between the two languages, enter a word, and find its equivalent in the other language. Based on a client-server architecture, the system allows users to access the Dictionary Server using a Java applet opened via a browser.
Technology Description: The basic components that make up the on-line dictionary are:
1. Client: The client is provided with a Graphical User Interface (GUI) in the form of a Java applet, shown in Figure 9, initiated from the web browser. The user invokes the URL of the dictionary, chooses the source language, types the word in the search window provided in the applet and clicks on the Search button. The query is sent to the server and the results are displayed at the client end.
Figure 9. Graphical User Interface of the e-Dictionary
2. Web Server: Apart from the main HTML page, this server hosts the Java jar files and serves the applet classes to the client.
3. Dictionary Server: Coded in Visual Prolog, the Dictionary Server has been developed as a menu-based application that assists the administrator in maintaining the dictionaries, stopping and restarting the server, and monitoring incoming requests. The server provides information on the requests and the actions taken at run time. The current version runs on a Windows platform but can be ported to a Linux system with minor changes, making it virtually platform independent. Whenever a connection to the server is established, it checks the incoming request from the client, consults the database, searches for information on that particular word and then acknowledges the request by serving the information.
4. Database: The e-dictionaries that comprise the database were initially developed using MS Access. These are then converted to Prolog facts for faster access by the search engine. The server provides the administrator an option to open an ODBC (Open DataBase Connectivity) connection to the MS Access database and perform the conversion, thereby uploading the latest versions of the dictionaries. The current version of the system supports two (English-Assamese and vice versa) of the four electronic dictionaries (viz. English-Assamese, Assamese-English, English-Manipuri and Manipuri-English) being created.
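The row-to-fact conversion described above might look like the following sketch; the predicate name `entry` and the sample rows are invented for illustration, as the report does not give the actual Prolog schema:

```python
# Hypothetical sketch of converting dictionary rows (as exported from
# MS Access via ODBC) into Prolog facts for the search engine.

def to_prolog_fact(entry_word, translation):
    """Render one dictionary row as a Prolog fact."""
    # Escape single quotes so the generated term stays valid Prolog.
    w = entry_word.replace("'", "\\'")
    t = translation.replace("'", "\\'")
    return f"entry('{w}', '{t}')."

rows = [("house", "ghor"), ("river", "noi")]   # placeholder entries
facts = [to_prolog_fact(w, t) for w, t in rows]
print("\n".join(facts))
```

Consulting the generated file then makes each entry available to the Prolog search engine as an indexed fact.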
a) User-friendly GUI.
b) Dictionaries can be used to aid machine translation and/or spell-checking applications.
c) Pronunciations of words are available in the form of wave files.
d) APIs are provided for prospective programmers to mould the dictionary information to suit their custom applications.
e) Both a stand-alone and a web-enabled version of the e-dictionaries are available.
5. Language Technology Human Resource
5.1 Workshops Conducted
Two workshops were conducted to disseminate awareness of how to use Indian languages in IT.
(i) A one-day workshop on “Natural Language Processing” was conducted at the Indian Institute of Technology Guwahati on the 31st of March, 2001. The workshop was intended to act as a forum for promoting interaction among interested graduate students and researchers working on natural language processing and allied areas. The topics addressed in the workshop were:
• Introduction to Computing
• Natural Language Processing
• Linguistics
• Demonstration of language technology solutions
(ii) A Training Programme on “Web Page Design and Office Automation Using Assamese” was held at the Indian Institute of Technology Guwahati on the 15th & 16th of March, 2002. The Training Programme was designed for the benefit of web designers and people who work with Assamese word processors. With State Government officials using Assamese as a medium of communication, the need for office automation in this language is picking up fast. Likewise, the demand for putting up information in the native language on the web is also increasing. This training programme was aimed at disseminating techniques to generate e-content in Assamese and to use the language for routine office tasks. It was specifically designed to encourage state government employees to use Assamese for office automation. Participants ranged from employees of Central, State and Public sector undertakings to private entrepreneurs. The course contents of the workshop were:
• Office Automation
• Web development in Assamese
• Creation and manipulation of databases and spreadsheets in Assamese
• Hands-on session
6. Standardization
A draft of the Assamese Unicode code-set was prepared after due consultation with linguists and the State Govt. and submitted to the Ministry of Communications & Information Technology, Govt. of India for further evaluation. The Assamese code-set is similar to that of Bengali, the exceptions being:
• The letter Ra has code 09F0 instead of 09B1, and the letter Wa has code 09F1 in lieu of
• An additional letter Khya with code 09BB is introduced.
7. Publications
1. Monisha Das, S. Borgohain, Juli Gogoi, S. B. Nair, Design and Implementation of a Spell Checker for Assamese, Proceedings of the Language Engineering Conference, LEC 2002, December 2002, Hyderabad, published by the IEEE Press.
2. Monisha Das, S. Borgohain, S. B. Nair, Spell Checking in MS Word for Assamese, Proceedings of ITPC-2003: Information Technology: Prospects and Challenges in the 21st Century, Kathmandu, Nepal, May 23-26, 2003.
8. Manika Newsletter
Snapshot of the Manika: e-newsletter in Manipuri
Manika, an e-newsletter in Manipuri, was launched on Republic Day this year (26th January 2003) to mark the beginning of a new electronic information era in Manipur, the Land of Jewels. The newsletter, hosted by the RCILTS at the Indian Institute of Technology Guwahati, will bring news, share knowledge of the heritage and local innovations of Manipur, and help the Manipuris stay well informed in the electronic age and forge ahead.
9. The Team Members
Prof. Gautam Barua
Dr. S.B.Nair
Dr. S.V.Rao
Dr. P.K.Das
Dr. Dipankar Moral
Samir Kr. Borgohain
Sushil Kr. Deka
Monisha Das
Nilima Deka
Juli Gogoi
Dr. L.Sarbajit Singh
Sirajul Chawdhury
Courtesy: Prof. Gautam Barua
Indian Institute of Technology
Department of Computer Science & Engineering
Panbazar, North Guwahati,
Guwahati -781 031, Assam
(RCILTS for Assamese & Manipuri)
Tel: 00-91-361-2690401, 2690325-28
Extn 2001, 2452088
E-mail:
Indian Institute of Science
Centre for Electronics Design & Technology, Bangalore - 560012.
Tel.: 00-91-80-2932377, 2933267
Resource Centre For
Indian Language Technology Solutions – Kannada
Indian Institute of Science, Bangalore
Achievements
The Resource Centre for Indian Language Technology Solutions-Kannada at the Indian Institute of Science chose a very broad spectrum of activities, and undertook several activities related to languages other than Kannada. Participation by a large number of faculty of the Institute enabled the Resource Centre to create knowledge bases, web sites, OCR for Tamil and Kannada, speech synthesis for Tamil and Kannada, resource information on European history in Hindi for high school children, a bilingual interactive German-Hindi course, CDs in support of learning Kannada at primary and high school level, a variety of tools for language processing, a wordnet in Kannada, and investigations into language identification. It was also recognized that the language technology segment of the IT industry has remained very small in India for a variety of reasons. Two important approaches were taken towards improving this. One is to get some jobs done by the industry instead of getting them done through project assistants, and the other is to place many of the basic tools developed in the public domain. The web site KANNUDI should meet the long-felt need of Kannadigas to know about their language and state. A large number of scholars in various disciplines actively contributed to making this web site very rich. An organization like the Kannada Sahitya Parishat felt enthused enough to permit this resource centre to create a web site on their activities and have it co-located with KANNUDI. In an attempt to make a difference to an important segment of the population, namely primary and high school students, the Bodhana Bharati series was created, which provides interactive learning of the Kannada language as per the state syllabus. Even the process of creation sensitized a large number of teachers to the usefulness of IT tools in education, and enabled them to actively participate in creating the CDs. It was realized that manpower availability in the area of language technology is very poor. The web site LT-IISC was created to enthuse both faculty and students at engineering colleges to do their mini and main projects in language technology related activities.
This phase of the Resource Centre created adequate momentum in creating IT tools for Kannada and made a beginning that would ensure that Kannadigas need not feel alien in their own state.
1. Web Sites And Support To Instruction
1.1 Kannudi
Kannudi is a bilingual web site for the benefit of Kannadigas who wish to learn about their language and heritage. Non-Kannadigas will also have an opportunity to learn about Karnataka. It is mainly aimed at non-specialists, and provides a first-level overview of a large number of topics. The contents of the site are organized under 11 topics: Language, Epigraphs, Literature, Folklore, Geography, History, Arts, Classics, Personalities, Temples and Festivals, and Cultural Societies and their Activities. Some of the important features of Kannudi are:
• The site has more than 2000 pages of content.
• Most of the articles are written by experts in the respective areas.
• Images have been incorporated wherever possible.
• Brief life sketches of a large number of personalities of Karnataka are provided.
• Several classics of Kannada literature are made available on the web site.
The web site also hosts the web site of the Kannada Sahitya Parishad. The contents of the site are continuously enhanced. The web site address is
1.2 LT-IISC
This is a web site to meet the needs of students, teachers, and developers interested in and concerned with language technologies in general, with particular emphasis on the Kannada language. The main sections of the web site are standards, language resources, office automation, web technologies, optical character recognition, speech technologies, machine translation, multi-lingual issues, applications, and open source software. The web site has a download section common to all topics, which also carries all the products generated at IISc. The web site also features a Discussion Forum on each topic through which interested people can exchange their views regarding technologies, solve problems (if any), etc. Information on events, seminars and workshops in the area of language technologies is also provided. Annotated links to other web sites on language technologies, language products, and standards are also given.
The address of the site is
Publications:
1. Narmadamba, K.: e-Learning in Kannada (Kannadadhalli e Kalike) (book), Kannada Sahitya Parishat, 2002.
2. Narmadamba, K.: Kannada and Computers, in Samyukta Karnataka, April 2003.
3. Narmadamba, K.: Speech Variation in South Indian Languages with Respect to Dialect, Emotion and Style, Sadhane, Bangalore University, May.
1.3 Bodhana Bharathi: Multimedia Educational CDs for the 7th, 8th and 10th Standards
It has been observed that the quality of teaching the Kannada language at primary and high school levels needs considerable improvement. It was, therefore, decided to create interactive multimedia learning material, useful to both teachers and students, in the form of CDs. This CD series is named “Bodhana Bharathi”. These CDs are meant to supplement the textbooks and classroom teaching. The framework within which these CDs were developed is:
1. To improve listening and writing skills.
2. To enable the students to perform better in the examinations.
3. To appreciate the background in which the lessons were created by the original authors.
4. To make them understand and feel proud about their nationality and culture, literature, people and language.
5. To enjoy poems and songs.
Some of the specific aspects of the instructional material include:
1. Introducing the student to the writers of the lessons and their works.
2. Making them understand the important points and the moral of the lessons.
3. Exposing them to new words that they come across in the lessons.
4. Helping them recite poems appropriately.
5. Testing their knowledge of the lessons through unit tests.
The material is divided into three sections: a Teacher's section, a Student's section and a Common section. The teacher's section has information such as the lesson plan, objectives and others. The student's section contains extra questions, new words, etc. The common section comprises a preamble, a summary of the lesson, the values in the lesson, key points of the lesson supported by visuals, a model question paper with partial interaction and answers, and meanings of words with relations.
1.4 Bilingual Instructional Aid for learning
German through Hindi
Both Hindi and German belong to the Indo-European family of languages. There are many similarities between the two in terms of grammar, vocabulary, etc., and the two are semantically quite close to each other. As a consequence, it is easier to explain an unknown German word, phrase or idiom in Hindi, and vice versa, than by using English. Moreover, with the growing number of joint ventures and offshore projects, there has been increasing interaction between Germans and Indians. If visitors from Germany need to stay in India for a longer period, some knowledge of Hindi would be of great advantage in making professional, social and cultural contacts. Similar would be the case for Indian visitors to Germany. This was the motivation for developing a bilingual instructional aid for Hindi-German in the form of web-based material.
This web site has four sections:
The Learn Hindi section teaches the Hindi alphabet, vowels, consonants, counting and ways of introducing oneself.
The Travelling to India section gives information on how to speak at an airport, restaurant, shopping complex, bank, post office, hospital or railway station, and when renting a house.
The Life in India section has information on culture, lifestyles, tradition, food, etc.
The Exercises section provides practice exercises.
This information can be accessed at www.mgmt.iisc.ernet.in/~fls/projects and www.mgmt.iisc.ernet.in/~fls/german
1.5 Information Base in Hindi Pertaining to
German History
The Information Base primarily concentrates on presenting information about Germany in a perspective that appeals to Indian school children. The aim of this project is also to put all this material into Hindi so as to make the information accessible to as many school children as possible. NCERT has now undertaken to provide all school children access to information materials both in Hindi and English. This project made use of the knowledge of the foreign language experts at the Institute, and evolved materials in Hindi pertaining to German history. This was done in conjunction with the prescribed syllabus.
The web site contains information on the German Anthem, Germany in Europe, German Weather, German Population, World War II, the Construction and Fall of the Berlin Wall, German Automobiles, German Beer and Wine, German Culture, the German Economy, German Unification, German Literature, Political Parties, the Social Security System, the European Union, Indo-European Languages, and the German cities of Berlin, Brandenburg, Cologne, Hamburg, Frankfurt, Munich and Trier.
2. Knowledge Bases
2.1 Sudarshana: A Web Knowledge Base on
Darshana Shastras
The main objective was to create a “knowledge base” containing basic texts, commentaries and free translations of texts on the Systems of Indian Philosophy, popularly known as Darshanas, in multimedia form, to have them web-enabled, and to make them available in the form of CDs. The target audience is classified into novices, students, and scholars.
All the resource material and the experts on all Darshanas have been identified. All the web sites on Sanskrit, and on the Darshanas in particular, have been surveyed comprehensively. A white paper on the Shad-darshanas has been created. A series of lectures by one of the experts on Ishvara in all the Darshanas and on the “Arthasamgraha” has been recorded. The audio files and transcribed versions of these lectures are made available on the site, along with a partial translation in English. The Nyayakusumanjali and other Sanskrit texts referred to in the above discourses are also made available on-line. A glossary of technical terms in the Nyaya, Vaisheshika and Mimamsa Darshanas, in English and Sanskrit, has been created. A search engine for both Sanskrit and English is also provided.
For any clarifications Dr. N. R. Srinivasa Raghavan can be contacted at raghavan@mgmt.iisc.ernet.in
2.2 Indian Logic Systems
The web site “PRAMITI: PRAMana – IT with Indian logic” is being developed to provide comprehensive information on various aspects of Indian Logic and its applicability to Computer Science and Information Technology. The web site is designed to be user-friendly, with a concise and objective style. The web address is
2.3 Indian Aesthetics
A web site on Indian Aesthetics in Product Design has been created with the following features:
• Identifies the body of knowledge in Indian Aesthetics from Literature, Culture and
• A visual database of Indian products, visuals and examples.
• Methods for designers to enable the incorporation of “Indianness” into product design.
• Learning materials for students in design programs regarding Indian Aesthetics, and the methodology of its incorporation into product design.
The contents are broadly classified as “philosophical”, “cultural”, and “pragmatic”.
The philosophical section contains extracts of ancient texts, book reviews, and review and research papers. It has a rich visual database of images from crafts, artifacts and general visuals. It also has a comprehensive glossary and a rich bibliography.
The cultural section has articles on Indian philosophy and Western philosophy. Case studies containing research in the area of product and visual semantics are also included.
The pragmatic section discusses the learning material for design students.
3. Technologies and Language Resources
3.1. BRAHMI: Kannada Indic Input method,
Word Processor
Some scripts have hundreds of individual characters, and it is not easy to put all these characters on a standard keyboard designed for simpler scripts. Input methods are developed to provide a better approach to inputting text for these languages. The Brahmi Kannada Input Method (BKIM) is one such method, which allows the user to give input in Kannada. The BKIM is developed in Java, which provides an input method framework that enables the user to enter text directly into a text component. Application developers who wish to have direct Kannada input in their applications can make use of the BKIM. As it is developed in Java, it is also platform independent. The BKIM uses the KGP keyboard layout (standardized by the Govt. of Karnataka), and each keystroke is mapped to its corresponding Unicode characters. It uses OpenType fonts.
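The keystroke-to-Unicode mapping at the heart of such an input method can be sketched as below; the key assignments are invented for illustration and do not reproduce the actual KGP layout:

```python
# Minimal sketch of an input-method keystroke-to-Unicode mapping.
# The key assignments here are hypothetical; the real BKIM follows
# the KGP keyboard layout standardized by the Govt. of Karnataka.

KEYMAP = {
    "k": "\u0c95",  # KANNADA LETTER KA
    "g": "\u0c97",  # KANNADA LETTER GA
    "a": "\u0c85",  # KANNADA LETTER A
}

def compose(keystrokes):
    """Map each keystroke to its Kannada code point; pass others through."""
    return "".join(KEYMAP.get(k, k) for k in keystrokes)

print(compose("ka"))  # KA followed by A
```

In the Java input method framework, an equivalent table drives the translation of key events before the text reaches the component.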
The Brahmi Multilingual Word Processor is a demonstration of the Brahmi Kannada Input Method. The word processor has the option of choosing the input method from among nine Indian languages (at present only Kannada is enabled) and English. Apart from normal features like file open, print, cut, copy, paste, font features, paragraph settings and colour changes, it also has some unique features such as saving files in different encodings (UTF-8, UTF-16, Unicode Big Endian, etc.), search and replace of Unicode text, and an e-mail facility. It has an easily accessible graphical toolbar and comprehensive help.
3.2 OpenType Fonts: Sampige, Mallige, Kedage
OpenType is a new cross-platform font file format developed jointly by Adobe Systems Incorporated and Microsoft. Based on the Unicode standard, the OpenType format is an extension of the TrueType SFNT format that can now support PostScript font data and new typographic features.
OpenType offers several compelling advantages:
• A single, cross-platform font file that can be used on both Macintosh and Windows platforms.
• An expanded character set based on the international Unicode encoding standard, for rich linguistic support.
• Advanced typographic capabilities related to glyph positioning and glyph substitution that allow for the inclusion of numerous alternate glyphs, such as old-style figures, small capitals and swashes, in one font file.
• A compact font outline data structure for smaller font file sizes.
OpenType fonts are thus well suited to meeting the demand for complex script handling and high-quality typography in today's global publishing and communication environment.
Three OpenType fonts called “Sampige”, “Kedage” and “Mallige” have been developed. Sampige and Kedage are text fonts whereas Mallige is a handwriting-style font. The fonts have been tested and are distributed along with the Brahmi Multilingual Word Processor and also individually. People are free to use these fonts in their applications and/or for documentation purposes.
3.3 Kannada WordNet: WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. The nouns, verbs, and adjectives of a language are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. The synchronic organization of lexical knowledge and the structured organization of words are helpful for natural language processing.
Although the design of Kannada WordNet has been inspired by the famous English WordNet, the unique features of Kannada WordNet are graded antonymy and meronymy relationships and an efficient underlying database design. Nominal as well as verbal compounding and complex verb constructions also play a vital role.
There are different organizing principles for different syntactic categories. Two kinds of relations are recognized in WordNet: lexical and semantic. Lexical relations hold between word forms; semantic relations hold between word meanings. The basic categories, nouns, verbs, adjectives and adverbs, are organized into synonym sets, each representing one underlying lexical concept. Synsets, or synonym sets, are the basic building blocks; a synonym set serves as an identifying definition of a lexical concept. The semantic relations covered in the Kannada WordNet are synonymy, antonymy, hypernymy-hyponymy, meronymy-holonymy, entailment, and troponymy. The WordNet will initially be built for 2000 words.
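The synset-and-relations organization described above might be modelled as in this sketch; the words, glosses and links are placeholders, not actual Kannada WordNet entries:

```python
# Schematic model of synsets linked by semantic relations. All data
# below is illustrative and not drawn from the Kannada WordNet.
from dataclasses import dataclass, field

@dataclass
class Synset:
    words: list                                     # synonymous word forms
    gloss: str                                      # identifying definition
    relations: dict = field(default_factory=dict)   # e.g. "hypernym" -> Synset

animal = Synset(["praani"], "a living creature")
dog = Synset(["naayi"], "a domesticated canine")
dog.relations["hypernym"] = animal   # semantic relation between meanings

print(dog.relations["hypernym"].gloss)
```

Lexical relations (between word forms) would be attached to the individual words rather than to the synset, mirroring the distinction drawn in the text.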
3.4 OCR for Tamil
An Optical Character Recognition (OCR) system capable of converting multi-lingual manuscripts to machine-readable codes is one of the key steps in working towards the goal of machine translation. OCR systems offer numerous applications that help in day-to-day activities. These include mass conversion of existing documents and literature into electronic format, reading aids for the blind as part of a text-to-speech converter, automatic sorting of mail in the postal department, processing of bank documents, and machine transliteration/translation of documents and literature in other scripts.
An OCR for the Tamil script has been developed that works in a multi-font and multi-size scenario. The input to the system is a scanned or digitized document and the output is in TAB code. The documents are expected to contain text only. The process sequence is given in the following block diagram.
Preprocessing is the first step in OCR, and involves binarisation, and skew detection and correction. Binarisation is the process of converting the input gray-scale image, scanned at a resolution of 300 dpi, into a binary image with the foreground as white and the background as black. Suitable techniques have been used to take care of contrast variations in the images. The skew angle of the document is estimated using a combination of the Hough transform and Principal Component Analysis. Segmentation is done by first detecting lines, then detecting the words in each line, followed by detection of the individual characters in each word. Horizontal and vertical projection profiles are employed for line and word detection, respectively. Connected component analysis is performed to extract the individual characters. The segmented characters are normalized to a predefined size and thinned before recognition.
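Line detection via the horizontal projection profile can be sketched as follows; this is a minimal version that splits on fully blank rows, whereas the real system must also cope with noise and touching lines:

```python
# Sketch of projection-profile line detection on a binary page image
# (foreground pixels = 1). Word detection works the same way on the
# vertical profile of each detected line.
import numpy as np

def detect_lines(binary_image):
    """Return (start, end) row ranges of text lines found via the
    horizontal projection profile."""
    profile = binary_image.sum(axis=1)   # amount of ink per row
    lines, start = [], None
    for row, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = row                  # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, row))   # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

page = np.zeros((10, 5), dtype=int)
page[1:3, :] = 1                         # first text line
page[5:8, :] = 1                         # second text line
print(detect_lines(page))                # [(1, 3), (5, 8)]
```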
The segmented symbols are fed to the classifier for recognition. The Tamil alphabet contains 154 different symbols. The characters are divided into clusters based on domain knowledge, to reduce the recognition time and the probability of confusion. This is accomplished by designing a three-level, tree-structured classifier for Tamil script symbols.
A line in any Tamil text has three different segments: upper, middle and lower. Depending upon the occupancy of these segments, each symbol is assigned to one of four different classes. The number of dots present in each segment contributes substantially to the classification, and is dependent on the scanning resolution. This constitutes the first-level clustering.
The second level of classification, based on matras/extensions, is applied only to symbols which have upward matras or downward extensions. The classes are further divided into groups, depending on the type of ascenders and descenders present in the character. This classification is feature based, i.e. the feature vectors of the test symbol are compared with the feature vectors of the normalized training set. The features used at this level are second-order geometric moments and the classifier employed is the nearest neighbour.
Feature-based recognition is performed at the third level. For each of the groups, the symbol normalization scheme is different. The dimensions of the feature vector differ across groups, as their normalization sizes are different. Truncated Discrete Cosine Transform (DCT) coefficients are used as features at this level of classification. A nearest neighbour classifier is used for the classification of the symbols.
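The third-level scheme, truncated 2-D DCT features with a nearest-neighbour decision, can be sketched as below; the symbol size and the number of retained coefficients are arbitrary choices here, not those of the actual system:

```python
# Sketch of truncated-DCT features plus nearest-neighbour matching.
# Symbols are assumed already normalized to a common square size.
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m

def dct_features(img, keep=4):
    """Truncated 2-D DCT: keep the low-frequency keep x keep block."""
    d = dct_matrix(img.shape[0])
    coeffs = d @ img @ d.T               # separable 2-D DCT
    return coeffs[:keep, :keep].ravel()

def nearest_neighbour(feat, templates):
    """Return the label of the closest training template."""
    dists = {label: np.linalg.norm(feat - f) for label, f in templates.items()}
    return min(dists, key=dists.get)

a, b = np.eye(8), np.ones((8, 8))        # two stand-in "symbols"
templates = {"sym_a": dct_features(a), "sym_b": dct_features(b)}
print(nearest_neighbour(dct_features(a), templates))  # sym_a
```

Because each group has its own normalization size, the real system keeps a separate template set (and feature dimension) per group.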
The system has been tested on files taken from Tamil magazines and novels scanned at 300 dpi (dots per inch). It has been ensured that the test files contain almost all the symbols present in the script. An accuracy of over 99% has been achieved on the training set and 98% on other samples. The sample size was 100.
[1] K G Aparna and A G Ramakrishnan, “A Complete Tamil Optical Character Recognition System”, Document Analysis Systems V, 5th International Workshop, DAS 2002, Princeton, NJ, USA, August 19-21, 2002, pp. 53-57.
[2] K G Aparna and A G Ramakrishnan, “Tamil Gnani – an OCR on Windows”, Proc. Tamil Internet 2001, Kuala Lumpur, August 26-28, 2001, pp. 60-63.
3.5 OCR of Printed Text Documents in Kannada
The input to the system is the image of the printed
page obtained by scanning on a flatbed scanner
at 300 dpi resolution and converting the image
into binary by making use of a global threshold
selected automatically for each page. This image
is processed to remove any skew (so that the text
lines are aligned horizontally) using a Hough
transform based technique. Next, the individual
lines in the image and the words in each line are
separated using projection profile based methods.
Due to the special characteristics of the Kannada
script, separating the individual characters in a
word is not an attractive choice. Hence a novel
segmentation algorithm has been developed which
segments words at the sub-character level, so that
each akshara may be composed of many segments.
A pattern classification method based on Support
Vector Machines is used to assign a classification
label to each segment so obtained. After labelling
individual segments, rules on how aksharas are
composed are applied to finally effect recognition
of individual aksharas. The final output of the
system is an ASCII file compatible with the kantex
typesetting package for Kannada, which is built
around the standard LaTeX system.
The words are first vertically segmented into three
zones as shown in the figure. The next task is to
segment the three zones horizontally. The middle
zone is the most critical since it contains a major
portion of the letter. After segmenting the document,
each segment has to be recognized to effect final
recognition. Features (a set of numbers that capture
the salient characteristics of the segment image) are
extracted from each segment. The characters in
Kannada have a rounded appearance. Therefore,
features which capture the distribution of the ON
pixels in the radial and the angular directions are
effective in capturing the shapes of the characters.
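Such radial and angular distributions of ON pixels can be sketched as simple histograms about the image centre. The ring and sector counts below are illustrative choices, not those of the actual system.

```python
import math

def radial_angular_features(pixels, n_rings=4, n_sectors=8):
    # pixels: 2-D binary image (list of lists, 1 = ON pixel). Histogram the ON
    # pixels over concentric rings (radial) and angular sectors about the centre.
    h, w = len(pixels), len(pixels[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rmax = math.hypot(cy, cx) + 1e-9
    rings = [0] * n_rings
    sectors = [0] * n_sectors
    for y in range(h):
        for x in range(w):
            if pixels[y][x]:
                r = math.hypot(y - cy, x - cx)
                theta = math.atan2(y - cy, x - cx) % (2 * math.pi)
                rings[min(int(n_rings * r / rmax), n_rings - 1)] += 1
                sectors[min(int(n_sectors * theta / (2 * math.pi)), n_sectors - 1)] += 1
    total = max(1, sum(rings))
    # Normalize so the feature vector is independent of glyph size.
    return [c / total for c in rings] + [c / total for c in sectors]
```

The resulting vector would then be fed to the SVM classifier mentioned below.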
The extracted features have to be classified using
a classifier; here a Support Vector Machine
(SVM) classifier is used.
The output after classification is transformed into
a format which can be loaded into a Kannada
editing package. The input is usually ASCII text
in which aksharas are encoded as ASCII strings.
The system was tested on pages scanned from
Kannada magazines, etc. Currently the system
recognizes about 85% of the aksharas correctly.
4. Research
4.1 Automatic Classification of Languages Using
Speech Signals
Automatic language identification is the problem
of identifying the language being spoken from a
sample of speech from an unknown speaker.
Among various approaches for language
identification, phone recognition offers
considerable promise, as it incorporates sufficient
knowledge of the phonology of the language to
be identified without incurring the significantly
higher cost of word based approaches. An
approach based on sub-words that does not
require manually labelled data in any of the
languages recognized is used. In particular, the
focus was on the specific architecture termed
Parallel Phone Recognition (PPR), and the system
is referred to as a Parallel Sub-word Recognition
system.
Research in automatic language identification
requires a large corpus of multi-lingual speech data
to capture many sources of variability within and
across the languages. Among the various Indian
languages, six were selected: Hindi, Kannada,
Malayalam, Marathi, Tamil and Telugu. English
as spoken by the same Indian speakers is the
seventh language. For each language, twenty adult
speakers of different ages and gender were selected.
Care was taken to ensure that a speaker was chosen
for a particular language only if he/she had that
language as a native language, particularly during
childhood. Speech was collected using a Sennheiser
HMD224 noise-canceling microphone and low pass
filtered at 7.6 kHz. The recording protocol was
designed to obtain digits and days of the week,
numbers in English, the English alphabet, commonly
used words in the native languages, railway
reservation words in the native language and in
English, and banking words in the native language
and English. The elicited free speech included
personal details such as name, age, native language,
profession and family, as well as passage reading
in Hindi, English and the native languages.
Here three approaches are studied, namely,
Parallel Sub-Word Recognition (PSWR),
SWRLM, and Parallel-SWRLM. Tables (not
reproduced here) show the performance of the
PSWR system on the OGI-TS and ILDB databases.
The SWRLM approach uses a single front-end
sub-word recognizer followed by N back-end LMs
for an N-language LID task. The front-end SWR
can be language dependent or language
independent. The SWRLM performance will
improve if we use a language independent
sub-word unit inventory which is obtained from
all the languages in the LID task.
Results: Table III shows the LID performance
of SWRLM on the OGI-TS database for each of
the front ends for both cases of MLC. Table IV
shows the performance of SWRLM on ILDB for
all six front ends and both cases of MLC.
From the above tables we note that LID
performance on ILDB is much better than on
the OGI-TS database for training data, whereas
for test data the OGI-TS database gives better
results than the ILDB database.
The sounds in the language to be identified do
not always occur in the one language used to
train the front-end sub-word recognizer. Thus it
seems natural to look for a way to incorporate
phones from more than one language into an
SWRLM-like system. Alternatively, another
approach is simply to run multiple SWRLM
systems in parallel, with the single-language
SWRs each trained on a different language.
Therefore P-SWRLM uses multiple front-end
SWRs, with each SWR followed by N back-end
LMs for an N-language LID task.
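The back-end decision of such a parallel architecture can be sketched under heavy simplification: each language is represented by a toy bigram language model over sub-word labels, and the best-scoring language wins. The data structures and names here are illustrative, not those of the actual system.

```python
import math

def score(seq, bigram_logp, default=-10.0):
    # Log-probability of a sub-word label sequence under one language model.
    # Unseen bigrams receive a fixed penalty (a crude form of smoothing).
    return sum(bigram_logp.get((a, b), default) for a, b in zip(seq, seq[1:]))

def identify_language(seq, models):
    # models: {language: bigram log-probability table}.
    # Return the language whose model scores the sequence highest.
    return max(models, key=lambda lang: score(seq, models[lang]))
```

In the real system, the sub-word label sequence would itself come from the front-end recognizer(s), and bias removal and proper classifiers (MLC, GC) replace this bare argmax.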
The database and preprocessing with
parameterization are identical to those in the
PSWR approach. The bias removal was done by
Zissman's method. Results of LID accuracy for
the six languages Hindi, Kannada, Malayalam,
Marathi, Tamil and Telugu with the PSWRLM
system, with two types of classifiers, Maximum
Likelihood Classifier and Gaussian Classifier, are
given in plots (not reproduced here) that compare
the LID accuracy of the PSWRLM system for the
two classifiers, i.e., MLC and GC-MD, across the
OGI and ILDB databases.
The following conclusions may be drawn:
• The sub-word approach to LID, developed in
our lab, holds good promise for LID among a
small set of languages.
• The LID performance is quite good among the
south Indian languages, although they share a
lot of phonetic structure and even vocabulary.
• The PSWR approach is more promising than
the SWRLM and PSWRLM approaches to LID.
4.2 Algorithms for Kannada Speech Synthesis
The basic units, namely CV, VC, VCV and
VCCV, have been identified and recorded. A
framework has been standardized for the creation
and handling of a database of spoken basic units.
A synthesis scheme based on waveform
concatenation of the basic units has been attempted.
New techniques for pitch detection and
modification, and for speech synthesis with
emotion, are proposed.
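Waveform concatenation of recorded units can be sketched as follows. This is a minimal illustration, assuming pre-recorded unit waveforms are available as sample arrays; a short linear cross-fade at each join is one common way to soften discontinuities (the actual scheme is not described in that detail here).

```python
def synthesize(unit_names, unit_waveforms, crossfade=8):
    # Concatenate recorded basic units (CV, VC, VCV, VCCV) into one waveform.
    # unit_names: sequence of unit labels; unit_waveforms: {label: [samples]}.
    out = []
    for name in unit_names:
        wav = list(unit_waveforms[name])
        if out and crossfade:
            # Linearly blend the tail of the output with the head of the new unit.
            n = min(crossfade, len(out), len(wav))
            for i in range(n):
                w = (i + 1) / (n + 1)
                out[-n + i] = (1 - w) * out[-n + i] + w * wav[i]
            wav = wav[n:]
        out.extend(wav)
    return out
```

A production synthesizer would additionally apply the pitch detection and modification techniques mentioned above so that concatenated units match in prosody.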
5. Publications
1. R. Muralishankar, K. Suresh and A. G.
Ramakrishnan, “DCT based approaches to
Pitch Estimation”, submitted to Signal
2. K. Suresh and A. G. Ramakrishnan, “A DCT
based approach to Estimation of Pitch”, Proc.
Intern. Conf. on Multimedia Processing and
Systems, Chennai, Aug. 13-15, 2000, pp. 54-
3. R. Murali Shankar and A. G. Ramakrishnan,
“Robust Pitch detection using DCT based
Spectral Autocorrelation”, Proc. Intern.
Conf. on Multimedia Processing and Systems,
Chennai, Aug. 13-15, 2000, pp. 129-132.
4. R. Murali Shankar and A. G. Ramakrishnan,
“Synthesis of Speech with Emotions”, Proc.
Intern. Conf. on Commn., Computers and
Devices, Vol. II, Kharagpur, Dec. 14-16,
2000, pp. 767-770.
5. Anoop Mohan, Harish T, A. G.
Ramakrishnan and R. Muralishankar, “Pitch
modification sans pitch marking”, submitted
to Intern. Conf. Acoustics, Speech and Signal
Processing, May 7-11, Salt Lake City, Utah,
USA, 2001.
Courtesy: Prof. N.J. Rao
Indian Institute of Science
Centre for Electronics Design and Technology
Bangalore – 560 012
(RCILTS for Kannada)
Tel. : 00-91-80-3466022, 3942378, 3410764
University of Hyderabad
Department of CIS, Hyderabad-500046
Tel. : 00-91-40-23100500 Extn. : 4017 E-mail :
Website : http://
Resource Centre For
Indian Language Technology Solutions – Telugu
University of Hyderabad
Achievements
University of Hyderabad, Hyderabad
University of Hyderabad is a premier institute of
higher education and research in India. The University
Grants Commission has selected University of
Hyderabad, among four others in the country, as a
“University with Potential for Excellence”. The
National Assessment and Accreditation Council
(NAAC) has awarded it the highest rating of five stars.
University of Hyderabad is the only University in
India to be included among the top 50 institutions
in the country under the “High Output - High Impact”
category by The National Information System for
Science and Technology (NISSAT) of the Department
of Scientific and Industrial Research. University of
Hyderabad has been rated the “number one”
University in India in sciences by the Department of
Scientific and Industrial Research.
1. Resource Centre for Indian Language Technology
This Resource Centre for Indian Language
Technology Solutions was established by the
Ministry of Communications and Information
Technology, Government of India, at the University
of Hyderabad with a funding of nearly Rs.
100,00,000 spread over three years - April 2000
to March 2003. The project has since been extended
till 30th September 2003 to enable thorough
consolidation of all the work done. Two
departments, eight members of the faculty, and 20
to 30 students and research staff at any given
point of time have put in their very best for the
past three years, and several products, services and
knowledge bases have been developed. The core
competencies, the data, tools and other resources
developed here during this period will enable
this team to scale new heights in future.
2. Products
2.1 DRISHTI: Optical Character Recognition (OCR)
An Optical Character Recognition (OCR) system
converts a scanned image of a text document into
electronic text, just as if the text matter was typed in
by somebody. Scanned images are much larger in size
compared to the corresponding text files. The statement
“A picture is worth one thousand words” is literally
true here. Texts occupy less storage space and less
network bandwidth when sent across a
network. Converting images into texts makes it
possible to edit and process the contents as normal
text.
OCR systems can be used to convert available
printed documents into electronic texts without
typing. Since an OCR engine can be run day and
night on several computers in parallel, we can
generate large scale corpora with less time and
effort. OCR engines can also be used for a variety
of other applications. OCR systems have just started
appearing for Indian scripts. Most of the current
OCR systems for Indian languages are designed
only for printed texts and perform well only on
reasonably good quality documents. However,
research work on hand-written document
recognition is going on.
Drishti is a complete Optical Character Recognition
system for the Telugu language. Currently, it handles
good quality documents scanned at 300 dpi with a
recognition accuracy of approximately 97%. The
system has been tested with a number of different fonts
provided by C-DAC and Modular Infotech, and on
several popular novels, laser and desktop printer
generated pages, and books. Preprocessing modules that
separate textual and graphic blocks, handle multi-
column text inputs, and perform skew correction are
also implemented. Drishti is the first comprehensive
OCR system for Telugu.
A truthing tool with facilities for creating ground-
truth information, and to review the ground truth
against image data, is also implemented. Such
truthing tools are extremely important in the objective
and quick evaluation of OCR system performance.
Benchmark standards were proposed in collaboration
with Indian Statistical Institute (ISI) Kolkata, to
enable uniform and objective evaluation and
performance comparison of OCR systems and
subsystems. In addition, several other useful modules
and library functions that enhance or simplify
adding new features are developed. Initial work is
also done on touching characters, which led to
identifying the major characteristics and issues in
addressing the problem. Ours is the only major work
in this area apart from that by ISI, Kolkata for Bangla
characters [4].
Drishti, although designed for Telugu, has been tested
with Kannada, Malayalam and Gujarati scripts, with
recognition accuracies over 90%. Our OCR
technology was transferred to the resource centre for
Gujarati. Work is underway in extending the system
to the Amharic script of Ethiopia.
2.1.1 System Overview
Drishti contains three stages: the preprocessing stage,
the recognition stage and the postprocessing stage.
Binarization, separation of image regions into textual
and graphical regions, multi-column detection and
skew correction are the major tasks performed in the
preprocessing phase. Separation of text into glyphs,
characters, words and lines, and recognition of
individual glyphs, are tasks of the recognition stage.
Postprocessing comprises combining the recognized
glyphs into valid characters and syllables, and spell-
checking.
(Figures, not reproduced: DRISHTI: An OCR System
for Telugu and other Indian Languages; DRISHTI:
Truthing Tool; DRISHTI: How it works.)
Preprocessing Stage
Binarization : refers to the conversion of a scanned
256-gray level image into a two-tone or binary (pure
black and white) image. A binary image is
appropriate for OCR work as the image document
contains only two useful classes of data — the
background, usually paper, and the foreground, the
printed text. It is common to represent the background
paper colour by white-coloured pixels and the text
by black-coloured pixels. In image processing jargon,
the background pixels have a value of 1 and the
foreground pixels have a value of 0. Binarization
has a significant impact because it provides input
to every stage of an OCR system. Drishti provides
three options — global (the default), percentile based
and an iterative method — to achieve the desired
performance on different types of scanned
documents and scanners.
Skew Detection and Correction : deal with the improper
alignment of a document while it is scanned. The
normal effect is that the lines of text are no longer
horizontal but at an angle, called the skew angle.
Documents with skew cause line, word and character
breaking routines to fail. Skew also causes a reduction
in recognition accuracy. In Drishti, skew detection
and correction are done by maximizing the variance
in the horizontal projection profile.
Text and Graphics Separation: refers to the process
of identifying which regions of the document image
contain text and which regions contain pictures and
other non-text information that is not processed by
an OCR system. Drishti uses horizontal and vertical
projection profiles for such separation as well as for
many other preprocessing operations (see below). A
horizontal profile is obtained by counting and plotting
the number of text or black pixels in each row of the
image. A vertical profile is obtained by counting the
black pixels in each column of the image. Horizontal
profiles show distinct peaks that correspond to lines
of text and valleys that result from inter-line gaps.
A line of text is revealed by a peak in the
horizontal profile whose width is approximately
the font size. A graphic object, in contrast, is much
larger. The actual shape of the peak is also different
because of the higher density of black pixels in a graphics
block. Thus, the profile shapes discriminate between
text and graphical blocks.
Multi-column Text Detection: is done using the Recursive
X-Y Cuts technique proposed in [5]. It is based
on recursively splitting a document into rectangular
regions using vertical and horizontal projection
profiles alternately. A different method that allows
recognition of non-rectangular regions is also
implemented but not yet included in Drishti.
The use of horizontal and vertical projection profiles
for all the major preprocessing tasks minimizes
system complexity and allows faster processing of
documents. The preprocessing stages, except
binarization, are not enabled in the basic version of
Drishti, but are available as add-on options.
Recognition Stage
Line, Word, Character and Glyph Separation: is a
very important task, as the recognition engine processes
only one glyph at a time. In Drishti, word and glyph
separation are the key steps. Word segmentation is
done using a combination of the Run-Length Smearing
Algorithm (RLSA) [8] and Connected-Component
Labelling. Words are combined into lines using
simple heuristics based on their locations. The
performance of RLSA in accurately segmenting
words is very high on good quality text but drops in
the presence of complex layouts and tightly packed
text that is sometimes seen in magazines. However,
the difficulty in applying zoning techniques to
Telugu because of the complex orthography requires
further studies for improvement.
Words are decomposed into glyphs by running the
connected component labelling algorithm again. The
glyph separation is extremely accurate and very few
segmentation errors were found in our experiments.
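The horizontal pass of RLSA can be sketched on a single image row. This is a textbook illustration, not Drishti's code; the gap threshold is an assumed parameter that would be tuned to the inter-letter spacing.

```python
def rlsa_horizontal(row, threshold):
    # Run-Length Smearing: fill runs of white pixels (0s) shorter than
    # `threshold` that lie between black runs, so the letters of one word
    # merge into a single blob for connected-component labelling.
    out = list(row)
    n = len(row)
    i = 0
    while i < n:
        if row[i] == 0:
            j = i
            while j < n and row[j] == 0:
                j += 1
            # Fill only interior gaps bounded by black pixels on both sides.
            if 0 < i and j < n and (j - i) < threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out
```

Applying this row by row, followed by connected-component labelling on the smeared image, yields candidate word blobs.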
Recognition: is based on template matching. A glyph
database containing all the glyphs in the script was
created from high-quality laser-printed text. Each
glyph is scaled to a size of 32 x 32 pixels that forms
a template for recognition and is stored in the
database. When a document is scanned for OCR,
each glyph obtained from the glyph separation step is
scaled to the same size as the templates and
matched using fringe distance maps [3] against
each of the templates in the database. The
template with the best matching score is output as
the recognized glyph.
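One common formulation of fringe-distance matching can be sketched as follows; the exact measure used in Drishti (and its 18 experimental variants mentioned below) may differ, so treat this as an assumed, simplified version.

```python
from collections import deque

def fringe_map(img):
    # Distance (in city-block steps) from every pixel to the nearest black
    # pixel, computed with a breadth-first flood from all black pixels at once.
    h, w = len(img), len(img[0])
    INF = h * w
    d = [[INF] * w for _ in range(h)]
    q = deque()
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                d[y][x] = 0
                q.append((y, x))
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and d[ny][nx] > d[y][x] + 1:
                d[ny][nx] = d[y][x] + 1
                q.append((ny, nx))
    return d

def fringe_distance(glyph, template_map):
    # Sum of template fringe values at the glyph's black pixels:
    # zero for a perfect overlap, growing as ink strays from the template.
    return sum(template_map[y][x]
               for y, row in enumerate(glyph)
               for x, v in enumerate(row) if v)

def recognize(glyph, templates):
    # templates: {label: binary image}; return the best-matching label.
    return min(templates,
               key=lambda lab: fringe_distance(glyph, fringe_map(templates[lab])))
```

In practice the template fringe maps would be precomputed once, and the measure is often symmetrized by also scoring the template against the glyph's map.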
Drishti provides several options and parameters
that affect recognition. The default setup scales the
glyphs using a linear scaling algorithm, and matching
is performed using a fringe distance map. Linear
scaling is fast but suffers from problems with complex
shaped glyphs at large font sizes and with small glyphs
at small font sizes. Non-linear normalization was
shown to improve performance [7] by selectively
scaling regions of low curvature. Non-linear
normalization, provided as a user option, gives better
performance on the Hemalata and Harshapriya
fonts of C-DAC. Punctuation marks, which are
easily distorted because of their small sizes, are
handled separately without using template
matching. Recognition accuracy is very high for
punctuation marks using a location and stacking
based heuristic developed for Drishti [2].
There are also several ways to modify the basic
fringe distance measure to reflect the
idiosyncrasies of the Telugu script.
Experimentation was done on 18 distance measures
for matching, including 6 new measures, and
the best was chosen for recognition. Details on
the overall recognition process and modifications
for improving recognition accuracy may be found
in [1,2,6,7].
Output: is written into a file. Information about the
location of the glyph with respect to the text baseline,
the type of the glyph, i.e., whether it is a base, maatra,
vottu or punctuation glyph, and the recognized symbol
code are written into the file. Also, there is a facility
to output the k-best matches.
2.1.2 Postprocessing Stage
Assembling Glyphs into Syllables: is one of the most
challenging tasks of Drishti. The complex
orthography of Telugu permits glyph placement all
around the base character, and finding syllable
boundaries is a non-trivial task. Currently, Drishti
uses the relative positions and types output from the
recognition stage in conjunction with a stacking
heuristic to identify syllables. The heuristic works
correctly except in the case of certain large vottus,
with an error rate of about 2%.
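The spirit of such a stacking heuristic can be sketched in a drastically simplified form. The glyph types and grouping rule below are illustrative only; Drishti's actual heuristic also uses the glyphs' relative positions, which this sketch ignores.

```python
# Illustrative glyph type labels, not Drishti's actual codes.
BASE, MAATRA, VOTTU, PUNCT = "base", "maatra", "vottu", "punct"

def assemble_syllables(glyphs):
    # glyphs: list of (type, symbol) in reading order. A new syllable starts
    # at each base glyph or punctuation mark; maatras and vottus attach to
    # the current base - a position-free version of the stacking heuristic.
    syllables, current = [], []
    for gtype, sym in glyphs:
        if gtype in (BASE, PUNCT):
            if current:
                syllables.append(current)
            current = [sym]
        else:
            current.append(sym)
    if current:
        syllables.append(current)
    return syllables
```

The hard cases mentioned above (large vottus) arise precisely because real placement is two-dimensional, so a purely sequential rule like this one occasionally attaches a glyph to the wrong base.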
Converting Glyph Codes into ISCII: is currently done
by combining a simple table look-up method with
the type codes output by the recognition stage.
Improvements are being made to the conversion code,
which currently does not work for a limited
number of glyph/punctuation combinations and
consonant clusters. Consequently, the accuracy of
the ISCII output is at present approximately 5% - 7%
worse than that of the raw OCR output. The
improved conversion algorithm being developed is
expected to mitigate this problem.
Spell-Checker: was recently added for detecting mistakes
in the ISCII output. The current version recognizes
nearly 98% of the misspelled words (false positives
are 2%).
2.1.3 Salient Features
High Accuracy, Complete OCR System
Drishti is the first and currently the only complete
OCR system for Telugu. Currently, several
binarization algorithms (to select the best for a given
set of documents and print quality), text-graphics
separation, multi-column layouts, and skew detection
and correction are available as optional plug-ins.
The raw recognition accuracy (i.e., considering the
accuracy of the glyph codes) is currently 97%. The
accuracy of the ISCII output generated from the
Glyph Code-ISCII conversion process is currently
lower because of the errors in identifying syllable
boundaries and assigning ISCII byte codes. It has
already been improved in internal testing within the
resource centre, and the improved algorithms will be
included in Drishti very shortly.
The basic system was tested by the STQC unit of
the Ministry of Communications and Information
Technology, Government of India, and it performed
with an accuracy of about 85% - 87% at the ISCII
level, implying a raw accuracy of 93% - 95%, without
using any of the preprocessing or postprocessing
routines. On our scanner, Drishti performed with an
accuracy of 96% - 97% on test documents provided
at C-DAC, Noida in September 2002. In our
tests on a number of documents, after fine-tuning
the input images for scanner contrast variations
and other effects, Drishti consistently gives higher
accuracies. Currently, the preprocessing stage is being
tuned to adapt to scanner differences.
A Unique Collection of Powerful Library Routines
Drishti is completely modular and implemented using
a number of highly useful C-callable library functions
for each of its tasks. The result is a main() routine
that is only about 100 lines in length. The complete
OCR system can be created by linking these 100 lines
of code with the powerful, pre-compiled library
routines. The design permits changing the
functionality of the system by calling different library
functions as and when needed.
Visual Truthing Tool
The visual truthing tool, based on the proposed
benchmark standards, allows easy generation of
ground truth data (including bounding boxes) from
scanned documents. It is a very powerful addition
for the OCR community to test and improve their
systems.
2.1.4 First Workshop on Indian Language OCR
The first OCR workshop for Indian scripts was
organized by us. All the major groups in India working
on OCR technology participated. The underlying
technologies were discussed in great detail. The
various systems under development were installed
and tested during the workshop to identify the
strengths and weaknesses of various approaches.
The discussion and debate that followed helped all
the centres to make further progress.
2.1.5 Conclusion
It is now possible, using the developed tools and
library functions, to implement a working OCR
system in less than a day. The development of a
working Gujarati OCR system from scratch in
under two days, and the subsequent transfer of
technology, is a testimony to the design of Drishti.
The result of the work on the Telugu OCR system at
RCILTS (Telugu) is more than a product or a
technology. It is a powerful set of research and
technology tools and a platform that facilitates rapid
development of OCR technologies and solutions for
Telugu and other Indian languages in the future.
2.2 Tel-Spell: Spell Checker
A spell checker consists of a spelling error detection
system and a spelling error correction system. An ideal
spell checker detects all spelling errors and does not
raise false alarms for valid words. It also automatically
corrects all misspelled words. Clearly, real spell
checkers will not be able to match up to such an
ideal system. Real spell checkers may raise false alarms
for some valid words and may also fail to catch
some wrongly spelled words. Also, practical spell
checkers rarely correct misspelled words
automatically - they only offer a list of suggestions
for the user to choose from. Some spell checkers can
handle cases where an extra space has been typed in
the middle of a word or where two or more words
have been joined together into one. The performance
of a spell checker may therefore be measured in terms
of factors such as
• Percentage of False Alarm
• Percentage of Missed Detection
• Number of suggestions offered
• Whether the intended correct word is included
in the list of suggestions or not
• The rank of the intended correct word in the list
of suggestions
• Whether split and merged words are handled
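The first two factors can be computed mechanically once labelled test sets exist. A minimal sketch follows; the checker interface is an assumption for illustration, not Tel-Spell's actual API.

```python
def evaluate(checker, valid_words, misspelled_words):
    # checker(word) -> True if the word is accepted as correctly spelled.
    # False alarm: a valid word flagged as wrong.
    # Missed detection: a misspelled word accepted as valid.
    false_alarms = sum(1 for w in valid_words if not checker(w))
    missed = sum(1 for w in misspelled_words if checker(w))
    return {"false_alarm_pct": 100.0 * false_alarms / len(valid_words),
            "missed_detection_pct": 100.0 * missed / len(misspelled_words)}
```

Suggestion quality (inclusion and rank of the intended word) would be measured analogously against a list of (misspelling, intended word) pairs.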
Developing good spell checkers for Indian languages
has been a challenge. No spell checkers were available
at all for Telugu and a few other Indian languages.
Tel-Spell is the first ever spell checker for Telugu
and it includes both the spelling error detection
and correction components.
2.2.1 How do spell checkers work?
Many spell checkers store a list of valid words in the
language. A given word is assumed to be free of
spelling errors if that word is found in this stored
list. Otherwise it is presumed that the given word is
wrongly spelled. Since no dictionary can be perfect,
such a dictionary based approach is bound to produce
some false alarms and some missed detections. A large
dictionary may reduce false alarms, but it is also likely
to increase missed detections, since rarely used
words may appear more often because of typing
errors than by intention. The choice of words to
be included in the dictionary is thus critical for the
best overall performance of the spell checker.
When a misspelled word is detected, other words
from the stored list that are similar to the given word
in terms of spelling are given out as suggestions for
correction. A quantitative measure of closeness of
spelling, such as the minimum edit distance, can be
used to select the words to be included in the suggestion
list.
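A minimal dictionary-based detector with edit-distance ranked suggestions, as just described, could look like the following sketch (function names and the suggestion cutoff are illustrative):

```python
def edit_distance(a, b):
    # Levenshtein (minimum edit) distance via dynamic programming:
    # the fewest insertions, deletions and substitutions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def check(word, dictionary, max_suggestions=5):
    # Return (is_valid, suggestions ranked by closeness of spelling).
    if word in dictionary:
        return True, []
    ranked = sorted(dictionary, key=lambda w: edit_distance(word, w))
    return False, ranked[:max_suggestions]
```

For a morphologically rich language like Telugu, the "dictionary" lookup would in practice be replaced or augmented by morphological analysis, as discussed in the next subsection.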
This description of spell checkers is, of course,
highly over-simplified. The techniques used for
both detection and correction are many and varied.
One may use statistical techniques such as n-grams:
sequences of characters that do not occur, or that
occur with various frequencies, are obtained from a
training dataset to build a model of the language.
This model can then be used to detect spelling
errors. See Karen Kukich's survey for a good overview
of the various techniques.
2.2.2 Why is it difficult to build a spell checker for
Indian languages?
Indian languages in general, and Dravidian languages
in particular, are characterized by an extremely rich
system of morphology. Words in Dravidian languages
like Telugu and Kannada are long and complex,
built up from many affixes that combine with one
another according to complex rules of saMdhi. For
example, nilapeTTukooleekapootunnaaDaa?
means something like “Is it true that he is
finding it difficult to hold on to (his words/
Telugu is both highly inflectional and agglutinative.
Auxiliary verbs are used in various combinations to
indicate complex aspects. Clitics, particles and
vocatives are all part of the word. Telugu exhibits
vowel harmony - vowels deep inside a verb may
change due to changes at the boundaries of
saMdhi. External saMdhi between whole words
and compounds also occurs in the language. See
the references below for more on Telugu
morphology. Suffice it to say that Telugu is one of
the most complex languages of the world as
far as morphology is concerned.
It is therefore not practically feasible to store all
forms of all words directly in a dictionary for the
purposes of spelling error detection and correction.
At the same time, building a robust morphological
analyzer and generator is an extremely challenging
task. Developing a good spell checker for languages
such as Telugu is thus a very difficult task. No wonder
no spell checkers have been available for these
languages to date.
2.2.3 Design of Tel-Spell
Perhaps the best and most thoroughly worked out
morphological analyzer for Telugu is the one we
have developed at the University of Hyderabad over
the past 10 years or so. The system has been tested on
large scale corpora, and enhancements and
refinements have been going on for years. During
this project, a thorough re-engineering effort was taken
up and a new version was developed. The new
version is far simpler, more transparent, portable,
well documented, and conforms to standards.
Research work has also been taken up on developing
stemming algorithms for Telugu. A pure corpus based
statistical stemming algorithm has been developed.
The performance of this stemmer for the spell
checking application has been studied in various
combinations with dictionary and morphology based
approaches. See the thesis by Ravi Mrutyunjaya below
for reference.
A lot of empirical work and experimental studies
had to be conducted to arrive at the best
combination of dictionary and morphology for
the first version of the spell checker for Telugu. A
10 million word corpus developed by us has been
used to build and test the system. The
performance of the system has been found to be
satisfactory both in terms of detection and correction
of spelling errors. See the references below for more
details. Our Telugu spell checker technology has been
transferred to M/s Modular Infotech Ltd. on a
non-exclusive and non-preferential basis for
commercialization. The Telugu spell checker has also
been integrated into our AKSHARA advanced multi-
lingual text processing system.
2.2.4 Error Pattern Analysis
Large scale spelling error data has been obtained from
our 10 million word Telugu corpus. The raw corpus,
as it was typed, has been compared with the final
version after three levels of proof reading and
certification by qualified and experienced proof
readers. A number of tools have been developed to
prepare such data. A quantitative study of spelling
error patterns in Telugu is being conducted. This
will help us to build better spell checkers in future.
2.2.5 Syllable level statistics for spell checking
Since words in Telugu are long, complex and hence
too numerous, and proper morphological analysis
is difficult, it is useful to perform studies at
lower levels of linguistic units. The syllable level is
a natural choice, since writing in Indian languages
is primarily syllabic in nature. n-gram models have
been built at the syllable level. HMM models have also
been built. These models can be used to detect
spelling errors and to rank the suggestions for
correcting a given word. See the references below for
more technical details.
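A syllable-level bigram detector of the kind mentioned above can be sketched as follows; the probability floor and all names are illustrative assumptions, and a real system would add proper smoothing.

```python
from collections import defaultdict

def train_bigrams(syllable_seqs):
    # Estimate P(next syllable | syllable) from syllabified training words.
    counts = defaultdict(lambda: defaultdict(int))
    for seq in syllable_seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        model[a] = {b: c / total for b, c in nxt.items()}
    return model

def is_suspicious(seq, model, floor=0.01):
    # Flag a word whose syllable sequence contains an unseen or very rare
    # transition - a candidate spelling error.
    return any(model.get(a, {}).get(b, 0.0) < floor for a, b in zip(seq, seq[1:]))
```

The same transition probabilities can also rank correction candidates: suggestions whose syllable sequences score higher under the model are listed first.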
2.2.6 Further work
An agreement has been entered into with M/s Modular
Infotech Ltd. for further development and transfer
of spell checkers for Telugu and Kannada.
2.3 AKSHARA: Advanced Multi-Lingual Text Processor
2.3.1 Why one more word processor?
A systematic study of various available word
processors for Indian languages was performed in
order to choose the best ones for our own use here.
The study indicated that none of the available
software products were satisfactory. They were slow,
fragile and unreliable, and they broke down when the
data was large. Even very simple operations, such as
changing the font size, caused the system to crash when
the file was big. There are a number of problems, and a
detailed study convinced us that these are not merely
implementation level bugs that can be hoped to be
removed in future versions. The basic design
philosophies are faulty and short sighted. It appears
that most of the commercial packages have been
designed without thinking beyond the type-compose-
print paradigm, using computers as mere type-
writers. Most commercial packages work only
under Microsoft Windows platforms. The better
ones are a bit too costly for most ordinary users.
Adherence to standards is poor, and compatibility
across fonts, versions and different packages is a big
problem. This motivated us to start developing our
own advanced multi-lingual text processor, named
AKSHARA.
(Figure, not reproduced: AKSHARA: Spelling Error
Detection and Correction.)
2.3.2 Encoding scheme
AKSHARA encodes texts in a standard character
encoding scheme such as ISCII or UNICODE. Many
commercial systems use font encoded pages to by-pass
the character to font conversion process - itself a
complex step for which there does not seem to be any
fully satisfactory solution so far. These commercial
systems also often use proprietary, non-standard
fonts with secret encodings. The documents so created
are not texts at all. There are serious portability
constraints and, unfortunate as it is, often the only
practical way to get back some text already in
electronic form encoded in such fonts is to re-type
the whole text! AKSHARA documents are always
character encoded. Mapping to fonts is done only for
the purposes of display and printing - all other
operations are performed on character encodings.
Commercial packages use proprietary and secret
encodings using non-printable control characters for
storing the attributes. In AKSHARA, attributes are
included in an open XML style markup language called
Extensible Document Definition Language (XDL)
developed by us. This makes it easy to convert to and
from various other encoding schemes, thereby ensuring
the highest levels of portability and platform
independence.
2.3.3 Script Grammar
One of the unique features of Indian Language
writing systems is a script grammar. While “akshara”
or syllable is the basic unit of writing, these aksharas
are actually composed of more basic elements such
as vowels and consonants. Not all sequences of
such basic elements are valid and the script
grammar specifies legal combinat ions. Most
commercial packages do not seem to respect the
script grammar properly. A large percentage of errors
in the corpora developed using other tools earlier has
been found to be due to the inability of these tools
to check and strictly apply the script grammar.
What you see on the screen is not always what is
stored in the file and hence there is no way to check
and correct these mistakes by looking at the documents
on the screen.
A unique feature of AKSHARA is that it understands
the script grammar and warns you if you try to build
ungrammatical syllables. AKSHARA has been
successfully used to clean up all the corpora at CIIL.
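A script-grammar check of this kind can be sketched as below. The grammar here is a deliberately tiny, hypothetical one over a Roman transliteration; real Indian script grammars, and the character classes of ISCII/UNICODE, are far richer. The point is only that a word is accepted when it parses completely into legal aksharas.

```python
import re

# Hypothetical, highly simplified script grammar: an akshara is a
# consonant with an optional vowel sign, or an independent vowel.
# Actual grammars (for Telugu, say, with conjuncts and halants)
# are considerably richer.
CONSONANTS = "kgcjTDtdnpbmyrlvsSh"
VOWELS = "aAiIuUeEoO"
AKSHARA = re.compile(f"([{CONSONANTS}][{VOWELS}]?|[{VOWELS}])")

def is_valid(word):
    """Check that a transliterated word parses fully into aksharas;
    an editor can warn the user as soon as this check fails."""
    pos = 0
    while pos < len(word):
        m = AKSHARA.match(word, pos)
        if not m:
            return False
        pos = m.end()
    return True
```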
2.3.4 AKSHARA is Robust, Reliable and Platform Independent
AKSHARA is platform independent - you can use it
on MS Windows, Linux and many other platforms.
AKSHARA is also robust and reliable - you can
comfortably work with large documents without
worrying about arbitrary restrictions such as line lengths.
AKSHARA has been successfully used to develop a
10 Million word corpus of Telugu.
[Figure: AKSHARA - Advanced Multilingual Text Processor]
2.3.5 Advanced Text Processing Tools
AKSHARA is an advanced text processing tool -
dictionaries, morphological analyzers, spell checkers,
OCR systems, TTS systems, and text processing tools
including searching, sorting etc. are part of
AKSHARA. Several text processing tools, the Telugu
spell checker and the Telugu TTS have already been
integrated. We would also be happy to integrate any
of the other dictionaries, spell checkers, TTS etc. that
other centres may have developed. Full support for
Regular Expressions and Finite State Machines is being added.
2.3.6 AKSHARA as an email client
AKSHARA is unique in providing multilingual email
sending as well as receiving facilities. All you need
is a public email account somewhere. While many
other systems allow you to send emails, receiving
mails is not as easy. With AKSHARA there is no
longer any need to depend on any third party sites on
the Internet.
2.3.7 Developing Interactive web pages in Indian Languages
AKSHARA also enables you to develop interactive
web pages in Indian languages and English. Just use
AKSHARA and any web browser to create, edit,
modify and refine your web pages. These web pages
will work across platforms and browsers. All your
web pages will still be character encoded, not font
encoded. Thus you will really be building a long
lasting knowledge base. What is more, you can
create interactive web pages - pages into which your
users can directly type in Indian languages. All
this is made possible through our unique WILIO
technology. You will find more on WILIO below.
The web pages in our site
www.LanguageTechnologies.ac.in have been
developed using this technology. For example, you
will be able to interactively search bilingual
dictionaries from our website.
2.3.8 Availability
Wondering how much AKSHARA may cost?
AKSHARA will be available free of cost and freely
distributed. Let everybody have the basic Indian
language processing capabilities without restrictions.
2.4 email in Indian Languages
We have developed technology for composing,
sending as well as receiving emails in any combination
of English and other Indian languages. Many other
systems support only the sending of mails; receiving
mails is not as straightforward. This technology
has been integrated into our AKSHARA system. You
will only need a public email account (that supports
POP3 or IMAP protocols) somewhere. Unlike other
technologies, here there is no dependence on any
third party web sites. AKSHARA installed on your
local machine will be your email client.
2.5 WILIO: Interactive Web Pages in Indian Languages
Developing web content in Indian Languages has been
a challenge. None of the browsers understand the
ISCII character encoding scheme. We may hope that
UNICODE compatible browsers and the free availability
of UNICODE fonts will mitigate the situation to a
large extent in future. As on date, however, not all
browsers support UNICODE, UNICODE fonts are
not readily available for all Indian languages, and
the UNICODE scheme itself is still not completely
satisfactory, with revisions going on. We briefly
explore here the various alternatives people have
tried and present our own technology, which we feel
is far superior to the others.
[Figure: AKSHARA - Text Processing Tools]
[Figure: AKSHARA - email client]
2.5.1 Text as Pictures
The simplest way to ensure that every client sees
exactly what you want him or her to see is to encode
texts as images. This will work irrespective of the
platform and the particular browser the user is using,
and whether or not he or she has the required fonts.
However, a picture is worth a thousand words (or a
lot more) - both in terms of the storage and the
network bandwidth required. Clearly, this cannot
even be considered a solution, as there is no text at all.
2.5.2 Font encoded pages
We can have font encoded web pages and expect
the users to have the fonts locally available on their
machines. Unfortunately, fonts are not yet freely
available for Indian languages and most computers
in the country will have no Indian language fonts
at all. Much more importantly, font encoded pages
are not texts at all. There is no font encoding
standard, and encoding texts in proprietary fonts is
as good as encrypting them. Web sites must be viewed
as knowledge bases - long lasting and easily
maintainable. Unlike in the case of languages like
English, where character and glyph have a one to one
correspondence, Indian scripts are complex and the
mapping from characters to glyph sequences in a given
font is a complex many to many mapping. Therefore
font encoded web pages are no solution at all.
Dynamic Font Technology
One part of the problem with font encoded schemes,
namely availability of fonts on the client machine,
can be solved by using the so called dynamic font
technology. The basic idea is to send the fonts also
along with the requested documents to the client.
The pages are still font encoded, so this is not much
different from the previous method. Further, dynamic
font libraries are required - the usual fonts are not
sufficient. One needs to buy tools to prepare dynamic
font libraries, or else depend upon some other service
provider, in which case every request for a web page
will require a connection to the service provider's
web site too. Clearly, this is not a good solution.
2.5.3 Plug-in technology
Then there are plug-ins - add-on pieces of software
that do the character to font conversion on the client
machine. The plug-in will have to be downloaded
and installed only once by a client. The pages will be
character encoded. This sounds like a good solution
but it has not worked well in practice. Plug-ins are
add-ons to browser software and the browsers vary
widely in terms of the support and the details of how
they take on these add-ons. For each browser and in
fact for each version of a browser, a suitable version
of the plug-in will have to be developed. As new
browsers keep appearing in the market, new versions
of plug-ins will need to be developed too. Unless the
browser developers themselves take on the
responsibility of supporting Indian script standards,
this technology is unlikely to be accepted as a good
and permanent solution.
2.5.4 Forget the scripts, use Roman
Of course one may forget Indian scripts and encode
Indian languages in the Roman script. This provides
complete immunity from platform and browser
variations and font dependencies. Literate Indian
language users who are not comfortable with the
Indian scripts, say, non-resident Indians, will also
be able to use this technology. However, this cannot
be taken as a solution for Indian language support
for web content!
2.5.5 What is WILIO?
We have endeavored to develop a better technology,
which we call WILIO. WILIO permits standard
character encoded web pages to be viewed on any
browser and any operating system. The character
encoded pages are received by the client and mapped
to the required fonts before being displayed.
WILIO is unique in its ability to permit two-way
communication. We can develop interactive web
pages wherein the users can also type Indian
language content directly into the browser. The
required keyboard driver etc. are included in WILIO.
Thus one may prepare lessons, ask questions, allow
users to type in their responses, receive and validate
the answers, and get back to the users accordingly.
This will open up a whole new experience with
Indian language web content. The web pages in
our site have been developed using WILIO technology.
For example, look at the dictionary look-up services
at our website.
2.5.6 How WILIO works
WILIO works through a Java Applet. Browsers must
support Java; most browsers do. In case the Java
plug-in is not installed, it will be installed
automatically after getting the user's confirmation.
WILIO also requires that the fonts are locally
available. In some countries fonts are freely
available, and we hope such a day will not be far
off in India too. We are making all out efforts to
make a few fonts freely available to everybody for
non-commercial use. WILIO itself is fully integrated
into AKSHARA, and AKSHARA will be freely available
and available for free.
2.5.7 Security
One of the very useful side-effects of this technology
is document security - WILIO makes it more difficult
for people to download and print the pages.
2.6 Telugu Corpus
A large, representative corpus is the first and most
essential resource for language engineering research
and development. A corpus is essential for building
language models as well as for large scale testing and
evaluation. Special emphasis was therefore laid on
developing a fairly large corpus of Telugu language,
the language of focus in the current project.
2.6.1 Status before the year 2000
Developing corpora for Indian languages has been
more challenging than it may appear. These days most
publishers use computers at some stage or the other.
Why not simply compile such readily available
material? While it is possible to get some material in
electronic form directly from publishers, DTP centres
and websites, it must be emphasized that there are
no free fonts and the proprietary fonts used by
various groups do not stick to any standard. While
the ISCII national standard for character encoding
has been around for a long time now, most of the
documents continue to be developed using
proprietary fonts embedded in proprietary
commercial software. Thus it is not possible to simply
download such material and add it to the corpus.
Before the year 2000, corpora of only about 3
Million words were available for the major Indian
languages. These corpora were developed with the
support of the Ministry of Communications and
Information Technology, then known as the
Department of Electronics. Even these corpora were
not released to researchers for many years because
of legal and technical problems relating to copyrights.
2.6.2 How large is a large corpus?
Given the rich morphological nature of Indian
languages, it was felt that a mere 3 Million word
corpus would not be sufficient. In order to establish
this fact, we conducted a growth rate analysis of the
available corpus of Telugu. The corpus is split
randomly into equal sized parts and a type-token
analysis is performed. A "type" is a particular word
form and each occurrence of that word form
constitutes a "token". For example, "word" is a type
and there are two tokens of this type in the previous
sentence. Each part of the corpus contributes a set
of types, and the cumulative number of types is
plotted on the Y-axis against the size of the corpus
(measured in terms of the cumulative number of
tokens) on the X-axis. The resulting curve depicts the
rate at which new types appear as the size of the
corpus increases. If the curve shows signs of saturation
and tends towards the horizontal, most of the word
forms have already been obtained and adding more
corpus will not add many new word forms. As long as
the growth rate curve continues to show a high slope,
the corpus we have is insufficient and many of the
possible word forms are yet to be seen even once.
Given below are the growth rate curves for all the
major Indian languages for which corpora were
available.
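The type-token growth analysis described above amounts to the following computation. This is a sketch under the stated assumptions: the corpus is a list of tokens, shuffled beforehand for the random split into equal parts.

```python
def growth_curve(tokens, num_parts=10):
    """Type-token growth analysis: split the token stream into
    equal parts and record the cumulative (tokens seen, types seen)
    point after each part.  A curve that keeps rising steeply means
    many word forms are still unseen in the corpus."""
    part = max(1, len(tokens) // num_parts)
    seen, points = set(), []
    for i in range(0, len(tokens), part):
        seen.update(tokens[i:i + part])
        points.append((min(i + part, len(tokens)), len(seen)))
    return points
```

Plotting the returned points (tokens on the X-axis, types on the Y-axis) gives exactly the saturation curve discussed in the text.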
2.6.3 Dravidian Languages vs Indo-Aryan Languages
The distinction between Dravidian languages and
Indo-Aryan languages is striking in this figure - there
are many more word forms (types) in Dravidian
languages than in the other Indian languages. While
150,000 to 200,000 word types should give
excellent coverage for the northern languages,
Dravidian languages such as Telugu, spoken mainly
in the southern parts of India, require a much larger
number of word forms. More importantly, the
available corpus is not sufficient even to get a clear
idea of how many words there are in the language.
The morphology of these languages is so rich that no
one so far has an idea of how many different word
forms there are in the language. One well known
linguist has argued that each verb root in Telugu can
give rise to as many as 200,000 different inflected/
derived word forms (Ref. Dr. G. Uma Maheshwara
Rao, personal communication).
What this shows is that techniques which work
well for Indo-Aryan languages may not be applicable
to Dravidian languages. For example, it would be
possible to simply list all forms of all words and use
this in a dictionary based spelling error detection
and correction system for Hindi, Punjabi or Bengali,
but such an approach cannot be expected to produce
comparable performance results for, say, Telugu or
Kannada. It would thus not be proper to make
outright comparisons of the performance of language
engineering products across these classes of
languages. The inherent complexity of the languages
must be factored in when making any comparative
judgements of performance.
2.6.4 Copy Right Issues
Given the earlier experiences of groups developing
corpora, we had to take every care to ensure that
we did not get into copyright problems. At the
same time, we realized that it was not going to be easy
for us to take over the legal copyrights from the
authors or publishers. Hence it was decided that we
would only ask for the right of electronic reproduction
and the rights for hosting selected works on our
web-site, without asking for a legal transfer of
copyrights. The original copyright holders would
continue to hold their copyrights and would be free
to sell, distribute or transfer their works to any other
party at their will.
It nevertheless took an extraordinary effort of
persuasion to get the best authors to part with their
best works for our corpus without spending a single
pie of money. It took a tremendous amount of time
and effort to convince the copyright holders that what
they and the country at large gain in the long run
will be much more than the hypothetical loss incurred
in giving us their works free of cost. A variety of
strategies and tactics had to be used, but in the end
we have been able to obtain the rights for more than
250 of the best works of the best known writers.
Add to this the works for which copyrights have
expired, and we have a list of more than 500 books.
(It will be interesting to note that the expectation of
the funding body was 10 good books!)
2.6.5 The Status
A corpus of 225 books, adding up to about 30,000 pages
and 9.25 Million words, has been completed. The corpus
includes a variety of topics and categories - newspaper
articles, short stories, novels, poetry, classical and
modern writings etc. Each of these works has been typed
in using our AKSHARA advanced multi-lingual text
processor and other such tools, subjected to two
levels of thorough proof reading by qualified and
experienced proof readers, and finally certified free of
errors. The entire corpus is encoded in the ISCII/
UNICODE character encoding standard, and an XML
style annotation scheme is used for meta information.
A growth rate analysis of types against tokens for the
12 Million word total corpus of Telugu available now
was conducted recently. The curve shown below
shows clearly that even this corpus is not sufficient -
there is no sign of saturation and the growth rate
has not reduced significantly. We still do not have
even a single occurrence of most of the word forms,
although a very large number of types has already
been collected.
2.6.6 Tools
A number of tools have been developed to build,
analyze and manage large scale corpora. Some of these
tools have also been given to other centres. A
comprehensive tool kit is being developed.
We have also developed tools for semi-automatically
decoding any unknown font with minimum effort.
Using this technique, a mapping scheme can be
developed to map the text strings encoded in the
unknown font into equivalent text strings in a
standard character encoding scheme such as ISCII
or UNICODE. In fact, several of the widely used
fonts have been decoded and we can now add more
free material, such as newspaper articles, with ease.
2.6.7 Plans
Now that our OCR system for Telugu and the Telugu
spell checker have reached a level of performance
that makes them suitable for use in content creation,
we hope to be able to develop even larger corpora of
Telugu very soon. A thorough investigation into the
spread across genres and the representativeness of the
corpus is also being carried out, so that further work
can be fine tuned accordingly despite the non-
availability per se of texts in some of the categories
in the Telugu language.
Plans for the future include various levels of
annotation. English-Telugu parallel corpus
development is also being considered.
2.7 Dictionaries, Thesauri and other Lexical Resources
Dictionaries are the most basic and essential data
resource for any language. Accordingly, we have
developed a number of monolingual and bilingual
dictionaries as detailed below. The dictionaries
are available in the XML format for data exchange
and indexed cleverly for efficient search. Look-up
services are provided from our website using our unique
WILIO technology for OS and browser independent
deployment. All dictionaries are encoded in ISCII/
UNICODE standard character encoding schemes.
Apart from dictionaries, we have also developed
thesauri and, more importantly, a tool by which we
can develop a thesaurus of sorts for any language in
just a few minutes from a suitable bilingual
dictionary. Here is a summary of the dictionaries we
have with us. Some of these are already being used
by researchers in other centres.
2.7.1 C P Brown’s English - Telugu Dictionary
Status : Completed; Size : 31,000 plus; Fields : POS,
Meanings, Usage; XML? : Yes; Indexed? : Yes; Web-
enabled? : Yes.
2.7.2 C P Brown’s Telugu - English Dictionary
Status : Completed; Size : 31,000 plus; Fields : POS,
Meanings, Usage, Etymology; XML? : WIP; Indexed?
: WIP; Web-enabled? : WIP.
2.7.3 English - Telugu Dictionary suitable for
Machine Aided Translation
Status : Completed; Size : 37,500 plus; Fields : POS,
Meanings; XML? : Yes; Indexed? : Yes; Web- enabled?
: WIP.
2.7.4 Telugu - Hindi Dictionary suitable for
Automatic Translation
Status : Completed; Size : 64,000 plus; Fields : POS,
Paradigm Class, Meanings; XML? : Yes; Indexed? :
Yes; Web- enabled? : WIP.
2.7.5 English - Kannada Dictionary
Status : Completed; Size : 15,000 plus; Fields : POS,
Meanings, XML? : Yes; Indexed? : Yes; Web- enabled?
: WIP.
2.7.6 Basic Material for English Dictionary
Status : Completed; Size : 6,00,000 plus; Fields : POS,
Frequency; XML? : Yes; Indexed? : Yes; Web- enabled?
: WIP.
2.7.7 English Dictionary
Status : Completed; Size : 80,000 plus; Fields : POS,
Frequency; XML? : Yes; Indexed? : Yes; Web- enabled?
: WIP.
2.7.8 Telugu Dictionary
Status : Completed; Size : 64,000 plus; Fields : POS,
Paradigm Class; XML? : Yes; Indexed? : Yes; Web-
enabled? : WIP.
2.7.9 Kannada Dictionary
Status : Completed; Size : 12,000 plus; Fields : POS;
XML? : Yes; Indexed? : Yes; Web- enabled? : WIP.
2.7.10 Kannada Thesaurus
Status : Completed; Size : 12,000 plus; Fields :
Synonyms, POS, Sense; XML? : Yes; Indexed? : Yes;
Web-enabled? : WIP.
WIP: Work in progress
These dictionaries are all closely linked up with
corpora, morphological analyzers and generators,
spell checkers etc. Cross-validation and refinement
continue on a regular basis.
2.7.11 Tools
We have also developed a number of tools for
developing electronic dictionaries; for efficient
indexing, searching and other such operations on
electronic dictionaries; for formatting in XML or other
standards; for verification and validation; and for
web-enabling and offering web based services etc. We
would be glad to host dictionaries developed by other
centres using our platform independent and secure
WILIO technology.
2.7.12 Automatic generation of Thesaurus from
Bilingual Dictionary
We have developed a unique tool that can generate a
thesaurus of sorts for any language in just a few
minutes, starting from a suitable bilingual dictionary.
We would be glad to offer this service to any centre
that has a suitable bilingual dictionary. Thesauri are
extremely useful resources, yet non-existent for many
Indian languages; we believe this tool will prove very
useful and that its contribution will be well
appreciated in all quarters.
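The report does not spell out the algorithm behind this tool, but one plausible construction is to invert the bilingual dictionary: source-language words that share a target-language meaning are grouped as approximate synonyms. A sketch with invented sample entries:

```python
from collections import defaultdict

def thesaurus_from_bilingual(dictionary):
    """Build a rough thesaurus by inverting a bilingual dictionary:
    source words that share a target-language meaning are grouped
    as (approximate) synonyms.  This is one plausible construction,
    not necessarily the tool's actual method."""
    by_meaning = defaultdict(set)
    for word, meanings in dictionary.items():
        for meaning in meanings:
            by_meaning[meaning].add(word)
    synonyms = defaultdict(set)
    for group in by_meaning.values():
        for word in group:
            synonyms[word] |= group - {word}
    return {word: sorted(syns) for word, syns in synonyms.items()}
```

Because it needs only a single pass over the dictionary, such a construction indeed takes minutes, which matches the claim in the text.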
2.7.13 Technology for hosting dictionaries on the web
We have the unique capability to place Dictionaries
on the web for efficient, secure, platform and browser
independent services. We would be glad to host any
other dictionary developed by any other centre
through our technology from our site.
2.8 Morphology
2.8.1 What is Morphology?
Morphology deals with the internal structure of
words. Morphology makes it possible to treat words
such as compute, computer, computers, computing,
computed, computation, computerize, computerization,
computerizable and computerizability as variants of the
same root rather than as different words unrelated
to one another. Morphology makes it possible to store
only the root words in the dictionary and derive other
variants through the rules of the morphology. It helps
us to understand the meaning of related words.
2.8.2 Indian Languages exhibit rich morphology
Morphology plays a much greater role in Indian
languages because our languages are highly
inflectional. While the English verb eat gives rise to
only a few variants such as eats, ate, eaten and eating,
the corresponding verb in Telugu can give rise to a
very large number of variants. Words in Dravidian
languages like Telugu and Kannada are long and
complex, built up from many affixes that combine
with one another according to complex rules of
saMdhi. For example,
nilapeTTukooleekapootunnaaDaa? means
something like "Is it true that he is finding it difficult
to hold on to (his words/something)?"
Telugu is both highly inflectional and
agglutinative. Auxiliary verbs are used in various
combinations to indicate complex aspects. Clitics,
particles and vocatives are all part of the word. Telugu
exhibits vowel harmony - vowels deep inside a
verb may change due to changes at the boundaries
of saMdhi. External saMdhi between whole
words and compounds also occurs in the language.
See the references below for more on Telugu
morphology. One linguist puts the number of
variants for a single Telugu verb at nearly 200,000!
[G. Uma Maheshwara Rao, Personal
Communication.] The exact number of different
forms that a verb can take in a language like Telugu
is not yet clear. The growth rate analysis described
in the section on corpora clearly shows that the 12
Million word corpus available at present is not sufficient
to give us even a single occurrence of many possible
words in the language. While Indian languages
in general are morphologically richer than
languages like English, Dravidian languages
are a lot more complex. The 12 Million word
corpus of Telugu has nearly 20,00,000 different
words, and there will be many more as the growth
rate curve indicates. In contrast, the Indo-Aryan
languages have only about 1,50,000 to 2,00,000
word forms in all. Dravidian languages
including Telugu, Kannada, Malayalam and Tamil
are among the most complex languages of the
world and can only be placed alongside
languages such as Finnish and Turkish. Clearly,
there is no way we can hope to list all forms of all
words in a dictionary. We cannot build a spell
checker, for example, by simply listing all forms
of all words. Morphology is not just useful but
absolutely essential.
2.8.3 Design of Telugu Morphological analyzer
Building a morphological analyzer and generator for
a language like Telugu is thus a very challenging
task. Perhaps the only large scale system built for
Telugu is ours. Our Telugu morphological analyzer
has been built, tested against corpora and refined
over the past 10 years. This system uses a root word
dictionary of 64,000 entries and a suffix list categorized
into a number of paradigm classes. The basic
methodology is to look for suffixes, remove them
taking care of saMdhi changes, and then cross-check
with the dictionary. Inflection, derivation and external
saMdhi are all handled. See the references below for
more technical details. There is also a separate
morphological generator that can put together the
roots and affixes to construct complete word forms.
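The look-for-suffix, undo-saMdhi, check-dictionary cycle can be illustrated as follows. The root, suffix and saMdhi-repair entries here are a tiny invented sample in Roman transliteration (cheyu "to do" + -taanu "I will" gives chestaanu, the saMdhi replacing the root-final yu by s), standing in for the 64,000-entry dictionary and paradigm-classified suffix lists of the actual system.

```python
# Tiny illustrative data (hypothetical, in Roman transliteration).
ROOTS = {"cheyu"}
# suffix -> (surface tail left after stripping, underlying tail):
# undoing the saMdhi means replacing the surface tail by the
# underlying one before the dictionary lookup.
SUFFIXES = {"taanu": ("s", "yu")}

def analyze(word):
    """Strip a known suffix, undo the saMdhi change it caused, and
    cross-check the candidate root against the dictionary."""
    results = []
    for suffix, (surface, underlying) in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem.endswith(surface):
                root = stem[: -len(surface)] + underlying
                if root in ROOTS:
                    results.append((root, suffix))
    return results
```

The dictionary cross-check is what keeps over-eager suffix stripping from producing spurious analyses.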
2.8.4 Design of Kannada Morphological Analyzer
We have also developed a Kannada morphological
analyzer and generator using our own Network and
Process Model. A finite state network captures, in a
declarative and bidirectional fashion, all the affixes,
their ordering and the various combinations
permitted. The process component takes care of
saMdhi changes when affixes are added or removed.
This model makes it possible to develop a
morphological analyzer, test it against a corpus, and
then obtain a generator of comparable performance
with no extra effort, since the same network is used
both for analysis and generation. In this model, a
complete and detailed analysis is made at the level
of each affix.
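The bidirectional use of one declarative network can be sketched like this. The affix network below is a toy invented for illustration (a Kannada-style plural -galu and dative -ige in rough transliteration); the real network covers the full affix inventory, and the process component handling saMdhi is omitted here, so affixes are simply concatenated.

```python
# Toy affix network: state -> [(feature, affix, next_state)].
# An empty feature/affix pair is an epsilon arc (affix optional).
NETWORK = {
    "ROOT": [("plur", "galu", "NUM")],
    "NUM": [("dat", "ige", "CASE"), ("", "", "CASE")],
    "CASE": [],
}

def generate(root, features):
    """Generation: walk the network, appending the affix of each
    requested feature in order."""
    out, state = root, "ROOT"
    for feat in features:
        for name, affix, nxt in NETWORK[state]:
            if name == feat:
                out, state = out + affix, nxt
                break
    return out

def analyze(word, roots):
    """Analysis: the same network, traversed while consuming the
    word left to right after a dictionary root has been matched."""
    results = []
    for root in roots:
        if word.startswith(root):
            stack = [("ROOT", word[len(root):], [])]
            while stack:
                state, rest, feats = stack.pop()
                if not NETWORK[state] and not rest:
                    results.append((root, feats))
                for name, affix, nxt in NETWORK[state]:
                    if rest.startswith(affix):
                        stack.append(
                            (nxt, rest[len(affix):],
                             feats + [name] if name else feats))
    return results
```

Because both functions walk the same declarative network, refining the network improves analysis and generation together, which is the point the text makes about getting the generator "with no extra effort".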
2.8.5 Tool for developing Morph systems for other languages
Morphological analyzers and generators for several
languages including Kannada, Tamil, Oriya etc. have
been built using this Network and Process model.
That a good Tamil morphological analyzer and
generator could be built within a week using this
system is a testimony to the quality of design and
implementation of the system. See references below
for more details.
As we build larger and more representative
corpora, further refinements to the dictionaries as
well as the morphological analyzers and generators
will continue.
2.9 Stemmer
As the above section on Morphology shows, it is
very difficult to build a high performance analyzer or
generator for Dravidian languages such as Telugu. An
alternative short-cut approach that can be used in
practice is stemming. Here a complete and detailed
morphological analysis is not performed. Instead, the
affixes are removed to obtain the root. For example,
the common prefix in the words compute, computer
and computing is comput and hence all these word
forms are reduced by removing the affixes to the
common stem comput. Note that comput is not a valid
linguistic unit at all. Yet, such stemming techniques
are useful and have been used in many areas
including Information Retrieval. Stemming can also
be used as the second line of defence when
morphology fails.
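A stemmer of the kind described can be caricatured in a few lines: strip the longest known suffix while keeping a minimum stem length. The suffix list below is an invented sample; the actual system derives its suffixes statistically from the corpus and additionally handles vowel changes and gemination.

```python
# Invented sample suffixes in Roman transliteration; the real list
# is induced statistically from the corpus.
SUFFIXES = sorted(["lu", "ni", "ku", "lanu"], key=len, reverse=True)

def stem(word, min_stem=3):
    """Longest-match suffix stripping.  The resulting stem, like
    'comput' for English, need not be a valid linguistic unit."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

The `min_stem` guard is what prevents short words from being stripped down to nothing, one of the classic failure modes of naive stemmers.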
A thorough study of various stemming techniques
has been conducted. Ingenious corpus based
statistical stemming techniques have been developed
for stemming in Telugu. Vowel changes, gemination
etc. need to be taken care of in building a stemmer
for Telugu. The stemmer has been compared with
the full morphological analyzer, and various
combinations have been tried out for the purpose
of spelling error detection and correction in Telugu.
See the references below for more technical details.
2.10 Part of Speech Tagging
2.10.1 What is POS tagging?
A dictionary lists all possible grammatical categories
for a given word. The job of a Part of Speech
(POS) tagger is to identify the correct POS for a
given word in context. For example, the word thought
is a verb in the first of the following sentences and a
noun in the second: "I have thought about it from
various angles. Suddenly this strange thought came
to my mind." POS tagging may be at the level of
gross grammatical categories such as verbs and
nouns or, more often, at a more fine grained level
of sub-categorization.
2.10.2 POS tagging techniques for English and
Indian languages
In English like positional languages, the category of
a word can be determined in terms of the categories
of t he preceding words. As such Hidden Markov
Models have been widely used. There are several
other techniques too. However, as far as Indian
languages are concerned, many of these sequence
oriented techniques are not very much applicable.
Our languages are characterized by free word order
and hence it does make much sense to depend so
much on previous or following few words. Instead,
our languages are characterized by a very rich system
of morphological inflection and it is here that we get
maximum information about the correct part of
speech of a word. The percent age of words t hat
occur in some inflected form rather than in the bare
stem form is far more for Indian Languages as compared
to English. Morphology holds the key for POS tagging
of Indian languages. In fact one may even go a step
further and argue that a POS-tagged corpus does not
make much sense. Whenever you process some text,
you will need to perform morphological analysis and
the job of POS tagger will be done there too.
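To illustrate the idea that inflectional endings carry much of the POS information, here is a minimal sketch of a suffix-driven tagger. The suffix table is a hypothetical toy with English-like endings; the actual systems described in this report rest on full morphological analysis, not a lookup table like this.

```python
# Illustrative sketch only: a tiny suffix-driven POS guesser.
# The suffix table is a hypothetical toy (English-like), not
# the project's actual morphological analyzer.
SUFFIX_TAGS = [
    ("ation", "NOUN"),
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ly", "ADV"),
]

def guess_pos(word, default="NOUN"):
    """Return a POS guess based on the longest matching suffix."""
    for suffix, tag in sorted(SUFFIX_TAGS, key=lambda pair: -len(pair[0])):
        if word.endswith(suffix) and len(word) > len(suffix):
            return tag
    return default
```

A real morphology-based tagger would of course segment the stem and analyze the full inflectional paradigm rather than match raw suffix strings.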
However, developing robust morphological
analyzers for Indian languages in general, and
Dravidian languages in particular, has been a difficult
challenge. The performance of any POS tagger
based on morphology would be limited by the
performance of the morphological analyzer itself.
2.10.3 Degree and nature of lexical ambiguities
A systematic study of the degree and nature of lexical
ambiguities at the dictionary and corpus levels is
being conducted. Appropriate technologies for POS
tagging based on morphology are being developed.
The percentage of words that occur in some inflected
form rather than in the bare stem form is far more
for Indian Languages as compared to English. This
has serious implications for the degree and nature of
lexical ambiguities in running texts.
2.10.4 HMM system for POS tagging
In order to gain deeper understanding of POS
tagging for various languages including English, a
Tri-tag based HMM model has also been built and
tested on the SUSANNE corpus.
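For readers unfamiliar with HMM tagging, here is a minimal sketch of Viterbi decoding. The report's system is tri-tag (trigram) based; a bigram version is shown for brevity, and all probabilities below are hand-set toy values, not numbers from the actual system.

```python
# Toy bigram-HMM POS tagger: all probabilities are illustrative hand-set
# values, not estimates from any real corpus.
TAGS = ["PRON", "AUX", "VERB", "NOUN", "DET"]
START_P = {"PRON": 0.4, "DET": 0.4, "AUX": 0.05, "VERB": 0.05, "NOUN": 0.1}
TRANS_P = {
    "PRON": {"AUX": 0.5, "VERB": 0.4},
    "AUX": {"VERB": 0.7},
    "VERB": {"DET": 0.3, "NOUN": 0.2},
    "NOUN": {"VERB": 0.6},
    "DET": {"NOUN": 0.8},
}
EMIT_P = {
    "PRON": {"I": 0.5},
    "AUX": {"have": 0.5},
    "VERB": {"thought": 0.3, "came": 0.3, "have": 0.05},
    "NOUN": {"thought": 0.2},
    "DET": {"this": 0.5},
}
FLOOR = 1e-6  # smoothing for unseen words

def viterbi(words):
    """Return the most likely tag sequence for `words` (Viterbi decoding)."""
    V = [{t: START_P[t] * EMIT_P[t].get(words[0], FLOOR) for t in TAGS}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in TAGS:
            prev = max(TAGS, key=lambda p: V[i - 1][p] * TRANS_P[p].get(t, 0.01))
            V[i][t] = (V[i - 1][prev] * TRANS_P[prev].get(t, 0.01)
                       * EMIT_P[t].get(words[i], FLOOR))
            back[i][t] = prev
    tag = max(TAGS, key=lambda t: V[-1][t])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return list(reversed(path))
```

Note how the ambiguous word thought is resolved by context: after an auxiliary it decodes as a verb, after a determiner as a noun.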
2.11 VIDYA: Comprehensive Toolkit for Web-Based eLearning
2.11.1 eLearning
eLearning helps to overcome the barriers of distance
and time in learning. Thrust is on learning, not
teaching. Students can thus learn whatever they like,
in whatever order they please and at a pace that is
best for them. Instead of teachers, there will only
be facilitators.
There are many tools for eLearning. Only some
of them are comprehensive tools that provide for
the entire gamut of facilities from pre-registration
counseling to maintenance of the alumni database.
Good ones are very costly and most educational
institutions in India will not be able to afford to
buy such tools. Many developments are taking place
in this area but the major usage so far has been
limited to corporate training in Information
Technology and related areas. The focus is on
serious adult learners only. Profit seems to be the main
motive in many cases.
2.11.2 Web Based Education - Technology for Quality
Our viewpoint is very different. Traditional
education is more teacher centric whereas eLearning
is learner centric. The idea is not to choose one or
the other but the right combination of both. Not all
students are mature and serious enough to learn on
their own. Teachers are required to guide, instill
confidence and to inspire. The need is to look at
education at all levels in a holistic sense.
The primary issue in question is quality of education.
We produce a very large number of BScs and MScs but
very few scientists. We produce a very large number of
BE and BTech degree holders but very few engineers.
The situation at the primary school level is worse. There are several problems
and not all of them can be solved through technology.
The question is what is it that technology can do to
ensure quality education to everybody?
How do we ensure highest quality of education at
all levels without barriers of distance and time? Good
teachers are not always available. Distance and time
are not the only barriers. Cost and language are
bigger and more serious barriers. The ultimate
objective should therefore be to reach out to all
interested students and offer the highest quality of
education without any kinds of barriers - distance,
time, cost or language. Here is where technology
can bring the services of best teachers, best course
materials to every student in a cost effective manner.
Our technologies must be Indian language enabled.
2.11.3 VIDYA - a comprehensive suite of tools
Given this scenario, we started developing a
comprehensive suite of tools for web based education.
Our suite is called VIDYA. Indian languages can be
supported. Interactive web content can be created
using our WILIO and AKSHARA technologies.
VIDYA supports inter-student and student-teacher
interaction through email, chat, discussion rooms
and whiteboards. It encourages collaborative
problem solving and group activities.
VIDYA has the unique facility to link with auxiliary
servers for extra support, such as for laboratories.
For example, you may use VIDYA to do
programming in Java, C++, C, and Perl
without the need for these compilers on your
machines. Coding, editing, compiling, executing and
archival are all supported. Similarly, science labs
and language labs can be developed. Appropriate
use of multi-media and learning by doing makes
learning a pleasure and has much greater impact
than reading textbooks and listening to classroom
lectures.
VIDYA supports a wide variety of testing, evaluation
and reporting facilities. Adaptive testing, navigation
control, timing etc. are supported. A full range of
question types, including multiple choice, short
answer and essay type questions, is permitted.
2.11.4 Status and Plans
VIDYA has been installed in several centres
including CIIL, Mysore. VIDYA is being regularly
used in University of Hyderabad for teaching courses
at MCA and MTech levels. It has also been used for
offering special courses to reputed industries. A
recent study has shown that it is suitable for
deployment in our distance education programme.
VIDYA could be used by schools, colleges,
universities, research laboratories etc. for regular
education, continuing education, part-time courses,
in-house training etc. Suitable material can be
developed and shared with others so as to maximize
the impact. In particular, language teaching material
already developed or being developed by various
centres can be linked with VIDYA to enable
various classes of language learners to get maximum
benefit. We would also be glad to enter into
agreements for further collaborative development
of the tool itself.
2.12 Grammars and Syntactic Parsers
2.12.1 Computational Grammars for Indian Languages
There are no large scale computational grammars for
any of the Indian languages. Computational grammars
and syntactic parsers are very much required for taking
Indian languages beyond the type-compose-print
paradigm that is holding back the country from
growing beyond using computers as some kind of
type-writers. All language engineering applications
including machine translation, information retrieval,
information extraction, automatic categorization
and automatic summarization would greatly benefit
from syntactic parsers.
2.12.2 UCSG system of Syntax
The UCSG system of syntax was developed by us
to place positional languages such as English on an
equal footing with Indian languages that are
characterized by relatively free word order. A careful
study of both the Western grammar formalisms and
the paaNinian approach to syntax showed that none
of these would be equally suitable for positional
and free word order languages and hence a new
formalism had to be developed. A computational
grammar and parser have been developed for English
and demonstration level systems have also been
developed for Telugu and Kannada. UCSG uses a
combination of Finite State Machines, Context Free
Grammars and Constraint Satisfaction to achieve the
best overall performance. Grammars become simple
and easy to write, parsers become computationally
very efficient and the same basic framework works
for English and other Indian languages. UCSG works
from whole to part, rather than from left to right.
See the references below for more details.
Further development of the UCSG English parser is
going on. A much larger and more informative
dictionary has been built based on the analysis of
large scale corpora. A combination of linguistic and
statistical models is being used to enhance the
coverage and robustness of the system. Plans
include the development of computational
grammars for Telugu and other Indian languages.
2.12.3 Robust Partial Parsing
It is now well recognized that full syntactic parsers
are difficult to build. Hence there is increased
interest in robust but shallow or partial parsing. An
extensive study of parsing technologies has been made
and efforts are on to build a large scale robust partial
parsing system for English. Efforts are also underway,
in collaboration with linguists from CIIL to develop
computational grammars and shallow parsing systems
for Indian languages.
2.13 Machine Aided Translation
Automatic or Machine Translation is one of
the most widely known applications in language
engineering. It has been recognized very well that fully
automatic high quality translation in open domains
is difficult to achieve. Either restricted domains of
applications with controlled language usage must
be considered or the translation process has to be
semi-automatic, the man and the machine doing what
they are good at and seeking help from each other in
other areas. Even with such restrictions, ensuring
quality of translation is a very challenging task.
Language is rich and varied in structure as well as
meaning. Since the machine cannot be expected to
“understand” the meaning of the given source
language text in any real sense, the output of the machine
can at best be good, calculated guesses. The situation
is further complicated by the fact that expectations
of users are very high when it comes to translation.
2.13.1 English to Kannada Machine Aided
Translation System
A Machine Aided Translation system was developed
here for the Government of Karnataka for
translating budget speech texts from English to
Kannada. English text is pre-processed and segmented
into sentences. Each sentence is syntactically parsed
using our UCSG English parser. The parsed
sentences are translated to Kannada in a whole-to-
part fashion using the bilingual English-Kannada
dictionary and Kannada morphological generator
developed by us. There is a powerful post processor
that is tightly integrated with the dictionary,
thesaurus, morphology and the translator. A full 150
page text is parsed and translated in just a couple of
minutes on a desktop PC. The output is post-
edited and then sent for final proof reading.
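The pipeline just described (segment, parse, translate whole-to-part, generate) can be sketched as a chain of stages. The stage functions below are hypothetical stand-ins for the real components, namely the UCSG English parser, the bilingual English-Kannada dictionary and the Kannada morphological generator.

```python
import re

# Hypothetical sketch of the MAT pipeline. The injected stage functions
# stand in for the real parser, bilingual lexicon and morphological
# generator; any concrete implementations can be plugged in.
def segment(text):
    """Naive sentence segmentation on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s.strip() for s in parts if s.strip()]

def translate_document(text, parse, transfer, generate):
    """Segment, parse, transfer and generate, sentence by sentence."""
    return [generate(transfer(parse(sentence))) for sentence in segment(text)]
```

In the real system the output of this chain then goes through the tightly integrated post processor before final proof reading.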
Although this was a very short project with a very
low budget, it has demonstrated the merits
of a well designed and well engineered product.
There are several unique and powerful features in
this system. This system has become a very good
technology demonstrator and has inspired a lot of
serious work in various directions by several groups
across the country. See our website for more details.
The success of the MAT system for English to
Kannada translation has inspired further work on
dictionaries, morphology of Kannada, robust parsing
and word sense disambiguation. Research work has
also been taken up on a number of other specific
topics such as corpus based machine learning
techniques for sentence boundary identification.
The proposed MAT2 system aims to combine
the best of linguistic theories, corpus based machine
learning algorithms and human judgement based on
world knowledge and commonsense to achieve
high quality translations in a semi-automatic setup.
A very useful by-product of this exercise will be a
high quality POS and sense tagged, parsed, aligned,
parallel corpus.
2.14 Tools
We have also developed a number of tools over the
past many years for our own use and some of these
tools could be useful to other groups as well. In fact
some of these tools have already been given to other
resource centres. Here we list some of the important
tools developed by us. It may be noted that not all
these tools were developed within the period of, or
with the support of, this specific project.
2.14.1 Font Decoding
Unlike English, Indian scripts are syllabic in nature.
The units of writing are akshara’s or syllables. The
total number of possible syllables is very large. Thus
fonts are developed using shape units called glyphs
which need to be composed to form complete
syllables. The mapping from syllables to glyph
sequences is complex. There is a proliferation of
non-standard and proprietary fonts. In fact there is
no font encoding standard as yet for Indian languages.
Thus documents encoded in some unknown font are
exactly like a coded message - one needs to decode
them before they start making any sense. Only
documents encoded in a standard character encoding
scheme such as ISCII or UNICODE can be
considered as text. Font encoded documents are not
texts at all. However a large number of documents
are available only in font encoded forms. Some
companies in fact use this as a means of achieving
some degree of security for their documents.
We have developed a set of tools through which
we can decode any unknown font and map it onto a
standard character encoding scheme such as ISCII
or UNICODE. This is a semi-automatic and
iterative process. With this tool, it is now possible to
decode any unknown font and we hope this would
encourage commercial companies to become more
open and follow standards instead of pursuing myopic,
proprietary, and restrictive practices.
Some of the other centres have developed direct
mappings from one font to another. We believe that
the best way to handle font-to-font variations is to
go through a standard character encoding scheme.
All documents must be encoded and processed in a
character encoding scheme and mapping to fonts is to
be used only for the purposes of display and printing.
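Once a glyph table has been recovered, the decoding step itself can be sketched as longest-match lookup over glyph-code sequences. The table below is for an imaginary proprietary font and the outputs are given in Roman transliteration for readability; a real tool would emit ISCII or UNICODE, and would recover the table semi-automatically as described above.

```python
# Hypothetical glyph table for an imaginary proprietary font. Keys are
# glyph-code sequences; values are Roman transliterations of the decoded
# syllables (a real tool would emit ISCII or UNICODE).
GLYPH_TABLE = {
    (0x41,): "ka",        # lone consonant glyph
    (0x41, 0xE2): "ki",   # consonant glyph + vowel-sign glyph = one syllable
    (0x42,): "ta",
}
MAX_GLYPHS = max(len(key) for key in GLYPH_TABLE)

def decode(codes):
    """Greedy longest-match decoding of a font-encoded code sequence."""
    out, i = [], 0
    while i < len(codes):
        for n in range(min(MAX_GLYPHS, len(codes) - i), 0, -1):
            chunk = tuple(codes[i:i + n])
            if chunk in GLYPH_TABLE:
                out.append(GLYPH_TABLE[chunk])
                i += n
                break
        else:
            out.append("?")  # unknown glyph code
            i += 1
    return "".join(out)
```

The longest-match rule is what lets a consonant-plus-matra glyph pair decode as one syllable rather than two unrelated units.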
2.14.2 Web Crawler for Search Engine
A web crawler searches the whole web and builds
up an index of web pages that is structured and
classified in a way that enables search engines to search
the web efficiently. We have developed a basic web
crawler through which such an index can be built.
This tool can also be used to download whole web
sites, for archival of web sites etc.
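The crawling loop can be sketched as a breadth-first traversal over links extracted from each fetched page. This is an illustration, not the actual crawler; the fetch function is injected so the sketch stays self-contained and testable offline.

```python
# Minimal breadth-first crawler sketch. `fetch(url)` is injected and must
# return the page HTML; a real crawler would use HTTP, robots.txt checks,
# politeness delays, etc.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Visit pages breadth-first starting from `seed`; return visit order."""
    seen, visited = {seed}, []
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page: skip it
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

The visit order produced here is exactly the index-building order; archival of a whole site is the same loop restricted to one domain.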
2.14.3 PSA: A Meta Search Engine
A search engine searches the web for the documents
users request through a short query. There are several
good search engines but there is no single search engine
that is ideal in all cases. A Meta Search Engine accepts
a user query, fires search engines, obtains the results
and presents them to the user. Personal Search Assistant
(PSA) is one such meta search engine designed and
developed by us here.
PSA accepts user queries, formats them in the
manner required for various search engines and fires
the search engines accordingly. PSA can currently
handle up to eight different search engines
simultaneously. It is possible to work in the
background mode so that users do not need to sit
in front of the machine and wait for results. Status
can be checked at any given point of time. It is
possible to monitor the network load and adjust
accordingly. Results are collated, duplicates removed
and stored in a local database where required. Unlike
a search engine, a meta search engine can reside on
local machines and can be customized to suit
individual requirements. Some work has been done
on personalization of PSA. PSA has been in use within
the University of Hyderabad for several years now.
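The fire-in-parallel, collate-and-deduplicate flow described above can be sketched as follows. The engine functions are hypothetical stand-ins: in PSA itself each adapter formats the query for a particular remote search engine and fetches its results.

```python
# Sketch only: each "engine" is an adapter function that takes a query and
# returns a list of result URLs. Real adapters would format the query for
# a particular search engine and fetch results over the network.
from concurrent.futures import ThreadPoolExecutor

def meta_search(query, engines, max_workers=8):
    """Fire all engines in parallel, then collate and de-duplicate results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

Because the engines run in worker threads, the caller can continue with other work and collect the merged list later, which is the "background mode" idea described above.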
Plans include Indian language support for PSA.
Queries can then be posed in Indian languages.
Specialized versions of the web crawler can also be
developed to locate web pages in Indian languages
or web pages relating to India.
2.14.4 Corpus Analysis Tools
A number of tools have been developed for corpus
analysis. Some of these tools are being used by other
centres as well. It is planned to organize these tools
in the form of a toolkit so that it will become more
convenient for others to use them.
2.14.5 Website development Tools
We have a whole range of tools, including our
AKSHARA and WILIO systems, to develop interactive
web sites in Indian languages. For example, our
History-Society-Culture portal has nearly 500 pages
of Indian language content developed and hosted
through our technology. Interactive lookup services
of our dictionaries also use these tools.
2.14.6 Character to Font mapping Tools
As has been noted elsewhere in this report, mapping
between characters and fonts is a non-trivial process
in Indian languages. Some have used table look-up
method while others have used hand crafted rules.
None of the systems seems to be satisfactory.
Completeness, consistency, robustness, efficiency,
transparency, extensibility, ease of development
are some of the desirable features of such mapping
systems. Given this scenario we have explored the
possibility of developing good mapping systems using
the Finite State Machine technology. Suitable
extensions to the basic technology are proposed for
the purpose.
2.14.7 Dictionary to Thesaurus Tool
Give us any bilingual dictionary and we can give
you a kind of a thesaurus in a couple of minutes.
Our tool can do a clever reverse indexing on any
bilingual dictionary to identify closely related words
for any given word. While this is not exactly a
thesaurus in the technical sense, the basic idea is the same
- given one word, to identify other words in the
language that are closely related to it in meaning.
The Kannada thesaurus so developed has been
demonstrated and has been judged to be very useful.
Some researchers are already using this system and
there are plans to do more work on Kannada and
Telugu thesauri. We will be glad to develop thesauri
for any other language, given a suitable bilingual dictionary.
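The reverse-indexing idea can be sketched as follows: two source-language words are taken as related if they share a target-language equivalent in the bilingual dictionary. The dictionary used in any example run is illustrative only; the real tool works over full-scale bilingual dictionaries.

```python
# Sketch of thesaurus construction by reverse-indexing a bilingual
# dictionary: bilingual maps each source word to its target-language
# equivalents; words sharing an equivalent are treated as related.
from collections import defaultdict

def build_thesaurus(bilingual):
    reverse = defaultdict(set)          # target word -> source words
    for src, targets in bilingual.items():
        for tgt in targets:
            reverse[tgt].add(src)
    related = defaultdict(set)          # source word -> related source words
    for sources in reverse.values():
        for word in sources:
            related[word] |= sources - {word}
    return related
```

As the text notes, this gives clusters of near-synonyms rather than a thesaurus in the strict technical sense, but the grouping is obtained in essentially one pass over the dictionary.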
2.14.8 Dictionary Indexing Tools
Dictionaries need to be indexed for efficient
access. We have developed clever indexing schemes
for efficient indexing of large dictionaries on any
computer. Combinations of TRIE indexing, Hashing,
B-trees, AVL Trees etc. are used.
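As an illustration of one of these schemes, here is a minimal TRIE index supporting exact lookup and prefix enumeration; the production indexes combine several such structures, as noted above.

```python
# A minimal TRIE index: one node (a dict) per character. The "$" key is a
# sentinel marking end-of-word and storing the entry; this assumes
# headwords never contain the "$" character.
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word, entry):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = entry

    def lookup(self, word):
        """Return the stored entry for `word`, or None if absent."""
        node = self.root
        for ch in word:
            if ch not in node:
                return None
            node = node[ch]
        return node.get("$")

    def entries_with_prefix(self, prefix):
        """Return all entries whose headword starts with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        entries, stack = [], [node]
        while stack:
            n = stack.pop()
            for key, child in n.items():
                if key == "$":
                    entries.append(child)
                else:
                    stack.append(child)
        return entries
```

Lookup cost depends only on headword length, not on dictionary size, which is what makes TRIEs attractive for large dictionaries.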
2.14.9 Text Processing Tools
A number of text processing tools for working
with word lists, dictionaries, etc. have been developed
over the past many years. These tools have been
found to be very useful for linguists and
lexicographers too. Some of these tools have been
integrated into AKSHARA.
2.14.10 Finite State Technologies Toolkit
Finite State technologies are increasingly being used
in language engineering as they are simple yet very
efficient. A full toolkit has been developed and tested
on large scale data. Now it is possible to work with
Regular Expressions, NFA, DFA etc. and perform all
the usual operations without having to write any
program code. It is planned to integrate these tools
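One such operation - running an NFA directly on input strings without writing per-pattern program code - can be sketched as set-based simulation. The example machine is a toy that accepts strings over {a, b} ending in "ab"; the actual toolkit covers Regular Expressions, NFA, DFA and the usual conversions between them.

```python
# Toy NFA: accepts strings over {a, b} that end in "ab". The transition
# table maps (state, symbol) to the set of possible next states.
TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}

def nfa_accepts(transitions, start, accepting, text):
    """Simulate the NFA by tracking the set of currently reachable states."""
    current = {start}
    for symbol in text:
        current = set().union(*(transitions.get((state, symbol), set())
                                for state in current))
    return bool(current & accepting)
```

Tracking the whole set of reachable states at once is what makes the simulation run in time linear in the input length, with no backtracking.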
3. Services and Knowledge Bases
3.1 On-Line Literature
Telugu has a very rich literary tradition dating back
to around the 10th Century. In today’s busy world where
reading habits seem to be declining, here is an
attempt to bring some of the best works of Telugu
to your doorstep. Making literature available on-
line means making it available anywhere, anytime.
You will not need to go to a book store or otherwise
order and buy a book. This service is absolutely
free of charge and so you spend no money.
More than 550 of the best works in Telugu have been
enlisted. We have obtained, by a great deal of
convincing effort, the rights of electronic
reproduction and web enabling of these works from
the respective copyright holders. Not a single pie
was spent to obtain the rights though. About 225 of
these books have been converted into electronic form
and checked by a three stage proof reading and
certification process by qualified, experienced and
professional proof readers. A panoramic selection of
these works will be made available from our web site
through our unique WILIO technology that
guarantees platform and browser independence as
well as some degree of security. It is planned to
add Roman transliterated versions too for the benefit
of those who know Telugu language but are not
very comfortable with the Telugu script.
3.2 History-Society-Culture Portal
India is a uniquely pluralistic society that is the
home of many religions, traditions and cultures. Here
is an attempt to bring to you nearly 500 pages of
authentic material on the society and culture of the
Telugu people and, indirectly, on their history too. Get
to know more about temples, music, dance, folk arts
and many more items. Color photographs are included.
All the pages are available through our unique WILIO
technology that works across operating systems and
web browsers. Roman transliterations are also provided
for those who would have difficulties in reading the
Telugu script. Visit www.LanguageTechnologies
3.3 On-Line Searchable Directory
Given the importance of networking of individuals
and organizations with overlapping interests, it was
decided to develop an on-line searchable directory
of people and organizations interested in various
aspects of language technology for Telugu. More
than 1200 relevant entries have been developed
and cross checked. The directory is available on
line. A flexible search facility has been included.
Kindly visit
3.4 Character encoding standards, Roman
Transliteration Schemes, Tools
There is quite a bit of confusion in the country
about the exact nature of character encoding
schemes, fonts, rendering engines, character to font
mapping schemes etc. Many hasty decisions are
sometimes being taken without a full and in-depth
understanding of all the issues concerned. Hence a
detailed article was written about the issues involved
in character encoding schemes and related issues. A
version of the article has been published in the
Vidyullipi journal.
3.5 Research Portal
Research and development requires time, effort,
money and other resources but as far as Language
Engineering in India is concerned, the most
important resource required is adequate trained
manpower. Language Engineering is a highly multi-
disciplinary field - it borrows from such diverse
disciplines as Linguistics, Psychology, Philosophy,
Logic, Artificial Intelligence, Cognitive Science,
Computer Science, Mathematics, Statistics and
Physics. Clearly, there are no experts who know
all these areas very well. This multi-disciplinary
nature of the subject makes it so much more
difficult to create quality training materials, books
etc. that are understood by people across so many
disciplines. Indian languages are also characterized by
certain unique features compounding the difficulties
in developing trained researchers and developers.
With this in mind, we have developed research
portals in selected areas of Language Engineering. In
one place, you will find basic and introductory
material, tutorial and survey papers, a classified and
structured collection of a large number of relevant
research papers, pointers to people, departments,
institutions, conferences and other regular events and
so on. This would substantially reduce the time and
effort needed by newcomers to these research areas.
We could also develop these portals for news,
discussion and debate, collaborative development etc.
3.6 VAANI: A Text to Speech system for Telugu
Text-to-Speech systems convert given text into speech
form. Thanks to the maturity of the techniques and
availability of required tools, it is now possible to develop
minimal TTS systems in months. VAANI, our TTS
system for Telugu, is one such attempt. A di-phone
segmentation based approach has been used. Phonemes
in the language are identified and for each ordered pair
of phonemes, called di-phones, example words are
recorded in speech form. From this raw data, di-
phones are segmented using available tools. It would
then be possible to process this raw data further and
develop a database. To produce speech from any given
text, the text is initially parsed. Certain pre-processing
steps are essential to handle numerals, homographs etc.
Then the text is segmented into diphones and the
corresponding speech units are concatenated to produce
speech output. Segmenting at di-phone boundaries
gives better continuity since the variations due to co-
articulation effects are least in the middle of a
phoneme as compared to its ends. Since the number
of phonemes in a language is usually a small and
closed set, this technology also leads to unlimited
vocabulary TTS technology. VAANI is thus an
unlimited vocabulary, open domain TTS system for
Telugu. The system has been tested for intelligibility
both directly and across telephone lines. Prosodic
features such as duration, pitch and intonation can be
added to make the sounds more natural. A substantial
amount of research has also been done on other
competing technologies and we would be able to
deliver high quality unrestricted TTS systems in future.
VAANI has already been integrated into our AKSHARA -
Advanced Multi-lingual Text Processing system.
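The synthesis step described above can be sketched as diphone segmentation followed by concatenation of the stored units. The waveforms here are stand-in lists of samples and the diphone inventory is hypothetical; real diphone databases are built from segmented recordings as described.

```python
# Sketch of diphone concatenation. "#" marks silence at utterance
# boundaries; waveforms are stand-in sample lists, and the diphone
# inventory is hypothetical.
def to_diphones(phonemes):
    """Turn a phoneme sequence into the ordered pairs (diphones) to fetch."""
    padded = ["#"] + phonemes + ["#"]
    return list(zip(padded, padded[1:]))

def synthesize(phonemes, diphone_db):
    """Concatenate the recorded unit for each diphone, in order."""
    samples = []
    for diphone in to_diphones(phonemes):
        samples.extend(diphone_db[diphone])
    return samples
```

Cutting the units mid-phoneme, where co-articulation effects are smallest, is what gives the concatenated output its continuity.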
3.7 Manpower Development
Trained manpower is a critical issue in language
engineering in India. More than 100 students and
staff have worked in the project for periods ranging
from three months to three years on various research
and development activities. In the process they have
obtained significant theoretical knowledge as well
as practical skills in language technologies.
The LEC-2002 International Conference on Language
engineering included a full day of tutorials by
distinguished experts from India and abroad. These
tutorials were free for students. Several hundred
students could benefit from these.
The first IL-OCR workshop organized by us here
attracted many interested students and researchers
from across the country. The detailed presentations,
demos and discussions were very helpful.
The research portals being set up here will be of
value to beginners in language technology research.
A number of articles, technical reports and research
papers have either been published or are being
prepared for wider dissemination of the ideas,
techniques and technologies. Our website is intended
to serve a similar purpose and is being enhanced
and updated accordingly.
A text book on Natural Language Processing with
specific emphasis and examples from Indian languages
is planned.
With this it should be possible now to organize training
programmes on specific topics to identified target
groups. Suitable course material can also be developed
specifically for the purpose.
4. Epilogue
4.1 Strengths and Opportunities
Our strength lies in our core research competence.
Our team has experts from linguistics, statistics,
computer science, artificial intelligence and cognitive
science. Each member of the team has a very rich and
varied experience. What binds us all is the common
research competence. The tools, techniques and
algorithms used in bio-informatics, image processing,
speech recognition and language technology have
many things in common. Our team emphasizes
this common core.
We had a semester-long seminar cum discussion
series on Markov Models. This semester we have a
semester-long seminar cum discussion series on feature
extraction and feature selection. See the annexure for
details. Thorough investigations of Word Sense
Disambiguation and Shallow Parsing techniques are
going on. This in-depth understanding of the
technologies will enable us to do world class research in
various areas and develop quality products and services.
We also have striven to develop large scale linguistic
data resources that are essential for further research
and development. Our centre is perhaps unique in
having developed a 10 million word corpus and more
than half a dozen dictionaries of significant size and
quality. Large and representative data and the right
kinds of tools will enable us to move much faster.
We have also striven to strike a good balance
between the pure research and publications on the
one hand and product development and technology
transfer to meet the needs of the society on the
other hand. Many of our results are yet to be published
and we hope to bring out several publications soon.
We have also striven to strike a good balance between
long term and short term goals. Two years ago when
we started off almost from scratch on our OCR system,
other experts felt that as far as content creation is
concerned there is nothing better than simply typing
in texts. Today our OCR is one of the successful
ones in the country and combined with our spell
checker, we will soon be able to develop much larger
corpora for Telugu and other Indian languages than
would have been possible otherwise.
We have struggled in this first phase of development
to overcome the teething problems and to look for long
lasting and permanent solutions rather than hop
onto short sighted and immediate solutions. This
approach has paid off and with the data and tools
we have with us now, we hope to be able to move
much faster.
Our future efforts will be in more focused areas of
language and speech engineering. We look forward
to meaningful collaboration with other leaders in the
world for research as well as technology development.
4.2 Outreach
The LEC-2002 Language Engineering conference
organized by us was quite successful and we hope to
be able to organize quality international conferences
in future as well. We also organized the first ever
OCR workshop for Indian scripts. This was a trend
setter of sorts. We laid bare our OCR system in full
detail and others followed. It was perhaps for the first
time that different research groups got to know about
each other’s approach in such great detail. The systems
developed by various groups were demonstrated and
tested publicly. The discussion and debate that
followed helped all the centres to make further
progress. We have been the Indian coordinators for
an Indo-French research network in computational
linguistics and we plan to work more closely with other
groups across the world.
Within the country, we have been maintaining close
technical links with a number of organizations
including the Society for Computer Applications in
Indian Languages, Computer Literacy House,
Computer Vignannam, AP Press Academy, CMC,
Telugu University, Telugu Academy, etc. apart from
commercial companies such as C-DAC and
Modular Infotech.
A large number of very distinguished visitors have
visited our labs and offered their appreciation as well
as very valuable suggestions.
We have been organizing seminars on a regular basis.
So far more than 25 seminars have been organized.
Some of the distinguished speakers include Prof.
Gerard Huet, Dr. Mark Pedersen and Prof. Rajat
5. Publications
1. Chakravarthy Bhagvati, Atul Negi, and B. Chandrasekhar, "Weighted Fringe Distances for Improving Accuracy of a Template Matching Telugu OCR System", in Proc. of IEEE TENCON 2003, Bangalore, 2003.
2. Chakravarthy Bhagvati, T. Ravi, S. M. Kumar, and Atul Negi, "Developing High Accuracy OCR Systems for Telugu and other Indian Scripts", in Proc. of Language Engineering Conference, pages 18-23, Hyderabad, 2003, IEEE Computer Society Press.
3. R. L. Brown, "The fringe distance measure: an easily calculated image distance measure with recognition results comparable to Gaussian blurring", IEEE Trans. Systems, Man and Cybernetics, 24(1): 111-116, 1994.
4. U. Garain and B. B. Chaudhuri, "Segmentation of Touching Characters in Printed Devnagari and Bangla Scripts Using Fuzzy Multifactorial Analysis", in Proc. of Int. Conf. on Document Analysis and Recognition, IEEE Comp. Soc. Press, Los Alamitos (CA), USA, 2001.
5. G. Nagy, S. Seth, and M. Vishwanathan, "A prototype document image analysis system for technical journals", Computer, 25(7), 1992.
6. Atul Negi, Chakravarthy Bhagvati and B. Krishna, "An OCR system for Telugu", in Proc. Int. Conf. on Document Analysis and Recognition, IEEE Comp. Soc. Press, Los Alamitos (CA), USA, 2001.
7. Atul Negi, Chakravarthy Bhagvati, and V. V. Suresh Kumar, "Non-linear Normalization to Improve Telugu OCR", in Proc. of Indo-European Conf. on Multilingual Communication Technologies, pages 45-57, Tata McGraw Hill Book Co., New Delhi.
8. K. Wong, R. Casey, and F. Wahl, "Document analysis system", IBM J. Research and Development, 26(6), 1982.
9. K. Narayana Murthy, B. B. Chaudhuri (Eds), LEC-2002: Language Engineering Conference, IEEE Computer Society Press, 2003.
10. K. V. K. Kalpana Reddy and Jahnavi A, "Text to Speech system for Telugu", MCA thesis.
11. K. Narayana Murthy, "UNICODE: Issues in Standardization of Character Encoding Schemes", Vidyullipi, April 2002.
12. K. Narayana Murthy, Nandakumar Hegde, "Some Issues relating to a Common Script for Indian Languages", International Conference on Indian Writing Systems and Nagari Script, 6-7 February 1999, Delhi University, Delhi.
13. K. Narayana Murthy, "An Indexing Technique for Efficient Retrieval from Large Dictionaries", National Conference on Information Technology NCIT-97, 21-23 December 1997.
14. P. R. Kaushik, K. Narayana Murthy, "Personal Search Assistant: A Configurable Meta Search Engine", Proceedings of AusWeb99 - The Fifth Australian World Wide Web Conference, 17-20 April 1999, Southern Cross University, Australia.
15. P. R. Kaushik, "PSA - A Meta Search Engine for World Wide Web Searching", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1998.
16. P. Naga Samba Siva Rao, "Enhancements to the PSA Model for Meta Search Engines", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1999.
17. K. Narayana Murthy, "MAT 2: Enhanced Machine Aided Translation System", STRANSS 2002 Symposium on Translation Support Systems, 15-17 March 2002, Indian Institute of Technology, Kanpur.
18. K. Narayana Murthy, "MAT: A Machine Assisted Translation System", Fifth Natural Language Pacific Rim Symposium, NLPRS-99, 5-7 November 1999, Beijing, China.
19. K. Narayana Murthy, "UCSG and Machine Aided Translation from English to Kannada", Indo-French Symposium on Natural Language Processing, University of Hyderabad, 21-26 March 1997.
20. Kasina Vamsi Krishna, "Word Sense Disambiguation: A Study", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
21. D. Madhusudhana Rao, V. V. Raghuram, "A Generic Approach to Sentence Segmentation", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
22. K. Narayana Murthy, "Universal Clause Structure Grammar", PhD thesis, Department of Computer and Information Sciences, University of Hyderabad, 1996.
23. K. Narayana Murthy, A. Sivasankara Reddy, "Universal Clause Structure Grammar", Special Issue on Natural Language Processing and Machine Learning, Computer Science and Informatics, Vol. 27, No. 1, March 1997, pp 26-38.
24. K. Narayana Murthy, "Universal Clause Structure Grammar and the Syntax of Relatively Free Word Order Languages", South Asia Language Review, Vol. VII, No. 1, Jan. 1997, pp 47-64.
25. K. Narayana Murthy, "Parsing Telugu in the UCSG Formalism", Indian Congress on Knowledge and Language, January 1996, Mysore, Vol. 2, pp 1-6.
26. Sreekanth D, "A Statistical Syntactic Disambiguation Tool", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1998.
27. Ashish Gupta, "Improvements to the UCSG English Parser", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.
28. D. Srinivasa Rao and M. Suresh Babu, "Design of a Web Based Education Tool", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2000.
29. B. P. V. Prasad and A. Rambabu, "Design of VIRAT: A ProtoPUBLISHER Virtual Authoring Tool for Web-Based Education", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
30. P. Uday Bhaskar and R. Krishna Kishore, "A Tool for Web Based Education", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
31. Rajesh V. Patankar and Tammana Dilip, "VIRAT: An Authoring Tool for Publishing on the Web", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.
32. T. S. Vivek and J. Reddeppa Reddy, "Web Based Education Tool - Online Testing & Evaluation Module", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.
33. P. Siva Rama Krishna and G. Nagachandra Sekhar, "Web Based Education Tool (Instructor Module)", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.
34. T. Ramesh and T. Phani Raju, "Web Based Education Tool (Software Lab Module)", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.
35. G. Murali Krishna and P. Ravi Kumar, "Generic Framework for Developing Web-Lab Experiments", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2002.
36. G. Vamsidhar, "White Board Application - A GUI Tool for Web Based Education Tool", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2002.
37. M. Narsimhulu, "An Enhanced Architecture for Web Based Education", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2002.
38. K. Vasuprada, K. Narayana Murthy, "Part-of-Speech Tagging using a Tri-Tag HMM Model", Second National Symposium on Quantitative Linguistics, 28-29 Feb. 2000, Indian Statistical Institute, Kolkata.
39. K. Vasuprada, "Part of Speech Tagging and Syntactic Disambiguation using Stochastic Parsing Techniques", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 1999.
40. Ravi Mrutyunjaya, "Corpus-Based Stemming Algorithm for Indian Languages", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
41. K. Narayana Murthy, "Theories and Techniques in Computational Morphology", Proc. of the National Seminar on Word Structure of Dravidian Languages, 26-28 November 2001, Dravidian University, Kuppam, pp 365-375.
42. K. Narayana Murthy, "A Network and Process Model for Morphological Analysis/Generation", Second International Conference on South Asian Languages ICOSAL-2, 9-11 Jan. 1999, Punjabi University, Patiala.
43. Anil Kumar, "Morphological Analysis of Telugu Words", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
44. K. Narayana Murthy, "Electronic Dictionaries and Computational Tools", Linguistics Today, Vol. 1, No. 1, July 1997, pp 34-50.
45. A. Sivasankara Reddy, K. Narayana Murthy, Vasudev Varma, "Object Oriented Multipurpose Lexicon", International Journal of Communication, Vol. 6, No. 1 and 2, Jan-Dec 1996, pp 69-84.
46. K. Narayana Murthy, "An Indexing Technique for Efficient Retrieval from Large Dictionaries", National Conference on Information Technology NCIT-97, 21-23 December 1997.
47. Vasudev Varma, K. Narayana Murthy, A. Sivasankara Reddy, "Electronic Dictionaries: A Model Architecture", National Workshop on Electronic Dictionaries for Second Language Learners, Bharatiar University, Coimbatore, 29 Sept. to 7 Oct. 1993.
48. Surya Kiran Mamidi, "Developing Keyboard and Display Drivers for Indian Language Editors", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.
49. Maruti Kumar and Y. N. V. Ganesh Kumar, "Editor for Indian Languages", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2001.
50. V. S. S. Neelima, "Enhancements to the AKSHARA Text Processing System", M.Sc. Computer Science thesis, K R R Vignan Degree and PG College for Women, Hyderabad, 2003.
51. K. Narayana Murthy, "UNICODE: Issues in Standardization of Character Encoding Schemes", Vidyullipi, April 2002.
52. Karen Kukich, "Techniques for Automatically Correcting Words in Text", ACM Computing Surveys, Vol. 24, No. 4, December 1992.
53. K. Narayana Murthy, "Issues in the Design of a Spell Checker for Morphologically Rich Languages", 3rd International Conference on South Asian Languages - ICOSAL-3, 4-6 January 2001, University of Hyderabad.
54. K. S. RajyaShree and K. Narayana Murthy, "Statistical Spell Checking", Indo-UK Workshop on LESAL - Language Engineering in South Asian Languages, 23-24 April 2001, National Centre for Software Technology.
55. K. Narayana Murthy, "Markov Models of Syllables for Spell Checking", Statistical Techniques in Language Processing, 11-12 June 2001, Central Institute of Indian Languages.
56. Naresh Pidatal and Dhanjay Kumar Singh, "Spelling Error Detection and Correction", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
57. Anil Kumar, "Morphological Analysis of Telugu Words", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
58. Ravi Mrutyunjaya, "Corpus-Based Stemming Algorithm for Indian Languages", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, 2003.
59. K. Narayana Murthy, "Design of a Spelling Error Detection and Correction System for Telugu", National Seminar on Language Technology Tools: Implementation for Telugu, September 17-19, University of Hyderabad.
Theses in the area of Indian Language Technologies:
60. Thejavath Ramdas Naik and Manoj Kumar Pradhan, "Off-Line Printed Devanagari Page Layout Analysis and Script Recognition", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, June 2000.
61. Buska Krishna, "Design and Implementation of a Telugu Script Recognition System", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, December 2000.
62. L. K. Prashant Kumar, "A Study of Binarization Methods for Optical Character Recognition", M.Sc.(Tech) thesis, School of Physics, University of Hyderabad, December 2000.
63. D. Vijaya Bhaskar, "Preliminary Study of GOCR to Implement OCR for Telugu", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, 1998-2001.
64. Ajay Koul, "A Study of Gradient Based Feature Extraction for Telugu OCR", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, December.
65. G. Raghava Kumar, "A Study of Nearest Neighbour Computation Techniques and their Applications", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, December 2001.
66. Y. Pradeep Kumar and G. Koteswar Rao, "A Tool-kit for Document Image Processing", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, January 2002.
67. V. V. Suresh Kumar, "Non-Linear Normalization Techniques to Improve OCR", M.Tech. thesis, Department of Computer and Information Sciences, University of Hyderabad, January 2002.
68. Tanuk Ravi and S. Mahesh Kumar, "Enhancements and Redesign of DRISHTI, An OCR System for Telugu", MCA thesis, Department of Computer and Information Sciences, University of Hyderabad, June 2002.
69. Anand Kumar and Arafath Pasha, "Software for OCR", BE thesis, Department of Computer Science, 2001-2002.
70. D. Bhavani, C. Madhuri and S. Shesha Phani, "GUI for an OCR System", B.Tech. thesis, Department of Computer Science, March.
71. I. N. Lekha, "An Experimental Evaluation of Fringe Distance Measures for Malayalam Printed Text OCR", M.Tech.(CT) thesis, School of Physics, University of Hyderabad, May 2003.
6. The Team Members
Dr. K. Narayana Murthy, Dr. Arun Agarwal, Dr. B. Chakravarthy, Dr. S. Bapi Raju, Dr. Atul Negi, Dr. G. Uma Maheshwara Rao, Dr. P. Mohanthy, Dr. P. R. Dadegoankar, Raman Pillai Rajesh, K. Rajini Reddy, P. Muttaiah, K. Naga Sabita, Ganesh Raju, Maruthi, Naresh, Suresh Kumar, Anand, Kwaja Sirajuddin, B. Navatha, P. RamaKrishna Prasad, O. Bhaskar, Vaishnavi, K. Balaji Rambabu, Suresh Kumar, P. Siva Ramakrishna, M. Surya Kiran, T. Ramesh, Ravi Raj Singh, Venu Gopal
Courtesy: Prof. K. Narayan Murthy
University of Hyderabad, Dept. of CIS,
Hyderabad -500046 (RCILTS for Telugu)
Tel: 00-91-40-23100500, 23100518
Extn. 4017, 23010374
Editorial Comment: Because of the very large number of publications by the Resource Centre and the constraint of space, we could not include all the publication details here. To obtain the publications, please contact Prof. K. Narayan Murthy.
Centre for Development of Advanced Computing
(Formerly ER&DCI(T)), Vellayambalam,
Thiruvananthapuram, Kerala, India.
Tel. : 00-91-471-2723333 Extn., 243, 303
E-mail :
Website : http:/ / hdg/ mrc.htm
http:/ /
Resource Centre For
Indian Language Technology Solutions – Malayalam
C-DAC, Thiruvananthapuram
Achievements
C-DAC, Thiruvananthapuram, formerly ER&DCI, Thiruvananthapuram, is one of the thirteen Resource Centres (Resource Centres for Indian Language Technology Solutions) set up across the country by the Ministry of Communications and Information Technology, Govt. of India under the TDIL (Technology Development for Indian Languages) programme. These thirteen Resource Centres are aimed at taking IT to the masses in their local languages and cater to all the constitutionally recognised Indian languages and some foreign languages. The language of focus at C-DAC Thiruvananthapuram is Malayalam, the official language of the state of Kerala.
The main objectives of the "Resource Centre for Indian Language Technology Solutions - Malayalam" (RCILTS-Malayalam) are to build competence and expertise in the proliferation of Information Technology using Malayalam, the regional language of the state of Kerala. Development of Malayalam enabled core technologies and products would give a tremendous fillip to IT enabled services in the state. The comprehensive IT solutions developed would enable the citizens of Kerala to enhance their quality of life using the benefits of modern computer and communications technology through Malayalam. This will help them to better understand their own culture and heritage, and to interact with government departments and local bodies more effectively, besides obtaining a host of other advantages.
Now that the Resource Centre has completed three years of functioning, we are happy to note that we have achieved significant progress and have been able to complete development of all the expected core deliverables of the project within the scheduled time. We have been successful in developing a variety of tools and technologies for Malayalam computerisation and in taking IT to the common Malayalee in his local language.

Many of the products developed under the Resource Centre project are the first of their kind and are significant for enabling Malayalam computerisation. They have good market potential in the present scenario of computerisation and conversion of the official language to Malayalam in the state of Kerala. Various Government departments have purchased our "Aksharamaala" software for Malayalam word processing. Ezuthachan, the Malayalam Tutor, is also in good demand among non-resident Malayalees. We have already sold 25 copies of it and are trying for a marketing tie-up with some business houses.
The range of our products includes:

• Knowledge Resources such as Malayalam Corpora, a Trilingual (English-Hindi-Malayalam) Online Dictionary, and knowledge bases for the Literature, Art and Culture of Kerala.
• Knowledge Tools for Malayalam such as a Portal, Fonts, Morphological Analyser, Spell Checker, Text Editor, Search Engine and Code Converters.
• Human Machine Interface systems comprising Optical Character Recognition and Text to Speech systems.
• Services like an E-Commerce application and E-Mail server in Malayalam, and Language Tutors for Malayalam and English.

In addition, regular training courses are being conducted as part of Language Technology Human Resource Development. There is also regular interaction with the Government of Kerala for providing solutions in the areas of standardisation, computerisation of various Government departments and conversion of the official language to Malayalam. We have been providing consultancy to individuals and organisations regarding Language Technology applications. Given below is a detailed description of the products developed and the achievements of the Malayalam Resource Centre. Our Resource Centre developed the following technologies/products.
1. Human Machine Interface Systems
1.1 NAYANA™ - Optical Character Recognition
System for Malayalam
The Malayalam OCR system converts scanned images of printed Malayalam documents to editable text. It is a multi-font system that works across a range of font sizes, with a recognition speed of fifty characters per second.

The system consists of a preprocessing module, the OCR engine and a post-processing module. The block diagram of the system is given in Figure 1.

Figure 1: Block diagram of the Malayalam OCR system.
The preprocessing tasks performed by the first module include noise removal, conversion of the grey scale image to binary, skew detection and correction, and line, word and character segmentation. The scanned grey-tone images are converted into two-tone (binary) images using a histogram based thresholding approach (Otsu's algorithm). Skew detection is done using the projection profile based technique; after estimating the skew angle, the skew is corrected by rotating the image against the estimated angle.
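The two preprocessing steps named above can be sketched in a few lines. This is an illustrative Python/NumPy version, not the C-DAC implementation: the function names are ours, and rotation is approximated by a vertical column shear, which is adequate for the small (-5 to +5 degree) angles the system handles.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance
    for a uint8 grayscale image (Otsu's algorithm)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def estimate_skew(binary, angles=np.arange(-5, 5.25, 0.25)):
    """Projection-profile skew estimate: shear each column vertically to
    simulate a small rotation and pick the angle whose horizontal
    projection profile is sharpest (highest variance)."""
    h, w = binary.shape
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        shift = np.round(np.arange(w) * np.tan(np.radians(a))).astype(int)
        sheared = np.zeros_like(binary)
        for x in range(w):
            sheared[:, x] = np.roll(binary[:, x], shift[x])
        score = sheared.sum(axis=1).var()
        if score > best_score:
            best_score, best_angle = score, float(a)
    return best_angle
```

Once the best angle is found, the page is rotated by its negative before segmentation.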
The OCR engine (Character Recognition Module) is based on the feature extraction method of character recognition. Feature extraction can be considered as finding a set of vectors which effectively represent the information content of a character. The features are selected in such a way that they help in discriminating between characters. A multistage classification procedure is used, which reduces the processing time while maintaining accuracy. After passing through the different stages of the classifier, the character is identified and the corresponding character code is assigned. A training module is incorporated in the OCR engine to recognize characters which differ from normal characters in shape and style (for example, decorative fonts).
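To illustrate the multistage idea (this is not the actual NAYANA engine), the sketch below uses a cheap coarse feature, overall ink density, to prune the candidate classes before running a nearest-neighbour match on zoning features; zoning density is assumed here merely as one common feature choice.

```python
import numpy as np

def zoning_features(glyph, grid=4):
    """Divide a binary glyph image into grid x grid zones and use the
    ink density of each zone as the feature vector."""
    h, w = glyph.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            zone = glyph[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            feats.append(zone.mean())
    return np.array(feats)

def classify(glyph, templates, coarse_tol=0.15):
    """Two-stage classifier: stage 1 prunes templates whose overall ink
    density is too different; stage 2 runs nearest-neighbour on zoning
    features over the survivors only, saving distance computations."""
    feats = zoning_features(glyph)
    density = glyph.mean()
    candidates = [(label, t) for label, t in templates.items()
                  if abs(t.mean() - density) <= coarse_tol] or list(templates.items())
    best = min(candidates,
               key=lambda lt: np.linalg.norm(zoning_features(lt[1]) - feats))
    return best[0]
```

The training module mentioned above would, in this picture, simply add new (label, template) pairs for unusual shapes such as decorative fonts.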
In the post-processing module, linguistic rules are applied to the recognised text to correct classification errors. For example, certain characters never occur at the beginning of a word and, if found there, are remapped appropriately. Similarly, dependent vowel signs can occur only with consonants or consonant conjuncts; if found along with vowels or soft consonants, they are remapped into consonants/conjuncts similar in shape to the vowel sign. Independent vowels occur only at the beginning of a word; if found anywhere else, they are mapped into a consonant or ligature having a similar shape.
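These positional constraints can be expressed as a small rule pass over character classes. The sketch below is hypothetical: it uses the Unicode Malayalam block for readability (the system itself works with ISCII/ISFOC codes), and the confusion table mapping an illegal character to its similarly shaped replacement is a placeholder to be filled from the classifier's actual confusions.

```python
# Character categories from the Unicode Malayalam block (U+0D00-U+0D7F).
INDEPENDENT_VOWELS = {chr(c) for c in range(0x0D05, 0x0D15)}  # a .. au
CONSONANTS = {chr(c) for c in range(0x0D15, 0x0D3A)}          # ka .. ha
VOWEL_SIGNS = {chr(c) for c in range(0x0D3E, 0x0D4D)}         # dependent signs

def postprocess(word, confusion):
    """Remap characters that violate the positional rules.
    `confusion` maps a misrecognised character to the similarly
    shaped character it should become (supplied per OCR engine)."""
    out = []
    for i, ch in enumerate(word):
        if ch in VOWEL_SIGNS:
            # A dependent vowel sign may only follow a consonant.
            if i == 0 or word[i - 1] not in CONSONANTS:
                ch = confusion.get(ch, ch)
        elif ch in INDEPENDENT_VOWELS and i > 0:
            # An independent vowel may only start a word.
            ch = confusion.get(ch, ch)
        out.append(ch)
    return "".join(out)
```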
Performance of the OCR

Developed on the VC++ platform, our Malayalam OCR runs on Windows 98/2000. It recognises 50 characters per second and gives an accuracy of 97% for good quality printed documents. The specifications and performance of the system are given below.

Skew detection and correction : -5 to +5 degrees
Supported image formats : BMP, TIFF
Image scan resolution : 300 dpi and above
Document type : Single-font single
Supported fonts : CDAC fonts (ML-TTKarthika), Mathrubhumi font, Manorama font, fonts used by DC
Font size : 12-20
Font styles : Normal, Bold
Supported code formats : ISCII/ISFOC
Supported output formats : RTF/HTML/

Character recognition accuracy (%):

Document type       Good quality paper   Bad quality paper
Computer printed    97%                  94%
Magazine            92%                  90%
Newspaper           85%                  82%
Books               95%                  93%

Extensive testing has been done on approximately 500 pages of printed documents of different quality; Table 1 consolidates the results. The system has undergone certification testing at ETDC Chennai.
The Malayalam OCR can be integrated with a Malayalam Text to Speech system to get a text reading system for the visually challenged. Other application areas include the publishing sector, content creation, digital libraries, corpus development etc.
1.2 SUBHASHINI™- Malayalam Text to Speech
System (TTS)
The Malayalam Text to Speech system SUBHASHINI™ is a Windows based software which converts Malayalam text files into fairly intelligible speech output. The software is integrated with a text editor having both ISCII and ISFOC support. The editor supports the INSCRIPT keyboard layout.

The TTS is based on speech synthesis by diphone concatenation and consists of the following four modules:

• Diphone Database
• Text Processing module
• Prosodic Rules Generator
• Speech Synthesiser

The diphone database consists of 2500 diphones segmented from recorded words. All the commonly used allophones are also included.

The text-processing module organizes the input sentences into manageable lists of words. It also identifies the punctuation symbols, abbreviations, acronyms and digits in the input data and tags the input accordingly. These are then processed and converted to a phonetic language, a form that the speech engine is able to recognise.

Rules for adding prosody to the speech output are generated using the speech corpus. This includes the pitch and delay information for different intonations.

The concatenation of the diphones corresponding to the text is done in the synthesis module, which produces the speech output. We are using the MBROLA speech engine for speech synthesis.

Application areas include text reading systems, announcement systems, and systems providing voice interfaces.

2. Knowledge Tools

2.1 NERPADAM™ - Malayalam Spell Checker

Nerpadam is a software subsystem that checks the spelling of words in a Malayalam text file. It can be integrated with Microsoft Word as a macro, or with the Malayalam editor Stylepad developed by us. While running as a macro in Word, it functions as an offline spell checker, in the sense that it can only be used on previously typed text. Both offline and online checking are possible when it is integrated with the text editor. It generates suggestions for wrongly spelt words.

The system adopts a rule-cum-dictionary-based approach to spell checking. It incorporates a fully developed Morphological Analyser for Malayalam. This module splits the input word into root word, suffixes, postpositions etc. and checks the validity of each using the rule database. Finally it checks whether the root word is present in the dictionary. If anything goes wrong in this checking, the word is flagged as an error and reprocessed to get 3 to 4 valid words, which are displayed as suggestions. The user can add new words to a personalised database file, which can be added to the dictionary if required.

Application areas: all Malayalam word processing jobs.
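One simple way to produce the "3 to 4 valid words" offered as suggestions is to generate every string within edit distance 1 of the error word and keep those that the dictionary validates. This is an illustrative technique only; Nerpadam's actual reprocessing goes through its morphological analyser and is not reproduced here.

```python
def suggestions(word, lexicon, alphabet, limit=4):
    """Candidate corrections at edit distance 1 (deletion, substitution,
    transposition, insertion), filtered against the lexicon."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits = set()
    for left, right in splits:
        if right:
            edits.add(left + right[1:])                        # deletion
            for c in alphabet:
                edits.add(left + c + right[1:])                # substitution
        if len(right) > 1:
            edits.add(left + right[1] + right[0] + right[2:])  # transposition
        for c in alphabet:
            edits.add(left + c + right)                        # insertion
    return sorted(e for e in edits if e in lexicon)[:limit]
```

For a morphologically rich language, `lexicon` membership would itself be a call into the analyser rather than a plain set lookup.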
2.2 AKSHARAMAALA™ Malayalam Font
Package and Script Manager
The "AKSHARAMAALA" software package consists of two sub-packages: "MSM", a Malayalam Script Manager, and "Vahini", a Malayalam font package. The package complies with the standard INSCRIPT keyboard layout, the Phonetic Keyboard layout and the ISFOC standard, and enables the use of Windows based applications for Malayalam data processing. Some of the packages supported by this application are MS Office, PageMaker, Adobe Illustrator, MS Frontpage, Macromedia Dreamweaver, CorelDraw and Lotus SmartSuite. The package is intended for use under Windows 95, 98, NT, 2000, ME and XP.
The "MSM" package consists of a keyboard manager which supports the INSCRIPT keyboard overlay and a Phonetic keyboard overlay for the entry of Malayalam characters. The manager can switch between the entry of Malayalam and English characters with the help of a switch-over key. The appropriate key combinations automatically render the Malayalam characters, such as conjuncts and soft consonants. The keyboard manager is designed to work with Malayalam ISFOC monolingual as well as bilingual fonts, and also supports the use of web fonts for data entry. Options are provided in the package to turn on these features.
The "Vahini" package is a collection of Malayalam fonts that can be used for word processing, web publishing, data processing etc. in MS-Windows applications. It contains a number of high quality TrueType fonts, both monolingual and bilingual, complying with the ISFOC standard. The bilingual fonts are those that contain both Malayalam and English characters. A few web fonts are also provided for use in web pages; these are monolingual TTF fonts specially designed for use with browsers. We have also developed dynamic (PFR) fonts, which can be used in web page development so that the user can view the web content without the font actually being installed on the machine. These fonts are platform independent and work in both Internet Explorer and Netscape.
A Unicode font in OpenType format is also available. The Government of Kerala is considering the AKSHARAMAALA package for teaching Malayalam data entry to the beneficiaries of "Akshaya", a project aimed at providing computer literacy to at least one person per family, and also for the e-Governance applications of the Government of Kerala. The AKSHARAMAALA package can be used with standard content creation tools to develop content in Malayalam.
2.3 Text Editor: A basic Malayalam text editor, "STYLEPAD", has been developed. It incorporates all the facilities available in Notepad, together with provision to save Malayalam documents in ISCII format and read ISCII files.

2.4 Code Converters: Content creation in Malayalam is accomplished using different fonts by different organisations and individuals; most of the online Malayalam newspapers have their own proprietary fonts. This lack of standards in font coding makes data retrieval a difficult task. To alleviate this, we have developed font converters for most of the commonly used Malayalam fonts.
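A font converter of this kind is, at its core, a longest-match table lookup from proprietary glyph codes to the standard encoding. The mapping below is entirely hypothetical; real tables are constructed per font, and some glyph sequences map to several standard characters (or vice versa).

```python
# Hypothetical glyph-code mapping for illustration; real tables are
# built by inspecting each proprietary font's encoding.
MAPPING = {
    "\x97\x85": "\u0d15\u0d4d\u0d15",  # a conjunct stored as two glyph codes
    "\x97": "\u0d15",                  # a single glyph code
}

def convert(text, mapping=MAPPING):
    """Greedy longest-match-first conversion from a proprietary font
    encoding to the standard encoding (Unicode here)."""
    keys = sorted(mapping, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(mapping[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # pass through unmapped characters
            i += 1
    return "".join(out)
```

Trying the longer glyph sequences first is what keeps conjuncts from being broken into their component codes.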
2.5 ANWESHANAM™ – Malayalam Web based
Search Engine
"Anweshanam" is a directory based search engine which searches for Malayalam content and information in web pages. It provides a Malayalam interface that helps the user to search for information quickly and easily on the web. It searches for specific Malayalam keywords and generates a list of links to web pages containing the searched information; the search covers specific content in the pages on the server. The solution was developed using Java Server Pages.

The interface is in Malayalam and includes easy input facilities with a keyboard driver and a floating character map. The keyboard driver supports the INSCRIPT keyboard with automatic formation of conjuncts. The application provides facilities for searching either English or Malayalam keywords. The result pages contain links to the web pages containing the searched information, along with a brief description in Malayalam. A form is also provided whereby new pages and sites can be added to the database with details of their content.

The application can be expanded to function as a multilingual search engine and also as a meaning based search engine.
This application is beneficial to Multilingual portal
developers and is useful for any one looking for some
particular Malayalam content on the web.
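A directory based search of this kind can be sketched as an inverted index over page descriptions. Because tokens are just Unicode strings, English and Malayalam keywords are handled identically. This is an illustration, not the JSP implementation used in Anweshanam.

```python
from collections import defaultdict

def build_index(pages):
    """pages: {url: text}. Build an inverted index mapping each
    keyword to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.split():
            index[token].add(url)
    return index

def search(index, query):
    """Return pages containing every keyword in the query (AND search);
    tokens may be English or Malayalam."""
    results = None
    for token in query.split():
        hits = index.get(token, set())
        results = hits if results is None else results & hits
    return sorted(results or [])
```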
This application is presently being run as a service on our website and portal,
2.6 Malayalam Portal
We have designed and developed a Malayalam web portal, www.malayalamresourcecent alias Some of the facilities/contents provided in the portal are:

• Malayalam version of the Constitution of India.
• A newspaper, "Pradeepam", published from
• A knowledge base of traditional Kerala art forms and culture (both English and Malayalam).
• A knowledge base of Malayalam literature.
• The Malayalam literary classic "KRISHNAGATHA" by Cherusseri and the grammatical classic,
• Full text of the Sanskrit Ayurvedic classics Charakasamhita and Susrutasamhita (transliterated in Malayalam and ten other languages), along with their Malayalam interpretations.
• PRAKES (Prakruthi estimate) - an interactive software package for estimating the Prakruti (constitution) of a person based on Ayurvedic concepts. Available in both English and Malayalam.
• A database of forms commonly employed by the Govt. of Kerala (City Corporations, Motor Vehicles, Revenue, Civil Supplies Depts., etc.). Altogether 76 forms from different government departments have been put on the web.
• SSLC (Mathematics and Science) question papers and answers for the last five years, in Malayalam.
• A knowledge base in Malayalam for rubber cultivators.
• A tourist aid package named "Explore-Kerala", which can be accessed from WAP enabled mobile phones.
• Malayalam brochures on Cancer Awareness.

In addition, some of the tools and technologies developed (Sandesam, Anweshanam, the Dictionary and the E-commerce application) have been integrated in the portal for demonstration. The contents of the website are continuously being upgraded and enhanced. The number of visitors to the site has exceeded 20,500 since June 2002.
3. Services

3.1 Sandesam - Web Based Mail Service in Malayalam

Sandesam is a solution for a web based mail service in Malayalam. The back-end mail server is based on Qmail, Vmailmgr and Courier-IMAP running on Red Hat Linux. The Qmail server supports the SMTP and POP3 services, the Vmailmgr program manages user accounts, and the Courier-IMAP server supports the IMAP service. The web interface is developed using Java Server Pages and the JavaMail API and does not use Java on the client side. The database server used is PostgreSQL.
This is a lightweight and fast solution that provides an easy-to-use web interface in Malayalam. The storage of mailboxes is based on the maildir structure, and mail is read directly from the disks. The administration of user accounts is done through the Vmailmgr program. Authentication is also done via Vmailmgr for both the Qmail and Courier servers. The service gives the user complete access to his POP3 or IMAP mailboxes via an easy-to-use web interface.
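The maildir layout mentioned above can be worked with directly from code. The sketch below is not part of Sandesam itself; it only illustrates, using Python's standard mailbox module, how a maildir mailbox is a plain on-disk structure that can be written and read without a mail server in between:

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# A maildir mailbox is just a directory with new/, cur/ and tmp/
# subdirectories; each message is a separate file on disk.
root = tempfile.mkdtemp()
md = mailbox.Maildir(os.path.join(root, "user1"), create=True)

# Delivering a message is a plain file write.
msg = EmailMessage()
msg["From"] = "a@example.com"
msg["To"] = "b@example.com"
msg["Subject"] = "Test"
msg.set_content("hello")
key = md.add(msg)

# Reading it back needs no server: the message comes straight
# from the disk, as described for Sandesam's storage.
stored = md[key]
print(stored["Subject"])  # -> Test
```

This is why the service can serve POP3 and IMAP from the same store: both Qmail and Courier-IMAP operate on the same maildir files.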
Some of the facilities provided in the service are IMAP support with user-manageable folders and extensive MIME support for attachments. The interface is in Malayalam and includes easy input facilities with keyboard drivers and floating character maps. The service provides facilities for sending and receiving mail in Malayalam, and for the storage of addresses in an address book with Malayalam names and descriptions. The service also provides facilities for user configuration, such as changing the password, setting quotas for mailboxes etc.
This solution can be expanded to support any IMAP client mail server running on the Linux or Windows platform.
This solution is beneficial for small and medium ISPs, business organizations, Government Departments and multilingual portal developers.
This solution is presently being run as a service on our website and portal, with the domain id
3.2 Malayalam E-com Application
An e-commerce application in Malayalam has been developed to help computer literates who are not proficient in English to purchase goods online. The solution is developed using Java Server Pages; it uses pure HTML and JSP and is viewable in any browser.
The web interface is in Malayalam and basically contains a window displaying the products for sale with their descriptions in Malayalam. A shopping cart is provided whereby goods to be purchased can be added and removed at will. The online bill for the cart, with price details, can be viewed during the process, and the purchase can then be finalized by filling up and submitting an order form. The payment can be made by cheque or DD, giving its details in the order form and sending the instrument by post.
The application also handles inventory details by displaying the available stock, automatically updating the stock after each purchase and prompting certain functions on reaching certain limits. The facilities provided in the web interface are easy input facilities, such as a keyboard driver and floating character map for the entry of Malayalam data, and acknowledgement of the purchase order by e-mail.
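The cart and stock-keeping behaviour described above can be sketched in a few lines. The class below is purely illustrative (Python rather than the JSP used in the actual application, with made-up product names, prices and a hypothetical low-stock threshold):

```python
class Store:
    """Toy shopping cart with inventory tracking (illustration only)."""
    LOW_STOCK = 2  # hypothetical threshold that triggers a reorder prompt

    def __init__(self, catalogue):
        # catalogue: product -> (price, units in stock)
        self.price = {p: v[0] for p, v in catalogue.items()}
        self.stock = {p: v[1] for p, v in catalogue.items()}
        self.cart = {}

    def add(self, product, qty=1):
        # refuse to add more than the displayed available stock
        if self.stock.get(product, 0) < self.cart.get(product, 0) + qty:
            raise ValueError("not enough stock")
        self.cart[product] = self.cart.get(product, 0) + qty

    def remove(self, product):
        self.cart.pop(product, None)

    def bill(self):
        # online bill with price details, viewable during the process
        return sum(self.price[p] * q for p, q in self.cart.items())

    def checkout(self):
        # finalize: update stock automatically and flag items running low
        for p, q in self.cart.items():
            self.stock[p] -= q
        low = [p for p, s in self.stock.items() if s <= self.LOW_STOCK]
        total = self.bill()
        self.cart = {}
        return total, low

shop = Store({"soap": (10, 5), "oil": (45, 3)})
shop.add("soap", 2)
shop.add("oil", 2)
total, low = shop.checkout()
print(total, low)  # -> 110 ['oil']
```

The "prompting certain functions on reaching certain limits" mentioned in the text corresponds to the low-stock check in `checkout`.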
The application can be extended to support secure online payment with credit cards.
The application is useful for small or medium organizations to increase their sales coverage through the Internet. It is useful for multilingual portal developers and is beneficial to computer literates not proficient in English but familiar with Malayalam.
The application is presently employed in our website and portal for the sale of the products developed at the Centre.
4. Knowledge Resources
4.1 Trilingual Dictionary
An online trilingual (English-Hindi-Malayalam) dictionary has been developed. It contains over 50,000 words in each language. English-based search and advanced search facilities have been implemented; search based on the other two languages is being implemented.
The main features of the dictionary are:
• Portable in XML format.
• ISCII based.
• Retrieval based on Parts of Speech (POS).
• Word search can be made in all three languages. (Implementation in progress)
• Description of every word with an example.
• Advanced search facilities.
• Extremely fast processing.
The dictionary can be integrated with any other application or web portal. It can also be used as an aid for translation – both machine-aided and human. It helps the study of the concerned languages with relative ease.
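Because the dictionary is portable in XML, any application can consume it with a generic XML parser. The sketch below uses a hypothetical entry layout (the element names, romanized translations and example text are invented for illustration; the actual TDIL schema may differ):

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for a trilingual entry (illustration only).
DICT_XML = """
<dictionary>
  <entry pos="noun">
    <english>water</english>
    <hindi>paani</hindi>
    <malayalam>vellam</malayalam>
    <example>Water is essential for life.</example>
  </entry>
</dictionary>
"""

root = ET.fromstring(DICT_XML)

def lookup(word, lang="english"):
    """Return (POS, translations) for a head word in the given language."""
    for entry in root.iter("entry"):
        if entry.findtext(lang) == word:
            trans = {c.tag: c.text for c in entry if c.tag != "example"}
            return entry.get("pos"), trans
    return None

print(lookup("water"))
```

Storing the POS as an attribute is what makes retrieval by Parts of Speech, one of the listed features, a simple filter over entries.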
5. Language Tutors
5.1 Ezhuthachan - The Malayalam Tutor
"EZHUTHACHAN" is a Malayalam tutor package aimed at teaching Malayalam to foreigners and second-generation Keralites living abroad. It is basically a multimedia package with animations, which show the method of writing the letters, and sound giving the pronunciation of characters and words. A writing pad is also provided, where a shaded model of each character appears and the user can practice writing over this using the mouse. The application lists the commonly used expressions in Malayalam (words for daily use, knowing somebody, numbers, days, months, colours, animals, birds etc.), with their pronunciation and English meaning. The contents are formatted into well-structured chapters. A test module is also provided at the end. The package also contains an English-Malayalam dictionary of 2000 words.
The Honourable Minister of State for Urban Development & Poverty Alleviation, Shri. O. Rajagopal (presently Minister of State for Defence), formally released the package on 19th October, 2002. The Chief Executive of the Non Resident Keralites Welfare Association (NORKWA), Shri. Satish Namboothiripad, IAS, received a copy of EZHUTHACHAN from the Honourable Minister. NORKWA is considering the "EZHUTHACHAN" package for their programme of online teaching of Malayalam for non-resident Malayalee children. We are receiving a lot of enquiries and have already sold some copies of the software.
5.2 English Tutor
English Tutor is a multimedia-based application intended to help nursery/primary school students, or in general anyone interested in learning English, to learn the basics of English through Malayalam. The contents of the application are organised into different modules that comprise interactive learning programmes to develop skills for reading and writing alphabets, words and sentences. The animations, pictures and sounds teach the association of objects and words along with their spelling and pronunciation.
6. Other Activities
6.1 Providing Technology Solutions
C-DAC, Thiruvananthapuram has entered into a contract with the Govt. of Kerala to be a Total Solution Provider for IT implementation in Government. The Government has already implemented "Project Grameen" for effective dissemination of IT to the grass-root level. C-DAC, Thiruvananthapuram, together with the Library Council, has set up Community Information Centres in fourteen districts of Kerala. C-DAC, Thiruvananthapuram is the executing agency of the E-Governance projects of the Government of Kerala.
6.2 Interaction with State Government
We have been an active member of the state-level Expert Committee for standardisation of the Malayalam keyboard and character encoding. The same committee has recommended the modifications to be made in Malayalam Unicode. The Committee has submitted its final report and the Government has approved it.
Recently the Government of Kerala launched a programme called "Akshaya", which is aimed at making at least one person per family in Kerala computer literate. C-DAC, Thiruvananthapuram is a member of the committee set up for evaluating the study material for this programme.
C-DAC, Thiruvananthapuram is a member of the General Council of the IT@School Project of the Government of Kerala.
6.3 Training Activities
C-DAC, Thiruvananthapuram, together with the Centre for Development of Imaging Technology (C-DIT), conducted a two-day workshop on Font Design in April 2002.
The ongoing computerisation of various state government departments, along with the increased use of Malayalam as the official language, has created a need for a large number of computer-literate personnel with proficiency in Malayalam data processing. A module on Malayalam word processing tools has been incorporated into the PGDCA course and other public and corporate training programmes offered by C-DAC, Thiruvananthapuram.
C-DAC, Thiruvananthapuram has formed a Consortium of Industries Working in the area of Malayalam Computing (COWMAC). The Consortium aims at mutual interaction among the participating agencies, so that standardisation in the areas of font coding, IT vocabulary, keyboard overlay, transliteration schemes etc. can be done effectively. It also aims at avoiding duplication of development activities by way of closer interaction between developers.
7. Publications
1. Malayalam Spell Checker - presented at the International Conference on "Universal Knowledge and Language" in Goa in November 2002.
2. Optical Character Recognition System for Printed Malayalam Documents and
3. Text to Speech System for Malayalam - presented at the SAP workshop at the Centre for Applied Linguistics and Translation Studies (CALTS), University of Hyderabad, in March 2003.
4. Text Reading System for Malayalam - selected for presentation at the International Conference "Information Technology: Prospects and Challenges in the 21st Century" in Kathmandu, Nepal in May 2003.
We have also prepared the final proposal for modifications to be made in the next version of Malayalam Unicode, and the Malayalam Design Guide for TDIL.
8. Expertise Gained
The Malayalam Resource Centre consists of a core team of design engineers, programmers and linguists proficient in natural language processing. We have gained expertise in the areas of image processing, speech synthesis, font design, the phonology of the Malayalam language and the morphological analysis of Malayalam.
We have also built up technical capability in the use of database packages like MS Access and PostgreSQL, and in languages and tools such as XML, HTML, DHTML, Java Server Pages, Java, JavaScript, VC++, C++, VB, Diaphone Studio, Macromedia Flash, C-DAC Iplugin & Leap Office, Macromedia Fontographer and Bitstream WebFont Wizard 1.0.
Expertise in configuring server-side software is also available at the Centre.
Web Servers : Apache (Linux), Tomcat
Mail Servers : Qmail, Sendmail
9. Future Plans
We plan to take up development of the following products as the second phase of the Resource Centre project:
• Online Character Recognition System for Malayalam
• Lexical Resources for Machine Translation
• Speech to Text
• Malayalam WordNet and
• Porting of the various applications developed to the Linux platform.
10. The Team Members
Ravindra Kumar R
Sulochana K.G
Jithesh .K
Jose Stephen
Santhosh Varghese
Shaji V.N
Vipin C Das
Keith Fernandes
Sanker C. S
Sunil Babu
Hari Kumar
Neena M.S
Praveen V.L
Mithra R.S.
Praveen Kumar
Hari .K
Dr. V.R Prabhodhachandran Nayar
Dr. Usha Nambudripad
Courtesy: Prof. Ravindra Kumar
C-DAC, Vellayambalam
Thiruvananthapuram-695 033
(RCILTS for Malayalam)
Tel: 00-91-471-2723333, 2725897, 2726718
E-mail :
School of Computer Science and Engineering
Anna University, Chennai-600025 India
Tel. : 00-91-44-2351265 Extn : 3340
E-mail :
Web Site: http://annauniv.edu/rctamil
Resource Centre For
Indian Language Technology Solutions – Tamil
Anna University, Chennai
Achievements
The Resource Center for Indian Language Technology Solutions – Tamil, Anna University has been actively working in the area of language technology, with special emphasis on the following:
• Linguistic Tools
• Language Technology Products
• Content Development
• Research on Information and Knowledge
Our website http://annauniv.edu/rctamil showcases the highlights of these activities. Downloadable demos are available for selected products.
A detailed report on the activities of the center is presented below.
1. Knowledge Resources
Screen shots of Knowledge Resources
The following activities have been undertaken for the development of knowledge resources:
• Online Dictionary
• Corpus Collection Tool
• Content Authoring Tool
• Tamil Picture Dictionary
• Flash Tutorial
• Handbook of Tamil
• Karpanaikkatchi - Scenes from Sangam
• District Information
• Educational resources
• Appreciation of Tamil poetry
• Geography for You
• Chemistry for X Standard
1.1 Online Dictionary
The Online Dictionary is language-oriented software for the retrieval of lexical entries from monolingual and bilingual dictionaries by many users simultaneously. It has been designed to interact with various language-processing tools – morphological analyzer, parser, machine translation etc.
The Tamil dictionary contains 20,000 root words. Each entry in the dictionary includes the Tamil root word, its English equivalent, different senses of the word, and the associated syntactic category. The root words are classified into 15 categories.
The special features of this Online Dictionary are the following:
• It is an inflectional or derivational dictionary. That is, the search can be based on an inflected word. Given an inflected word, it calls the morphological analyzer to get the root word and searches the dictionary. In addition to providing information about the root word from the dictionary, it also provides the analyzed information about the inflection.
• It supports search based on syntactic category.
• It also supports search based on parts of a word.
• Tamil equivalents of given English words can also be obtained.
Status : Dictionary is available with 20,000 root
words. In effect, since it is an inflectional dictionary,
it is possible to find the meaning of over 1 lakh
words. It is to be enhanced to 35,000 root words.
1.2 Corpus Collection Tool
A corpus with 12 lakh words has been built from Tamil newspaper articles, using a Corpus Collection tool. This tool automatically collects articles from different websites and stores them in a corpus database. The corpus database is organized into a word database and a document database. The databases have the following information: the article, the category of the article, key words, title, date, author, source, and the unique words and their frequencies. The tool collects articles available in different fonts, and converts them to a standard font before storing them in the database.
Status : Available for use with 12 lakh words.
Collection is an on-going process.
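The word-database/document-database split described above can be sketched with a global frequency table plus per-article records. This is a toy illustration in Python (the article text and field names are invented; the real tool also performs font conversion before storage):

```python
import re
from collections import Counter

# Toy corpus database: a global word-frequency table (word database)
# plus per-article metadata records (document database).
word_db = Counter()
doc_db = []

def store_article(title, category, text):
    words = re.findall(r"\w+", text.lower())
    word_db.update(words)                       # word database
    doc_db.append({                             # document database
        "title": title,
        "category": category,
        "text": text,
        "unique_words": sorted(set(words)),
    })

store_article("Rain", "weather", "Heavy rain in Chennai. Rain continues.")
print(word_db["rain"], len(doc_db[0]["unique_words"]))  # -> 2 5
```

Keeping unique words and frequencies alongside each document is what makes later retrieval and statistics over the 12-lakh-word corpus cheap.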
1.3 Content Authoring Tool
This tool is used to create and deliver content, integrating text, sound, video, and graphics. The design of the tool is generic, so that it can be adapted to any domain. The information can be organized hierarchically with multiple levels, and content can be updated at any time, with new levels being added. There are primarily two different modes in which this operates – the authoring mode, used by the content developer to organize and present the content, and the viewing mode, used by the end-user to view the contents. The authoring mode stores the content in a database, and the viewing mode serves the pages using JSP. The user can select the sections that he/she wants to view.
Status : Testing in progress.
1.4 Tamil Picture Dictionary
This is a pict ure book for children. It has about
t hree hundred words - organized as nouns and
verbs. Each word has on associat ed pict ure, an
explanat ion, t he English equivalent and a small
poem t o illust rat e it s meaning.
Status : Available on our website with 300 words.
Audio capability is to be added.
1.5 Flash Tutorial
This tutorial, in Tamil, provides an easy way to learn Macromedia Flash, a content development tool. Flash is a popular multimedia tool for enhancing or designing web pages and CBT packages. The tutorial contains a set of technical keywords along with the content. It starts with the basics of Flash and methodically moves up to an advanced level. The components available in the toolbox are explained, as is the creation of movies and animated images. Major concepts like tweening and symbol conversion are discussed in detail. Snapshots are provided wherever necessary, and plenty of examples are given. The English equivalents of the keywords are given within brackets.
Status : Available on our website
1.6 Handbook of Tamil
This e-handbook gives an overview and the special features of Tamil grammar and literature. The handbook is divided into four sections - Introduction, Literature, Grammar and Others. The 'Introduction' section gives the history of the Tamil language and details about the Tamil-speaking people. The second section provides an overview of Tamil literature, starting from the Sangam age up to modern times. The grammar section outlines the classification schemes used in Tamil grammar and highlights important features. The last section provides miscellaneous information such as the names of days, months, measures, traditional art forms etc. in Tamil.
Status : Available on our website
1.7 Karpanaikkatchi - Scenes from Sangam
This is a visualization of scenes from Sangam literature. Selected poems depicting various moods pertaining to the Sangam period have been chosen. The poems, their interpretation and specially created imagery picturising the mood conveyed by these poems are presented.
Status: Development in progress
1.8 District Information
• Gives st at ist ical, in format ive an d special
feat ures about dist rict s of Tamil Nadu
• User can browse t hrough t he informat ion
dist rict wise or subject wise
Status : Under development
1.9 Educational Resources
• Appreciation of Tamil Poetry
• Geography for You
• Chemistry for X Standard
1.9.1 Appreciation of Tamil Poetry
The aim of this package is to introduce the nuances of Tamil poetry to young children through music, pictures, animation and games. The package consists of 25 lessons that take children from the basics of Tamil poetry to an advanced level of appreciation. Each lesson consists of the poem, segmentation of the compound words, word-by-word meaning, interpretation of the poem, related poems, author information, and exercises.
1.9.2 Geography for You
This package is available both in Tamil and in English. Its aim is to systematically introduce the concepts of geography, catering to students from the primary to the higher secondary stages. A total of 33 lessons have been organized accordingly. The topics covered include the atmosphere, the universe, the solar system, the earth, the natural regions of the world, and the continents. Images, audio, and animation have been added to make this appealing to students.
1.9.3 Chemistry for X Standard
This package has been specifically targeted at tenth standard students. Here again the focus is on presenting concepts, and it is presented both in Tamil and English. The package consists of eight topics – periodic classification, atomic structure, chemical bonding, phosphorus, halogens, metals, organic chemistry, and the chemical industry. In addition to innovatively using images and audio, the highlight of this package is the simulation of experiments using animation.
2. Knowledge Tools
The following tools have been developed for language and information processing:
• Language Processing Tools
  • Morphological Analyser
  • Morphological Generator
• Text Editor
• Spell Checker
• Document Visualization Tool
• Utilities
  • Code Conversion Utility
  • Tamil Typing Utility
2.1 Language Processing Tools
Tamil, like many other Indian languages, is a morphologically rich language. Words are formed by agglutination: each word consists of a root word and one or more suffixes. Hence the morphological analysis and morphological generation of a word are primary and challenging tasks, and the core of the language processing tools developed are the Morphological Analyser and the Morphological Generator. They are used to build other language processing tools and products, like the parser, translation system, spell checker, search engine, dictionary etc.
• Morphological Analyser
The morphological analyser tool takes a derived word as input and separates it into the root word and the corresponding suffixes. The function of each suffix is indicated. It has two major modules – a noun analyser and a verb analyser. A rule-based approach is used, with approximately 125 rules. Heuristic rules are used to deal with ambiguities. A dictionary of 20,000 root words is used, and the design of the analyser minimizes the number of times the dictionary is accessed. The steps involved are as follows:
1. Given an input string, it starts scanning the string from right to left to look for suffixes. A list of suffixes is maintained.
2. It searches for the longest match in the suffix list.
3. It then removes the last suffix, determines its tag and adds it to the word's suffix list.
4. It checks the remaining part of the word in the dictionary and exits if the entry is found.
5. According to the identified suffix, it generates the next possible suffix list.
6. The process repeats from step 2 with the current suffix list.
Screen shot of Morphological Analyser
Status : Currently in the testing phase. A preliminary version with over 70% accuracy is available. It is being enhanced by adding rules to improve the accuracy.
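The six steps listed for the analyser amount to a longest-match suffix-stripping loop. A compact sketch, with a toy romanized suffix table and lexicon standing in for the ~125 rules and 20,000-root dictionary, and with step 5 (regenerating the admissible suffix list after each strip) simplified away:

```python
# Longest-match suffix stripping, following the analyser's steps 1-6
# (toy data; step 5's suffix-list regeneration is omitted for brevity).
LEXICON = {"maram": "noun"}                 # toy root-word dictionary
SUFFIX_TAGS = {"kal": "PLU", "il": "LOC"}   # toy suffix -> tag table

def analyse(word):
    suffixes = []
    while word not in LEXICON:                      # step 4: stop on a root
        matches = [s for s in SUFFIX_TAGS if word.endswith(s)]
        if not matches:
            return None                             # unanalysable word
        best = max(matches, key=len)                # step 2: longest match
        suffixes.append((best, SUFFIX_TAGS[best]))  # step 3: tag the suffix
        word = word[: -len(best)]                   # strip, repeat (step 6)
    return word, suffixes

print(analyse("maramkalil"))  # -> ('maram', [('il', 'LOC'), ('kal', 'PLU')])
```

Note how the dictionary is consulted only once per stripped suffix (at the loop head), reflecting the design goal of minimizing dictionary accesses.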
• Morphological Generator
The morphological generator performs the inverse process of the morphological analyser: it generates words when Tamil morphs are given as input. It also has two major modules - a noun generator and a verb generator. In the noun section, a root noun, plural marker, oblique form, case marker and postpositions are given as inputs. In the verb section, a root verb, tense markers, relative participle suffix, verbal participle suffix, auxiliary verbs, and number, person and gender markers are given as inputs. Up to four auxiliary verbs can be added to the main verb at a time. In addition to these, adjectives and adverbs are also handled.
It generates the word using the internal sandhi rules and the given inputs. It uses 125 morphological rules. For a given verb, with all possible combinations of tense markers and person-number-gender markers, nearly 30 forms can be generated. With the auxiliary verbs added to the main verb, about 200 verb forms can be generated. In the same way, for a given noun, with all possible combinations of case markers and postpositions, nearly 68 forms can be generated.
Screen shot of Morphological Generator
Using this generator, an interactive sentence generator has been designed. Given the root noun and the associated number, case, adjective, postposition, etc., and the root verb and the corresponding auxiliary, tense, adverb, postposition, etc., this tool will generate morphologically and syntactically correct sentences. In order to generate a simple sentence, at least a subject, verb and the tense should be specified.
Status : Preliminary version is available. It is being
enhanced by adding rules to take care of special cases.
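Generation is concatenation of morphs with sandhi rules applied at each join. The sketch below uses a single illustrative rule in romanized form (a stem-final m assimilating before a k-initial suffix); it is a stand-in for the 125 rules the real generator applies:

```python
# Toy generator: concatenate root + markers, applying one illustrative
# sandhi rule at each join (the real generator uses ~125 rules).
def sandhi(stem, suffix):
    # illustrative rule only: stem-final 'm' assimilates to 'ng'
    # before a suffix beginning with 'k'
    if stem.endswith("m") and suffix.startswith("k"):
        return stem[:-1] + "ng" + suffix
    return stem + suffix

def generate(root, *morphs):
    word = root
    for m in morphs:
        word = sandhi(word, m)
    return word

print(generate("maram", "kal"))  # -> marangkal
```

Enumerating all admissible marker combinations through such a generator is what yields the ~30 basic verb forms and ~68 noun forms cited above.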
2.2 Text Editor
The Tamil text editor is aimed at novice Tamil users. It is platform-independent (works on both Windows and Linux machines). It provides basic facilities for word processing, both in Tamil and English. It doesn't have any special file format; files in both rich text and text-only formats can be created or edited.
Status : Available for use. Can be downloaded from
the web site.
2.3 Spell Checker
The Tamil Spell Checker is a tool used to check the spelling of Tamil words. It provides possible suggestions for wrong words. The spell checker is based on a dictionary of Tamil root words; at present the size of the dictionary is 20,000 words.
The spell checker uses the morphological analyser to split the given Tamil word into the root word and a set of suffixes. If the word is fully split by the morphological analyser, the given word is assumed to be correct. Otherwise it goes into the correction process. The following types of errors are handled:
1. Errors in case endings and PNG markers: if there are any errors in case endings like il, aal, iTam, uTan, uTaiya, aan, aaL, arkaL, etc., they are corrected and given to the suggestion generation phase.
2. Errors due to similar-sounding characters (e.g. manam, maNam, palam, paLam, pazham, etc.): if any letter in the erroneous word can have similar sounds, it is replaced with the equivalent letters and the result is checked against the dictionary.
3. Adjacent key errors: adjacent keys are based on the Tamil 99 standard keyboard. Each letter is replaced with adjacent keys and the result is checked against the dictionary.
After the correction phase, all the corrected case endings, PNG markers and root words are given to the suggestion generation process. The spell checker makes use of the morphological generator to generate all possible suggestions. The user can select a suggestion from the list, ignore the suggestions, or add the particular word to the dictionary.
Screen shot of Tamil Spell Checker
Status : Version 1.0 is available. Will be enhanced
by increasing the root-word dictionary size to 35000.
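Error types 2 and 3 above are both single-character substitution searches against different tables. A sketch on romanized toy data (the similar-sound and adjacent-key tables here are invented miniatures, not the real Tamil 99 layout):

```python
# Candidate generation for similar-sound and adjacent-key errors
# (toy romanized data; the real checker works on Tamil letters).
DICTIONARY = {"palam", "pazham", "manam"}
SIMILAR = {"l": ["L", "zh"], "L": ["l", "zh"],
           "n": ["N"], "N": ["n"]}              # similar-sounding letters
ADJACENT = {"q": ["w", "a"]}                    # adjacent keys (toy layout)

def candidates(word, table):
    """All words obtained by one substitution from the given table."""
    out = set()
    for i, ch in enumerate(word):
        for alt in table.get(ch, []):
            out.add(word[:i] + alt + word[i + 1:])
    return out

def suggest(word):
    if word in DICTIONARY:
        return [word]
    pool = candidates(word, SIMILAR) | candidates(word, ADJACENT)
    return sorted(w for w in pool if w in DICTIONARY)

print(suggest("paLam"))  # -> ['palam', 'pazham']
```

In the real checker the surviving roots would then be fed to the morphological generator to produce fully inflected suggestions.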
2.4 Document Visualization
Document visualization aims at visually presenting abstracted information extracted from raw data files. Instead of forcing a user to read a whole document just to grasp certain information, document visualization offers important clues about the document. The tool is capable of visualizing any document.
The tool has various control points for the user to specify his options. It takes any tagged document as input, extracts the tag information and uses it for visualization. The various views presented are the scatter plot view, to show relative tag positions in a two-dimensional plane; the document zoom view, to show thumbnail views of pages; the statistical information view, to show statistics about the input document; and the synchronous depth view. These views together make up the entire tool.
Scatter Plot view : Here the document is mapped onto a graph. This view gives the relative position of each tag on a particular page, normalizing the tag position according to the display. The positions of the tags are marked with circles of predefined size and colour, whose shapes distinguish nested and normal tags. Apart from displaying the tags, it gives the relevant information about a particular tag in a separate screen when the user clicks on it and, if the selected tag is of nested type, it also drops a line from the starting position to the end of the tag.
Zoom view : The document zoom view prepares an intermediate file, which has a proper alignment of a certain number of lines per page. Each page is shown as a small thumbnail, and the user can click on the appropriate page to view it. Search can be done in the zoom view.
Statistical Information view : This view presents the frequencies of the tags identified in the input file in bar chart form. It is used to show the legend of the tags present in the document.
Synchronous Depth view : This view projects the input document onto a 3-dimensional plane. The whole document is considered to be an object in the 3-dimensional plane. This view helps to get a picture of how the tags spread over the document.
Display of Zoom and Scatter Plot View
Status : A demo version is available. Currently being enhanced.
2.5 Utilities
In order to facilitate language processing in Tamil, a set of common utilities has been developed. These include the Code Conversion utility and the Tamil Typing utility.
• Code Conversion Utility
There is a plethora of fonts in use on various Tamil websites, hence a generic code conversion utility has been developed. Given a source and a target coding scheme, this utility will generate the code for the conversion. This code can be integrated as part of any language processing tool or application. Existing standard coding schemes are already built in; any other coding scheme has to be explicitly specified.
Status : Available for use.
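At its core, such a conversion between two font encodings is a table-driven remapping: for each glyph, look up its code in the source scheme and emit the code the target scheme assigns to the same glyph. A sketch with invented code points (real Tamil font encodings assign different, scheme-specific bytes):

```python
# Table-driven font-encoding conversion (toy code points for illustration;
# actual Tamil font schemes assign their own bytes to each glyph).
SOURCE_SCHEME = {"ka": 0xB1, "nga": 0xB2}   # glyph name -> code in source
TARGET_SCHEME = {"ka": 0xE0, "nga": 0xE1}   # glyph name -> code in target

def build_converter(src, dst):
    # join the two schemes on the glyph name to get code -> code
    table = {code: dst[name] for name, code in src.items() if name in dst}
    def convert(codes):
        return [table.get(c, c) for c in codes]  # pass unknown codes through
    return convert

convert = build_converter(SOURCE_SCHEME, TARGET_SCHEME)
print(convert([0xB1, 0xB2, 0x20]))  # -> [224, 225, 32]
```

Generating such a converter from a declared pair of schemes, rather than hand-writing one per font pair, is what makes the utility generic.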
• Tamil Typing Utility
This is an API which indirectly serves as a keyboard driver. It allows the user to type Tamil text both in transliterated mode and using the Tamilnet 99 standard keyboard. It can be included as part of any language processing application that requires Tamil input.
Status : Available for use.
3. Translation Support Systems
The following components have been developed to aid machine translation:
• Tamil Parser
• Universal Networking Language (UNL) for Tamil
• Heuristic Rule Based Automatic Tagger
3.1 Tamil Parser
The Tamil parser identifies the syntactic constituents of a Tamil sentence and outputs the parse tree in list form. The free word order of Tamil makes parsing a challenging task, since there is a need to associate and link syntactic components that are not always adjacent to each other. This is done by associating positional information with the words. The parser tackles both simple sentences, having a verb, a number of noun phrases, and simple adverbs and adjectives, and complex sentences with multiple adjectival, adverbial, and noun clausal forms.
The processing uses a phrase-structure grammar together with a limited amount of look-ahead to handle free word order. The words of the sentence are first analyzed using the morphological analyzer, and the root word, along with the part-of-speech tag and suffix information obtained, is input to the parser. However, the analyzer does not unambiguously give the part-of-speech information for all words. In these cases, linguistically based heuristic rules are used to obtain the associated tags. Currently, 15 heuristic rules are used. For sentences with multiple clauses, the parser first identifies the cue words of the clauses and syntactically groups the clauses based on the cue words and phrases.
Screen shot of Tamil Parser
Status : At present it can handle simple sentences, and
complex sentences with a noun clause, with multiple
adjective clauses, with multiple adverb clauses, and
both multiple adjective and adverb clauses.
3.2 Universal Networking Language (UNL) for Tamil
The Universal Networking Language (UNL) is an intermediate representation for interlingua-based machine translation. It is an intermediate language based on a semantic network, used by machines to express and exchange natural language information.
UNL provides a sentence-by-sentence representation of the semantic content. Sentence information is represented as a hyper-graph having Universal Words (UWs) as nodes and binary relations as arcs. Binary relations are the building blocks of UNL sentences; they are made up of a relation and two UWs. Each relation is labeled with one of the possible label descriptors. Relations that link UWs are labeled with semantic roles such as agent, object, experiencer, time, place and cause, which characterize the relationships between the concepts participating in the events. This project focuses on developing the Tamil EnConverter (Tamil to UNL format) and the Tamil DeConverter (UNL format to Tamil). Both the EnConverter and the DeConverter use language-dependent rules and a language dictionary that contains Universal Words (UWs) for identifying concepts, word headings and the syntactic behaviour of words.
An EnConverter is software that automatically or interactively enconverts natural language text into UNL. The EnConverter is a language-dependent parser that provides a synchronous framework for morphological, syntactic and semantic analysis. The EnConverter uses the following sequence of operations:
• Convert or load the rules
• Input a Tamil sentence
• Apply the rules and retrieve the word dictionary
• Output the UNL expressions
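As an illustration of the binary relations that make up a UNL expression, the sketch below hand-builds the relations for a simple sentence and renders them in UNL syntax. The relation labels (agt, obj) are standard UNL descriptors, but the Universal Words and the sentence itself are invented examples, not output of the centre's actual EnConverter.

```python
# A minimal sketch of UNL binary relations as (relation, uw1, uw2) triples.
# The sentence "Raman eats rice" is represented (here, by hand) as two
# binary relations linking Universal Words.

def format_unl(relations):
    """Render a list of (relation, uw1, uw2) triples as UNL expressions."""
    return [f"{rel}({uw1}, {uw2})" for rel, uw1, uw2 in relations]

# agent and object relations for "Raman eats rice"
relations = [
    ("agt", "eat(icl>do).@entry", "Raman(iof>person)"),
    ("obj", "eat(icl>do).@entry", "rice(icl>food)"),
]

expressions = format_unl(relations)
for expr in expressions:
    print(expr)
```

Each rendered line pairs one relation label with the two UWs it connects; a full EnConverter would produce such triples by rule application rather than by hand.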
The EnConverter analyses a sentence using the word dictionary, knowledge base, and enconversion rules. It first uses the morphological analyser to analyze the given word. It retrieves relevant dictionary entries from the language dictionary, operates on nodes in the Node-list by applying enconversion rules, and generates semantic networks of UNL by consulting the knowledge base.
A DeConverter is software that automatically or interactively deconverts UNL into natural language text. The Tamil DeConverter uses the morphological generator to generate the Tamil sentence.
The information needed to construct the UNL structure and to generate a Tamil sentence from the UNL structure is available at different linguistic levels. Tamil, being a morphologically rich language, allows a large amount of information, including syntactic categorization and thematic case relations, to be extracted at the morphological level. The endings of Tamil words, with syntactic functional grouping and semantic information, are used to identify the correct binary relation. For syntactic functional grouping, information relating concepts is needed: verbs to thematic cases, adjectival components to nouns, and adverbial components to verbs. Syntactic functional grouping has been done by the specially designed parser, taking into consideration the requirements of the UNL structure.
Status : At present, all the relations have been identified and simple sentences can be handled.
3.3 Heuristic Rule Based Automatic Tagger
This tagger automatically tags the words of a given document with part-of-speech tags. It uses linguistically oriented heuristic rules for tagging. It does not use the dictionary or the Morphological Analyser.
The part-of-speech (POS) tagger uses heuristic rules to find the nouns, verbs, adjectives, adverbs and postpositions. This is done by checking the appropriate morphemes of the words. Clitics are removed from a word before it goes through rule verification.
Some of the heuristic rules used are as follows.
1. If a word contains PNG markers and tense markers, then it is tagged as a verb.
2. If a word contains a case marker, then it is tagged as a noun.
3. Standalone words: Some commonly used words are categorized and stored in lists. If a word belongs to a particular list, then the word is tagged with that category.
4. Fill-in rule: If an unknown word comes in between two nouns, then it is tagged as a noun.
5. Verb terminating: If an unknown word comes at the end of a sentence, then it is tagged as a verb. This is due to the fact that Tamil sentences normally end with a verb.
6. Bigram: An unknown word is identified using the category of the previous word.
Status : About 83% accuracy is obtained when tested
on a sample corpus.
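The rule scheme above can be sketched as a small function. The suffix lists below are hypothetical romanized stand-ins for the actual Tamil PNG/tense and case markers, and only rules 1, 2 and 4 are shown; the real tagger's 15-odd rules and marker inventories differ.

```python
# A simplified sketch of the heuristic POS-tagging scheme described above.
# The suffix lists are invented romanized examples, not the tagger's data.

TENSE_PNG_SUFFIXES = ("kiraan", "kiraal", "thaan", "thaal")  # illustrative
CASE_SUFFIXES = ("ai", "ukku", "il", "udan")                 # illustrative

def tag_word(word, prev_tag=None, next_tag=None):
    # Rule 1: PNG and tense markers present -> verb
    if word.endswith(TENSE_PNG_SUFFIXES):
        return "verb"
    # Rule 2: case marker present -> noun
    if word.endswith(CASE_SUFFIXES):
        return "noun"
    # Rule 4 (fill-in): unknown word between two nouns -> noun
    if prev_tag == "noun" and next_tag == "noun":
        return "noun"
    return "unknown"
```

Running the rules in a fixed order matters: a morpheme check (rules 1-2) is tried before the contextual fill-in rule, mirroring the morpheme-first strategy described above.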
4. Human Machine Interface Systems
Two packages that serve as human-machine interfaces have been developed. They are:
• Text To Speech (Ethiroli)
• Poonguzhali (A chatterbot)
4.1 Text To Speech (Ethiroli)
Ethiroli is a Text-to-Speech engine for Tamil. The engine has been designed so that it can be plugged into any application requiring a text-to-speech component. When Tamil text is given as input, it is pre-processed and transliterated, using linguistic rules, into a phonetic representation in order to remove homographic ambiguities. Homographic ambiguity arises in Tamil since the same character can have different sounds depending on the position in which it occurs. For example, the vallinam letters (க, ச, ட, த, ப, ற) in Tamil have more than one sound based on their position of occurrence within a word. Each word in this transliterated text is then split into its corresponding phonemes using the CVCC model. Once the phonemes are identified, the sound files corresponding to the phonemes are concatenated and played in sequence to give the speech output. Currently, 4000 syllables are used to generate Tamil speech using this method.
Status : A preliminary version is available. Efforts
are on to improve the quality of speech produced.
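The concatenation step above can be sketched as a lookup from phonemes to recorded sound units played in order. The phoneme names and file paths below are hypothetical; the engine's actual 4000-syllable inventory and playback mechanism are not shown.

```python
# Sketch of the concatenative step: each phoneme of a word is mapped to
# its recorded sound unit, and the units are queued for playback in
# order. Unit names and file paths are invented examples.

SOUND_UNITS = {
    "ka": "units/ka.wav",
    "l": "units/l.wav",
    "vi": "units/vi.wav",
}

def build_playlist(phonemes):
    """Return the ordered list of sound files to concatenate and play."""
    return [SOUND_UNITS[p] for p in phonemes if p in SOUND_UNITS]

playlist = build_playlist(["ka", "l", "vi"])  # e.g. for a word like "kalvi"
```

A real engine would then stream these files back-to-back; smoothing the joins between units is what the quality-improvement effort mentioned in the status note addresses.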
4.2 Poonguzhali (Chatterbot)
Poonguzhali is a generic Tamil chatterbot that allows the user to chat with the system on any technical topic. A chatterbot is an AI program that simulates human conversation and allows for natural language communication between man and machine.
A question or a statement is taken as input from the user. The function of the system is to generate an appropriate response based on the context of the input. The user can choose any existing topic for conversation and ask questions to the system in Tamil. Provision for input in transliterated form is available. The system identifies the minimal context of the input, that is, what the user is trying to ask, no matter in what way the user frames the question. This is done using a set of decomposition rules. The response is then formed using a set of reassembly rules that reside in the knowledge base. This response is then reframed to match the way in which the user had framed the question. The system is also versatile enough to initiate a conversation based on the earlier sequence of the conversation when there is no input from the user.
The software offers the facility of an optional alternate explanation, in case the user does not understand the earlier answer given by the system. Pronoun references, like the pronoun “idhu” in a second question related to the first one, are identified. Yes/No type questions and compound technical terms are also handled.
Currently there are two domains available. Any number of domains can be added. For this purpose, the tool provides a separate interface for adding domains and for updating the knowledge base. This has three separate options: for entering the knowledge base, non-technical words, and technical terms. The input key is entered with its response and alternate response in the knowledge base. The input key can be more than one word, with the ‘+’ separator.
Screen shot of Poonguzhali (Chatterbot)
Status : A version with knowledge base for two
domains is available.
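The decomposition/reassembly cycle described above can be sketched in the ELIZA style. The knowledge-base entry below uses the ‘+’ separator for a multi-word input key as described, but the key, responses and matching logic are invented illustrations, not Poonguzhali's actual rules.

```python
# A minimal sketch of the decomposition/reassembly cycle. Multi-word
# input keys use the '+' separator, as in the knowledge-base interface;
# the entry and rules here are invented examples.

KNOWLEDGE_BASE = {
    "operating+system": (
        "An operating system manages hardware and software resources.",
        "Think of it as the layer between programs and the machine.",
    ),
}

def decompose(user_input):
    """Reduce the input to its minimal context: a matching key, if any."""
    words = user_input.lower().replace("?", "").split()
    for key in KNOWLEDGE_BASE:
        if all(w in words for w in key.split("+")):
            return key
    return None

def respond(user_input, alternate=False):
    """Reassemble the response (or its alternate explanation) for the key."""
    key = decompose(user_input)
    if key is None:
        return "Could you rephrase the question?"
    primary, alt = KNOWLEDGE_BASE[key]
    return alt if alternate else primary
```

The `alternate` flag mirrors the optional alternate-explanation facility: the same key yields a second, differently worded answer when the first is not understood.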
5. Localization
A complete office suite for processing Tamil information has been developed. This works on both Windows and Linux platforms. A search engine to search Tamil websites has also been developed.
• Tamil Office Suite
• Tamil Word Processor
• Presentation Tool for Tamil
• Tamil Database
• Tamil Spreadsheet
• Tamil Search Engine
5.1 Tamil Office Suite (Aluval Pezhai)
The Tamil office suite consists of four text processing applications, namely:
• Tamil word processor
• Presentation tool for Tamil
• Tamil Database
• Tamil Spreadsheet
This office suite is specifically designed for Tamil users. However, it can be easily localized for any other language.
• Tamil Word Processor (Palagai)
The Tamil word processor is aimed at Tamil users. It provides basic facilities for word processing both in Tamil and English. There are two versions of the word processor available: one with the spell-checker and grammar-checker, and another without them. Both versions include an e-mail facility. The grammar-checker checks for person, gender and number compatibility between the subject and the main verb of the sentence.
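The agreement check can be sketched as a comparison of person-gender-number (PNG) features between the subject and the main verb's ending. The romanized pronouns and suffix tables below are hypothetical illustrations, not the checker's actual morphological data.

```python
# Sketch of the person-gender-number (PNG) agreement check performed by
# the grammar-checker. The feature tables are invented romanized examples.

SUBJECT_PNG = {
    "avan": ("3", "masc", "sg"),    # "he"
    "aval": ("3", "fem", "sg"),     # "she"
}
VERB_SUFFIX_PNG = {
    "aan": ("3", "masc", "sg"),
    "aal": ("3", "fem", "sg"),
}

def png_agrees(subject, verb):
    """True if the subject's PNG features match the main verb's suffix."""
    subj = SUBJECT_PNG.get(subject)
    for suffix, feats in VERB_SUFFIX_PNG.items():
        if verb.endswith(suffix):
            return subj == feats
    return False

# "avan vanthaan" (he came) agrees; "avan vanthaal" does not.
```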
The word processor does not have any special file format. Files in both rich text and text-only formats can be created or edited. HTML files can be viewed with this word processor. E-mails can be sent using this word processor if the current system is connected to an SMTP server. E-mail attachments are also handled.
Status : Version 1.0 is available for use.
• Tamil Presentation Tool (Arangam)
Arangam is a presentation tool for Tamil. It organizes a presentation with slides consisting of text, pictures and images. The presentation is saved in a proprietary format.
Once a file is created, one can add, remove or edit slides with common operations like adding text boxes, pictures and general shapes. There are three modes of operation:
Edit Mode (Slide)
Preview Mode (Slide Sorter)
Presentation Mode (Slide Show)
The edit mode allows working on one selected slide. Preview mode gives a preview of a number of slides (2x2, 3x3 or 4x4 slides per page). Presentation mode is the regular slideshow.
Screen shot of Tamil Presentation Tool (Arangam)
Status : Version 1.0 is available for use.
• Tamil Database (ThaenKoodu)
The database tool helps a user to store Tamil data and also provides various means of retrieving it using queries, forms and reports. The help document for this tool is provided in bilingual format (both Tamil and English). All the data are stored in MS-Access format.
This tool is divided into five major modules: Database, Table, Query, Forms and Reports. In the Database module, a new database can be created or existing databases can be edited. The Table module allows the user to create tables in two different ways. One is the design mode, where the table name, field names, data types and constraints (null, primary key and so on) are entered. The other is the Table creation wizard, where sample table names along with their fields are provided to create the table. Tables can be modified and the structure of a table can also be altered. New fields can be added to an existing table at any time.
The Query module handles queries in three different modes. In the first, the table name and field name are entered to query the table. The second mode uses a wizard, where the required table name and the corresponding field names can be selected. In the third, a Tamil query can be entered in the space provided to retrieve the result.
The Forms module generates forms. The forms can be built with the help of a wizard and can also be saved for later use. The Reports module allows the user to generate reports. A report can also be saved and printed.
Screen shot of Tamil Database (ThaenKoodu)
Status : Version 1.0 is available for use.
• Tamil SpreadSheet (Chathurangam)
The Tamil Spreadsheet allows one to easily enter or edit data in the desired format, save data, and view it using various charts. Mathematical expressions are also handled. The data is saved in a proprietary format (“.tss”, Tamil spreadsheet format).
It consists of three modules, viz. the Worksheet module, the Expression module and the Chart module. The default number of worksheets available is three; more worksheets can be added, and worksheets can be deleted or renamed. The Worksheet module allows the user to enter data in the required format. Basic operations like cut, copy, paste, find, replace, delete sheet, print sheet, insert row/column and delete row/column are available in this tool.
The Expression module evaluates expressions. Addition (Sum), Count, Maximum, Minimum and Average are the functions supported in this module.
The Chart module allows the user to draw charts, which are visually appealing and make it easy for users to see comparisons, patterns, and trends in data. This module represents the data in the form of pie, horizontal bar and vertical bar charts. A chart can be saved in JPEG (.jpg) file format. The chart title and the scale used to draw the charts can also be customized.
Status : Version 1.0 is available for use.
5.2 Tamil Search Engine (Bavani)
A search engine is a software package that searches for documents on the Internet dealing with a particular topic. The Tamil Search Engine aims at looking up Tamil sites for information sought by a user. It searches for Tamil words in Tamil web sites available in popular font encoding schemes. Currently 22 font-encoding schemes are supported. An important feature of the search engine is the integration of the morphological analyser, so that keywords are always reduced to their root form.
The system gathers information from the Internet by the process of crawling. The information gathered is stored in a database. An interface is provided for the user to enter a query. The query is analyzed against the information in the database, and the information that matches the query is returned to the user.
The following are the important modules of the search engine:
Crawler: This is the kernel of the design. Starting from a given URL and domain, a document is downloaded and parsed. The crawler then retrieves the URLs from that document and inserts them into the URL tree. After parsing each document, the next URL in the tree is taken up for further crawling.
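The crawling loop described above can be sketched as a queue-driven traversal. The fetch/link-extraction step is injected as a function so the skeleton stays self-contained; a real crawler would download and parse HTML pages here, and the toy "web" below is an invented example.

```python
# A queue-based sketch of the crawling loop. The link-extraction
# function is injected; a real crawler downloads and parses HTML here.

from collections import deque

def crawl(start_url, fetch_links, limit=100):
    """Breadth-first crawl: download, parse, queue newly found URLs."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)                 # download + parse happens here
        for link in fetch_links(url):       # URLs retrieved from the page
            if link not in seen:            # insert into the URL tree
                seen.add(link)
                queue.append(link)
    return visited

# Usage with a toy in-memory "web":
web = {"a": ["b", "c"], "b": ["c"], "c": []}
order = crawl("a", lambda u: web.get(u, []))
```

The `seen` set stands in for the URL tree's duplicate check; without it, pages linking to each other would be crawled forever.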
Database: A database is maintained to store each word and its corresponding appearances in all the documents. These words are stored as a B-tree for efficient indexing. The position of a word within various HTML tags is maintained and used for ranking analysis.
Searcher: The Searcher accepts queries from the user and splits them into words. The given words are searched in the B-tree, and the results are sent back to the user after ranking. The ranking is performed in the order of the number of outgoing links, number of occurrences, description (meta), head and title. The results are displayed in the order of AND, OR operations: the document which contains the maximum number of query words gets a higher rank.
Status : Available for use.
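The "more matched query words rank higher" rule can be sketched with a toy inverted index. The index structure below is a simplified stand-in for the B-tree, and the words and document ids are invented; the engine's full ranking also weighs links, occurrence counts and HTML tag positions, which are omitted here.

```python
# Sketch of the query-word ranking rule: documents matching more of the
# query words (AND matches before partial OR matches) rank higher.
# The dict-of-lists index is a simplified stand-in for the B-tree.

def rank(query_words, index):
    """Order document ids by the number of query words they contain."""
    scores = {}
    for word in query_words:
        for doc in index.get(word, []):
            scores[doc] = scores.get(doc, 0) + 1
    # more matched query words -> higher rank
    return sorted(scores, key=lambda d: scores[d], reverse=True)

index = {"thamizh": ["d1", "d2"], "ilakkiyam": ["d1"]}
ranked = rank(["thamizh", "ilakkiyam"], index)
```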
6. Language Technology Human Resource Development
Many activities were undertaken to create awareness and develop human resources in the area of Language Technology. The activities can be grouped as follows:
• Creating awareness among the general public
• Language Technology Training
• Co-ordination with linguistic and subject experts
• Co-ordination with Government and industry
6.1 Creating Awareness Among The General Public
One of the priorities of the center has been to create awareness among the public about the need for content development. Working towards this goal, we targeted school and college teachers and students, and the following workshops have been conducted:
• Workshop on Multimedia Tools for Effective Teaching: Conducted a two-day workshop on Multimedia Tools for Effective Teaching for college and school teachers in Feb 2001 – 100 teachers attended the workshop (200 person-days).
• Workshop on Multimedia Tools for content creators: Conducted a ten-day hands-on workshop on Multimedia Tools for content creators in May 2001 – 13 teachers attended the workshop (130 person-days).
• Effective Teaching through computers
A presentation was given by RCILTS-Tamil members at S.B.O.A. School, Anna Nagar, Chennai. 20 teachers attended.
A presentation was given by RCILTS-Tamil members at the Teachers Training School, Saidapet, Chennai. 100 teacher trainees attended.
• WWW-2001
Organized a creative writing competition for school children on Indian cultural aspects, in Tamil and English, titled “WWW 2001, Weave your Words to the Web”, in Sep 2001, to create awareness about content development. 40 students participated.
• InSight
“InSight – Scientific Thought, the Indian way”, an Indian sciences appreciation course for school children, is being conducted every 6 months. (500 person-days)
6.2 Language Technology Training
Language technology training has been imparted in two ways. One is by conducting intensive workshops to encourage research and development in language technology, and the other is by carrying out development projects in-house.
• Workshop conducted:
• Organized a workshop titled “Corpus-based Natural Language Processing”, targeted at both computer scientists and linguists
• Pre-workshop tutorials: Dec 14–16, 2001
• Workshop: Dec 17, 2001 – Jan 2, 2002
• No. of participants: 40
• Contents: Paninian Grammar, Dependency Grammar, Machine Translation, Statistical NLP, TAG grammar, etc.
• In-house Training
More than 15 project associates have been actively working in the area of language technology for the past three years and have gained expertise in this area. Specifically, they are well versed in computational morphology, syntax, lexical formats, and font and coding issues.
A lot of enthusiasm and interest has been created among students to work in the area of language technology. Over the last three years, about 30 major student projects have been carried out in language technology. The students are from both undergraduate (B.E. – 36 students – 12 batches) and postgraduate (M.E. and M.C.A. – 18 students) streams. Some of these projects have been further developed into products.
A number of research scholars are actively pursuing research in this area. Three M.S. (by research) scholars are working in the areas of language translation, speech processing, and information retrieval. Three Ph.D. scholars are working in the areas of information extraction, Indian logic based knowledge representation, and automatic software document generation.
6.3 Co-ordination with Linguistic and Subject Experts
The center has been coordinating with various experts and agencies, both for linguistic expertise and for content development. Numerous interactions have been organized for this purpose.
• Seminar conducted by RCILTS-Tamil members with S.B.O.A. School, Anna Nagar, Chennai, for content development in Tamil Literature.
• Seminar conducted by RCILTS-Tamil members with M.O.P. Vaishnav College, Chennai, for content development in Tamil Culture.
• Discussion with Tamil scholars for effective content development in classical Tamil literature.
• Seminar given by Dr. Subaiya Pillai and Dr. D. Ranganathan at Anna University on “Sandhi in Tamil”.
• A one-day workshop titled “Language and Computational Issues”, for interaction with Tamil linguists.
Date: Feb 15, 2002
No. of participants: 30
• Discussion with experts from C-DAC, ASR Melkote, CIIL Mysore, Annamalai University, and Madras University regarding the language processing tools developed.
• Demonstration of products developed at various forums.
6.4 Co-ordination with Government and Industry
The products developed have been demonstrated to industries working in the area of language technology. The outcome of this interaction is two-fold. One is the signing of memoranda of understanding with industries like Modular Infotech, Pune; Apple Soft, Bangalore; and Chennai Kavigal, Chennai, for products such as the spell checker, morphological analyser and generator, and text-to-speech engine. The other is the proactive interest shown by Kanithamizh Changam, a consortium of industries working on Tamil computing, to work in close coordination with the center.
The office suite will be made available free of cost to interested users, industries and Government organizations. It is proposed to publicize the availability of the products and tools through interaction with appropriate state government authorities. Core technology such as the morphological analyser, generator, and spell checker will be made available after signing of appropriate MOUs. The educational resources developed will be made available freely to government schools.
7. Standardization
All the software developed caters to both the prevailing Tamil Nadu Govt. standard (TAM/TAB font encoding scheme) and the ISCII standard. Testing for Unicode support is in progress. The center actively co-ordinates with the Tamil Nadu Govt. in defining the Tamil Unicode scheme. Available standards for transliteration have been adhered to wherever necessary.
All the development work has been carried out in Java, to provide cross-platform operability. All the software has been tested on both Windows and Linux platforms. Code, technical and user documentation have been done for all the products.
8. Research Activities
The following research activities have been going on in the areas of information retrieval from documents and knowledge representation:
• Text Summarization
• Latent Semantic Indexing
• Knowledge Representation System Based On Nyaya Shastra
8.1 Text Summarization
Document summarization plays a vital role in the use and management of information dissemination. This project investigates a method for the production of summaries from arbitrary Tamil text. The primary goal is to create an effective and efficient tool that is able to summarize large Tamil documents. The summarizer produces a meaningful extract of the original text document using the sentence extraction model. Lexical chains are employed to extract the significant sentences from the text without complete semantic analysis of the original document. Segmentation is done initially, which helps to maintain coherence in the generated summary.
The three important steps involved in summarization are: segmentation of the document, lexical chain creation, and significant sentence extraction for summary generation. The Tamil WordNet is used as the knowledge source for summarization; it helps to find the relations used to build the lexical chains. It is able to incorporate the relationships among words and at the same time account for the various senses of the words.
Segmentation: Text segmentation is a method for partitioning full-length text documents into multi-paragraph units that correspond to a sequence of sub-topical passages. This method is based on multi-paragraph segmentation of expository text. It has now been implemented for Tamil.
Lexical Chain Creation: Lexical chains are formed by disambiguating the senses of the words occurring in the original text and chaining the related terms together. Though repetition of words itself carries informational value, collocation of terms serves to recognize topics precisely. Ranking of the lexical chains and selecting the strong chains are done using the length and the homogeneity index (the number of distinct occurrences divided by the length).
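The chain-scoring step can be sketched numerically. The homogeneity index follows the definition above (distinct members divided by chain length); combining it with length as a product, and the threshold value, are one plausible reading, not necessarily the project's exact formula, and the chains below are invented English stand-ins for Tamil lexical chains.

```python
# Sketch of strong-chain selection: a chain is scored from its length
# and homogeneity index (distinct members / length). The product-and-
# threshold combination is an assumed scoring, and the chains are toys.

def homogeneity(chain):
    """Number of distinct occurrences divided by the chain length."""
    return len(set(chain)) / len(chain)

def strong_chains(chains, threshold=1.0):
    """Keep chains whose length * homogeneity exceeds the threshold."""
    return [c for c in chains if len(c) * homogeneity(c) > threshold]

chains = [
    ["match", "game", "score", "match"],   # 3 distinct of 4 members
    ["tree"],                              # a trivial one-word chain
]
selected = strong_chains(chains)
```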
Sentence Extraction: Once the chains have been selected, the next step of the summarization algorithm is to extract full sentences from the original text based on the chain distribution. There are methods which involve choosing the sentence that contains the first appearance of a chain member in the text. Depending on the requested summary ratio, the algorithm can dynamically change the block size and token-sequence size in the segmentation step, so that the number of segments recognized is either decreased or increased. The number of segments in turn determines the number of lexical chains, thus either increasing or decreasing the length of the resulting summary. For convenience, the percentage size of the document is represented in terms of the number of sentences.
Status : A demo version for the sports domain is available.
8.2 Latent Semantic Indexing
Latent Semantic Analysis is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. Latent Semantic Indexing (LSI) was chosen to index the documents because it uses the higher-order associations between the words in a passage or document. This information is further represented mathematically for easier manipulation. Tamil follows free word order, and Latent Semantic Indexing does not consider word order when retrieving the semantic space of the documents; hence the choice of LSI for indexing Tamil documents.
In LSI, the raw text is represented as a matrix in which each row stands for a unique word and each column stands for a text passage or other context. The matrix plays a vital role in LSI; it is considered to be the semantic space. Each cell contains the frequency with which the word of its row appears in the passage. Singular value decomposition is applied to the matrix.
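The matrix-and-SVD procedure above can be shown on a tiny numerical example: build the term-by-document frequency matrix, truncate the SVD, and compare documents in the reduced semantic space. The toy documents (and the choice of rank 2) are invented stand-ins for Tamil passages, intended only to show the mechanics.

```python
# A small numerical sketch of LSI: term-by-document frequency matrix,
# truncated SVD, and document comparison in the reduced space.

import numpy as np

# rows: six terms; columns: three toy documents (docs 1 and 2 share
# vocabulary, doc 3 uses entirely different terms)
A = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # keep the top-k singular values
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # document vectors in LSI space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_12 = cosine(doc_vecs[0], doc_vecs[1])  # documents sharing terms
sim_13 = cosine(doc_vecs[0], doc_vecs[2])  # documents with no overlap
```

In the reduced space, the two overlapping documents come out nearly identical while the unrelated one stays orthogonal, which is the higher-order association property the section relies on.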
The following applications have been built using the LSI technique:
Sentence Comparison: The similarity of multiple sentences in Tamil is assessed. A similarity score for each submitted sentence is computed and given to the user.
Essay Assessor: This application compares two essays. The essays are accepted as input, separated, and converted into vectors. The similarity between them is then computed and given as a score to the user.
Document Retrieval: Relevant documents are retrieved using Latent Semantic Indexing. In this application, the query words are the input; the relevant documents are retrieved based on the query, and the results are ranked and presented to the user.
Status : A demo version is available.
8.3 Knowledge Representation System Based On Nyaya Shastra
World knowledge representation for the purpose of language understanding is a challenging issue because of the amount of knowledge to be considered and the required ease and efficiency of retrieval of appropriate knowledge. An important issue in the design of an ontology dealing with world knowledge is the philosophy on which the classification scheme is based. In this work, an ontology based on Nyaya Shastra, an Indian logic system, has been designed, and a knowledge representation system called KRIL has been implemented. The methodology is based on the notion of associating qualities and values with concepts and adding different kinds of negation, which brings a new perspective to the interpretation of knowledge. The knowledge hierarchy and the techniques to interpret the knowledge have been adapted from Nyaya Shastra. The association of qualities and values with concepts, together with new relationships based on these associations and new relations between the concepts themselves, adds a new dimension to the reasoning process.
To adapt the classification scheme and reasoning methodology of Nyaya Shastra, description logic has been extended by adding operators for concept definition with qualities and for new types of negation. Hence, the building blocks of this model are the concept, the inherent association of qualities, values, relationships between concepts, and relationships between concepts and qualities. The representation of knowledge at various levels includes time-variant and time-invariant components. The special feature of this model is the handling of prior and posterior absence of concepts through additional negation operators. This model can be used as a base for various applications like natural language processing, natural language understanding, machine translation, etc.
Screen shot of KRIL-system
Status : An implementation of this model to represent knowledge about the dairy domain has been completed.
9. Publications
1. G.V. Uma and T.V. Geetha (2001), “Generation of natural language text using perspective descriptors in frames”, IETE Journal of Research, special issue on Knowledge and Data Engineering, January-April 2001, Vol. 47, No. 1&2, pp. 43-56.
2. G.V. Uma and T.V. Geetha, “Softplan: a planner for automatic software documentation”, National Conference on Document Analysis and Recognition, Mandya, 2001, pp. 266-271.
3. G.V. Uma and T.V. Geetha, “Automatic software documentation using frames and causal link representation”, VIVEK Journal – A Quarterly in Artificial Intelligence, 2001, Vol. 14, No. 3, pp. 3-13.
4. D. Manjula and T.V. Geetha, “Message Optimization by Polling for Text Mining”, National Conference on Document Analysis and Recognition, Mandya, 13-14 July 2002, pp. 199-202.
5. D. Manjula, P. Malliga and T.V. Geetha, “Semantic based Text Mining”, First International Conference on Global WordNet, Mysore, 21-25 January 2002, pp. 266-270.
6. D. Manjula and T.V. Geetha, “Distributed semantic based text mining”, to appear in the proceedings of the Third International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, Bologna, Italy, 25-27 Sep 2002.
7. G. Aghila, Ranjani Parthasarathi and T.V. Geetha, “Design of conceptual ontology based on Nyaya theory and Tolkappiam”, Third International Conference on South Asian Languages (ICOSAL-3), Jan 4-6, University of Hyderabad, Hyderabad, 2000.
8. G. Aghila, Ranjani Parthasarathi and T.V. Geetha, “Indian Logic Based Conceptual Ontology Using Description Logics”, National Conference on Data Analysis and Recognition, Mandya, 2001, pp. 279-286.
9. Devi Poonguzhali, P. Kavitha Noel, N. Preeda Lakshmi, T.V. Geetha and A. Manavazhan, “Tamil Wordnet”, First International Conference on Global WordNet, 21-25 Jan 2002, pp. 65-71.
10. G.V. Uma and T.V. Geetha, “A knowledge based approach towards automatic software documentation”, Spring 2002, International Journal of Trends in Software Engineering Process Management, Malaysia.
11. G.V. Uma, N.S.M. Kanimozhi and T.V. Geetha, “Automatic software documentation in Tamil”, presented at the International Conference on South Asian Languages (ICOSAL-3), University of Hyderabad, 1999.
12. Madhan Karky V., Sudarshanan S., Thayagarajan R., T.V. Geetha, Ranjani Parthasarathy and Manoj Annadurai, “Tamil Voice Engine”, presented at Tamil Inayam, Malaysia, 2001.
13. P. Anandan, T.V. Geetha and Ranjani Parathasarathy, “Morphological Generator for Tamil”, presented at Tamil Inayam, Malaysia.
14. G.V. Uma and T.V. Geetha, "A knowledge based approach towards automatic software documentation", Spring 2002, International Journal of Trends in Software Engineering Process Management, Malaysia.
15. G.V. Uma, N.S.M. Kanimozhi and T.V. Geetha, "Automatic software documentation in Tamil", presented in the Third International Conference on South Asian Languages (ICOSAL-3), University of Hyderabad, 1999.
16. D. Manjula and T.V. Geetha, "Semantic based text mining", International Conference on Global WordNet, pp. 266-270, Mysore, Jan 21-25, 2002.
17. D. Manjula, A. Kannan and T.V. Geetha, "Semantic Information Extraction and Query Processing from the World Wide Web", KBCS 2002, Bombay, Dec 18-21, 2002.
18. D. Manjula and T.V. Geetha, "Message Optimization using polling for distributed text mining", National Conference on Analysis and Recognition, Mandya, 2001.
19. P. Malliga, D. Manjula and T.V. Geetha, "Boostexter for Tamil Document Categorization", International Conference on South Asian Languages.
20. Siva Gurusamy, D. Manjula and T.V. Geetha, "Text Mining in Request For Comments Document Series", LEC 2002 Language Engineering Conference, Dec 13-15, Hyderabad, 2002.
21. T. Dhanabalan, T.V. Geetha, "UNL EnConverter for Tamil", International Conference on South Asian Languages (ICOSAL-4), Dec 3-5, Annamalai University, Tamil Nadu, 2002.
22. T. Dhanabalan, K. Saravanan, T.V. Geetha, "Tamil to UNL EnConverter", International Conference on Universal Knowledge and Language (ICUKL), Nov 25-29, Goa, India.
23. G.S. Mahalakshmi, G. Aghila, T.V. Geetha, "Multi-level Ontology representation based on Indian Logic system", International Conference on South Asian Languages (ICOSAL-4), Dec 3-5, Annamalai University, Tamil Nadu, 2002.
24. P. Devi Poongulhali, N. Kavitha Noel, R. Preeda Lakshmi, Manavazhahan and T.V. Geetha, "Tamil Text Summarization", International Conference on Knowledge Based Computer Systems, Mumbai, India, December 18-21, 2002.
10. The Team Members
Dr. T. V. Geetha
Dr. Ranjani Parthasarathi
Dr. K. M. Mehata
Dr. Arul Sironmoney
Ms. V. Uma Maheswari
Ms. N. Anuradha
Ms. S. Chithrapoovizhi
Ms. J. Deepa devi
Mr. T. Dhanabalan
Mr. T. Dhinakaran
Ms. T. Kalaiyarasi
Mr. A. Manavazhahan
Mr. G. PalaniRajan
Mr. R. Purusothaman
Mr. K. Saravanan
Ms. M. Vinu krithiga
Mr. V. Venkateswaran
A. Kavitha
M. Kavitha
Courtesy: Dr. T.V. Geetha / Dr. Ranjani Parthasarathi
Anna University, School of Computer Science &
Engineering, Chennai - 600 025
(RCILTS for Tamil)
Tel: 00-91-44-22351723,22350397
Extn- 3342/3347, 24422620, 2423557
Centre for Development of Advanced Computing
Pune University Campus, Ganesh Khind, Pune-411007
Tel. : 00-91-20-5694000-2
Resource Centre For
Indian Language Technology Solutions – Urdu, Sindhi, Kashmiri
C-DAC, Pune
Achievements
RCILTS-Urdu, Sindhi & Kashmiri
Centre for Development of Advanced
Computing, Gist, Pune
"Digital Unite & Knowledge for All" - the vision of TDIL
MC&IT wishes to make a difference to the quality of life of the people of India through its TDIL mission, and is investing a significant amount of resources towards fulfilling this wish. One such effort was the establishment of "Resource Centres" for the design, development & deployment of tools & technologies for Indian languages.
"Dissolving the language barrier" - the vision of C-DAC
C-DAC has pioneered the Graphics and Intelligence based Script Technology (GIST), which facilitates the use of Indian languages in IT. GIST is geared for the Internet-enabled world, where all activities are gradually going online. In its endeavour to stay abreast of technologies worldwide, GIST has been adopting the latest concepts so as to stay tuned to the changing IT scenario. Through its continuous R&D efforts, C-DAC GIST has many innovative products to its credit and continues to be a leader in the language technology field.
MC&IT, through its TDIL Programme, has awarded C-DAC GIST the project "Resource Centre" for the development of tools & technologies for Perso-Arabic languages.
Overall Goal of the Resource Centre
To empower the people of India through the use of Information Technology solutions in Indian languages.
Project Purpose
To improve the quality of life of the people of India by enabling them to use information technology in Indian languages.
To develop new products and services for processing information in Indian languages.
To conduct research in computer processing of Indian languages.
Broad areas of work under the Perso-Arabic
Resource center
GIST workshops on the usage of GIST tools for organizations concerned with the development & deployment of Indian language processing systems.
Creation of a network of persons and organizations working on IT applications in the Perso-Arabic (PA) languages, namely Urdu, Sindhi & Kashmiri.
Development of language tools & applications in PA scripts.
Conduct of workshops to discuss standardization of PA script data & font standards.
Deployment of the developed technology, products & services through a country-wide network.
Services and Knowledge Bases
• Creation of a web site for language tools, solutions and knowledge available in PA languages.
• Imparting training for the usage of PA language applications on IT.
• Establishment of Translation & Subtitling services in PA languages.
• High quality PA fonts.
• Dictionary and Spell Checker tools in PA languages.
• Word processor package for PA languages.
• Transliteration tool between Hindi & Urdu.
• Prototype of a Pocket Translator with Urdu support.
• Products for Subtitling, Character Generator, Teleprompter and DVD in PA scripts.
Activities Undertaken Under The Project
A GIST workshop was conducted at C-DAC Pune from 17th to 22nd July 2000 on the usage of the GIST series of Indian language technologies, tools and products, for Resource Centres concerned with the development and deployment of Indian language solutions. A team of 23 members from 8 Resource Centres attended:
1. ER&DC, Trivandrum
2. Jawaharlal Nehru University, New Delhi
3. Indian Institute of Technology, Kanpur
4. Orissa Computer Application Centre, Bhubaneswar
5. Thapar Institute of Engg. and Technology, Patiala
6. Anna University, Chennai
7. M.S. University, Baroda
8. Indian Statistical Institute, Calcutta
Spread over five days with a total of 12 interactive sessions, the subjects covered were:
1. Building dictionary & Designing spell checker
2. Keyboard driver
3. GIST software development kit
4. Multilingual web tools
5. Script code, Font storage & Design.
During the workshop, training kits were distributed which contained work manuals on the following topics:
1. Designing Indian Language Spell Checkers,
2. Myths about Standards for Indian Languages,
3. Enabling Indian Languages on Web,
4. ISCII- Foundations & Future,
5. GIST Software Development Kit,
6. Using GIST SDK in VB,
7. Internet First Steps: Upgrading an Existing ActiveX Control,
8. Inscript Keyboard layout,
9. Typing in Phonetic Keyboard.
Two CDs on GIST s/w, together with brochures on various language products & a fonts typeface catalogue, were distributed:
1. GIST based S/ W Developer CD.
2. GIST Products & Channel Information.
A visit was organized to various departments of C-DAC: the GIST Laboratory, the National Multimedia Resource Centre, and the National PARAM Super Computing facility.
Evolving Storage, Font & Inputting Standards
The main task before C-DAC was to evolve standards for storage, fonts & inputting for the Perso-Arabic languages. Development of USCII (Urdu Standard for Information Interchange) was carried out well before the awarding of the project. GIST Terminals designed & developed by C-DAC are based on these standards. The support on the GIST terminals was limited to only the Naskh implementation of the fonts.
Accordingly, with a lot of effort and valuable inputs from experts all over, C-DAC drafted PASCII standards for storage, keyboard standards & font standards. These proposed standards have been published in the TDIL magazine VishwaBharat for comments / suggestions from experts all over India.
1. The PASCII standard - The Perso-Arabic
Standard for Information Interchange
There are some peculiarities that can be seen in
Perso-Arabic scripts:
These scripts join letters with each other, and therefore letters have different forms as per their position in a ligature (the positions being beginning, middle and ending).
Some shapes do not have a middle shape, i.e. they do not join at both ends, for example alif, vao, daal, etc. Every letter also has a standalone form.
1.1 Characteristics of Proposed Standard
It is an 8-bit standard. It supports letters for Urdu, Arabic, Sindhi and Kashmiri.
It defines the PA alphabets in the upper ASCII range, leaving lower ASCII free for English, thus enabling bilingual support.
It defines numerals other than the ASCII numbers (48 to 57). (This may help in supporting both Arabic numerals 0-9 and language-specific numerals.)
It maintains the order of the alphabets for Perso-Arabic languages. Khate-kasheed is given a lower value than the alphabets; this will sort words correctly even when they have khate-kasheed in between.
Alphabets for the different languages are placed in their ascending order. Letters like "bhey" are not provided for URDU but are kept for languages like SINDHI; URDU may make use of "be" and "choTi-hai" for that.
Minimal erabs are provided. Tanveen, for example do-zabar, can be formed with the help of two consecutive zabar.
Place for superscripts like khaRa-alif is provided, as is place for superscripts for ARABIC and for superscripts like "re-ze", "ain", etc.
Numerals are placed after the erabs and superscripts. (This is provided only to support display of language-specific numerals; the standard numerals, i.e. the ASCII numerals, are also available.)
Only a few control characters are defined.
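The sorting property claimed above can be illustrated with a small sketch. The code values below are invented for illustration only; they are not the actual PASCII assignments:

```python
# Invented code points (NOT the real PASCII values) assigned in collation order,
# with khate-kasheed given a lower value than the alphabets, as the proposed
# standard specifies.
ORDER = ["khate-kasheed", "alif", "be", "bhey", "pe", "te"]
CODE = {ch: 0xA0 + i for i, ch in enumerate(ORDER)}

def encode(word):
    """Encode a word (a list of letter names) as upper-half bytes."""
    return bytes(CODE[ch] for ch in word)

words = [["be", "alif"], ["alif", "khate-kasheed", "be"], ["alif", "be"]]
ordered = sorted(words, key=encode)  # a plain byte-wise sort gives dictionary order
```

Because khate-kasheed receives the lowest code, a byte-wise sort keeps an elongated spelling immediately next to its plain form instead of scattering it.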
1.2 Standardization of Urdu Keyboard
There is no standard keyboard available for the URDU language; the vendor-specific keyboards that are available all differ in their layouts. Many keyboards have been designed by different companies who provide Perso-Arabic support in their applications. Unfortunately, every company has its own keyboard layout that entirely differs from the others' keyboard layouts. For example, Microsoft has its own keyboard layouts for Perso-Arabic languages.
Drawbacks of Available Keyboards (in view of
URDU language)
No standard keyboard is available.
All vendor-specific keyboards differ in their layout.
Most of the keyboards have been designed for Arabic.
Although Arabic and Urdu have a similar script, they are different languages, and hence a keyboard designed for Arabic may not be that useful for the Urdu language. C-DAC has taken care of these language differences in designing the keyboards for URDU, SINDHI & KASHMIRI and has tried to come up with an optimum layout.
(Details of these standards can be found in TDIL
Vishwabharat magazine.)
1.3 Standardisation of Fonts
Characteristics of Perso-Arabic languages:
Urdu has traditionally been written in the Nastaleeq
script. Although the script employs the basic letters
of the language, the rendering of these letters in a
word is ext remely complex. The reason for t his
complexity is that Urdu text has traditionally been
composed through calligraphy, a medium whose
precept s are based on t he aest het ic sense of t he
calligrapher rather than on any formula. So great is
the variation in calligraphy that many times it is
difficult to recognize the letters in a constituent
word. This is because, in their calligraphed form,
the individual letters partially or completely fuse into
each other thereby losing their identity. A degree of
fusion is purposely introduced to make the resulting
fused glyph visually appealing.
Another characteristic of Urdu is the existence
of diacritics. Diacritics, although sparingly used, help
in the proper pronunciation of the constituent word.
The diacritics appear above or below a character to
define a vowel or emphasize a particular sound. They
are essential for the removal of ambiguities, natural language processing and speech synthesis.
A Nastaliq-style font was designed for printing and display. The font is based on the traditional Nastaliq style as written in India.
Arabic Font: a high quality Arabic font was designed to complement religious texts in Urdu.
Considerations for Digital Font Design
Following is a group-wise list of characters that
were taken into consideration while designing
fonts for the Perso-Arabic languages.
1. Alphabet.
2. Numerals.
3. Special Characters.
4. Diacritics.
5. Religious and linguistic Symbols.
6. Control characters.
The 16-bit Naskh & Nastaliq Font
The fonts developed by C-DAC are 16-bit and are also defined in the User Area of the Unicode range. The ASCII range is not used and can be put to other purposes (for example, English support). Each font covers:
• all the basic shapes.
• all the starting shapes and variations.
• all the middle shapes and variations.
• all the ending shapes and variations.
• levels for erabs (short vowels).
• complete ligatures.
• beginning ligatures.
• middle ligatures.
• ending ligatures.
• the missing-character glyph.
Glyph Standards decided for the fonts
Urdu & Kashmiri - Naskh & Nastaliq scripts.
Sindhi - Naskh script.
• Basic letters (Vowels, Consonants).
• Full shape letters.
• Beginning.
• Medial.
• Final.
• Ligatures.
• Graphic components.
• Numerals, Signs and Symbols, etc.
2. Script Core Technology & Rule Engine
Because of the complex nature of the Nastaliq script, which is written right to left and top to bottom diagonally, it was necessary to build certain databases and use them along with a rule engine to make display possible. Following are some of the tools which were developed to create these databases.
2.1 Glyph Property Editor
In Urdu and all other Perso-Arabic languages each character has many different shapes depending upon the position of that character: 1) stand-alone, 2) starting position, 3) middle position, 4) ending position, etc. So, according to the position of the character and the letters it joins with, the appropriate shape is to be displayed. This is an application that allows one to store the data and the compositional information for the various shapes in a font file, for proper alignment of characters for display. (Urdu requires a larger number of glyphs than other scripts.) This takes care of the horizontal and vertical movement for the Nastaliq script.
2.2 Rule Engine
A module containing rules for each shape. The rules indicate how and which shape will join with another shape or shapes, based on the succeeding and preceding shapes that occur in Nastaliq text composition. This facilitates more accurate and faster display of the shape composition in Nastaliq text.
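The kind of joining logic such a rule engine encodes can be sketched roughly as follows. The letter classes and names are illustrative assumptions, not C-DAC's actual rule database:

```python
# Letter classes are illustrative: dual-joining letters connect on both sides,
# while letters like alif, vao and daal never join to the following letter,
# as noted earlier in this section.
DUAL_JOINING = {"be", "te", "seen", "suad", "noon", "baRi-he"}

def joins_forward(letter):
    return letter in DUAL_JOINING

def shape_word(letters):
    """Choose the initial/medial/final/isolated form for each letter of a ligature."""
    forms = []
    for i, letter in enumerate(letters):
        joined_to_prev = i > 0 and joins_forward(letters[i - 1])
        joins_to_next = i < len(letters) - 1 and joins_forward(letter)
        if joined_to_prev and joins_to_next:
            form = "medial"
        elif joined_to_prev:
            form = "final"
        elif joins_to_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((letter, form))
    return forms
```

For a ligature such as suad + baRi-he + noon, this selects the initial, medial and final forms respectively.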
2.3 Testing version of Urdu Editor
A simple text editor is ready to test the Rule Engine and check the various rules formed for displaying the glyphs. This editor allows one to write the text left to right, handles all keyboard mappings from ASCII to USCII, and displays the Urdu text.
2.4 Font Glyph Editor
In the Nastaliq script (for Urdu), the shapes join from right to left, and within a ligature they also shift vertically; so in a ligature, shapes join right to left and top to bottom. The vertical positioning information cannot be put in the font itself, hence it was decided to make a utility which lets one set this information in a separate file.
Example of Nastaliq joints ("sehan" => suad baRi-he noon):
The utility lets one select a glyph from the given font and character, and define the following information on it:
1. Starting point (SP): the point where a glyph ends. This is generally on the baseline, or zero.
2. Ending point (EP): the point where a glyph joins with another glyph at its starting point (the SP of the other glyph). This point is less than or equal to the glyph width.
3. Dot points (UP and DP): the points where a dot or diacritic mark will be positioned on the glyph.
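One plausible reading of how SP/EP chaining and the vertical shift combine during layout is sketched below; the Glyph fields and the numbers used are assumptions for illustration, not the actual dot-file format:

```python
class Glyph:
    """Illustrative glyph record; the fields mirror the SP/EP description above."""
    def __init__(self, name, sp, ep, v_shift):
        self.name = name
        self.sp = sp            # starting point: where the previous glyph attaches
        self.ep = ep            # ending point: where the next glyph attaches
        self.v_shift = v_shift  # assumed vertical drop per join (Nastaliq slopes down)

def layout_ligature(glyphs):
    """Place glyphs so each SP coincides with the previous glyph's EP,
    stepping the baseline down as the ligature progresses."""
    x, y, placed = 0, 0, []
    for g in glyphs:
        placed.append((g.name, x - g.sp, y))  # glyph origin relative to the chain
        x += g.ep - g.sp                      # advance to this glyph's joining point
        y -= g.v_shift                        # each join shifts the next glyph down
    return placed
```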
2.5 The Font Editor Utility
The Font Editor utility lets us place the dot information for each glyph. The file thus generated is used for retrieving dot information at run time; we'll call it a dot file.
The following figure depicts a glyph with dot and caret information. As shown in the figure, the utility helps one define the following for a glyph:
1. Font Name.
2. Unicode value of the glyph.
3. Glyph Index in dot file (this is the unique id of
the glyph).
4. Start and End points.
5. Dot positions.
6. Caret position.
Fonts: following are samples of the fonts developed (Ghalib Bold).
3. Devanagari/Gurmukhi to Urdu Transliteration
Computer-based transliteration allows a user to convert a given text (say, in a storage format such as ISCII) of language 'A' to text in another language 'B' (in some storage format, say ASCII) on the basis of the phonetic rules that govern languages A and B.
A software (automatic) tool providing transliteration from one language to another is an effective tool which reduces the amount of effort required to create large databases of names in multiple languages.
Once a database in one language is created, the tool can be used to generate the same data for other languages. This results in considerable savings in time and money.
It is a useful tool for applications which require converting databases of names, addresses, phone numbers or electoral rolls from one language to another.
It can also be useful for converting entire databases from one language to another without re-entering the data. This utility can convert data in English from a database directly to ISCII, which will facilitate automated transliteration across databases.
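At its core, such a rule-based transliterator maps source characters to target letters through a phonetic rule table. A toy sketch in the spirit of UTRANS follows; the mapping covers only a few letters and ignores the contextual rules a real tool must handle:

```python
# A few illustrative Devanagari-to-Urdu letter mappings; a real tool needs the
# full table plus contextual rules for vowel signs, aspirates and conjuncts.
DEV_TO_URDU = {
    "क": "ک",  # ka
    "ब": "ب",  # ba
    "स": "س",  # sa
    "ा": "ا",  # aa matra rendered as alif
}

def utrans_sketch(text):
    # Characters without a rule pass through, so partial coverage degrades softly.
    return "".join(DEV_TO_URDU.get(ch, ch) for ch in text)
```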
A software tool for transliterating text from Devanagari & Gurumukhi (ISCII) to Urdu has been developed. This tool is available as a separate module, and has also been implemented in the Text Editor Test Application.
3.1 Following is a sample of Hindi text transliterated into Urdu (UTRANS)
Using this automated tool, UTRANS, we have transliterated the book "Meri Ekyavan Kavitayen" written by our Hon. Prime Minister Shri Atal Bihari Vajpayee.
3.2 Following is a sample of Punjabi text transliterated into Urdu (UTRANS)
4. Dictionary Development Tool
Natural language processing tools such as lookup dictionaries, thesauri, spellcheckers, grammar checkers, part-of-speech taggers and machine-aided translation require language-specific dictionaries. Several other domains, such as speech processing and expert systems, require backend language dictionaries to be able to process and generate valid outputs. Each of these applications requires dictionaries created to cater to its specific requirements.
The Generic Dictionary Development Tool will allow users to create dictionaries specific to application domains. As part of this development, the format of a generic dictionary should be standardized. The tool will provide facilities for obtaining views of dictionaries that are subsets of the generic dictionary.
Though a lot of refinement is still required, this tool is in the form of a standalone desktop utility which allows the user to create dictionaries containing words for Perso-Arabic languages. The user is provided with an interface to define inflection and root-word rules, and to add suffixes, prefixes, grammar tags and a domain for a given word entry.
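The kind of entry such a tool might store, and the notion of a domain-specific "view" over the generic dictionary, can be sketched as follows. The field names are assumptions based on the description above, not the tool's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class DictEntry:
    root: str                                         # root word
    prefixes: list = field(default_factory=list)      # allowed prefixes
    suffixes: list = field(default_factory=list)      # allowed suffixes/inflections
    grammar_tags: list = field(default_factory=list)  # e.g. noun, verb
    domain: str = "general"                           # application-specific marker

def domain_view(entries, domain):
    """A 'view' of the generic dictionary restricted to one application domain."""
    return [e for e in entries if e.domain in (domain, "general")]
```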
In order to build spellchecker & domain-specific dictionaries, the following dictionary creation work was undertaken and completed:
1. Official Hindi dictionary: Hindi 50,000 words, equivalents 50,000.
2. Urdu to English: Urdu words 17,000, grammar tags 17,000.
3. Hindi-Urdu Shabdh Kosh: Hindi words 10,000, equivalents 10,000.
4. Dictionary: Urdu words 10,000, grammar tags.
5. Dictionary of Urdu & Hindi names: 3,200, equivalents 3,200.
6. Technical dictionary: Hindi words 1,500, equivalents 1,500.
7. Muslim names dictionary: Hindi 1,500, equivalents 1,500.
9. Hindi-English-Urdu: Hindi words 8,000, equivalents 8,000.
Following is a typical GUI of the dictionary development tool for URDU.
5. Nashir - Wordprocessor– Publishing Made Easy
Nashir is designed to be easy to use for creating documents in Perso-Arabic languages, and at the same time powerful enough to lay out complete newspapers and magazines in Urdu, Sindhi, Kashmiri, Arabic and Farsi.
Each Nashir document consists of a number of pages. On each page of the document, you can place items like text blocks, graphics, etc.
Nashir is a kind of word processor and is also well suited to the publishing segment. Spellchecker support has been added.
Apart from the base dictionary for the spellchecker, the addition of various domain-specific dictionaries is in progress.
Salient Features of Nashir
1. Supports Nastaliq TrueType fonts (presently 2 fonts).
2. Supports Naskh fonts (presently 12 fonts).
3. Fonts for Sindhi and Kashmiri added.
4. Supports C-DAC & Phonetic keyboards.
5. User-defined keyboard support available.
6. Drawing objects provided.
7. OLE Automation supported.
Other Features
Bilingual Support
Supports Urdu, Sindhi & Kashmiri along with English.
The transliteration engine (UTRANS) has been implemented in Nashir. One can insert an ".aci" (ISCII) file into Nashir and see the transliterated version in Urdu script (Naskh or Nastaliq). Rule-based transliteration has been developed for Hindi & Punjabi.
Save As HTML
The user can save a document as an HTML page, and thus Naskh as well as Nastaliq scripts can be viewed on the Internet.
A Table control is provided to feed tabular information into the document.
A Phonetic keyboard is supported. A Floating keyboard will be supported soon.
Tool tip
Tool tips are provided for the user.
Master Page
A master page is provided with each document to provide the user with header, footer and similar settings. Whatever is designed on the master page appears on all pages.
Kerning
A kerning feature is provided to manually adjust text. Both horizontal and vertical kerning are provided.
Variable Font Sizes & Colors
Use variable font sizes and colors.
Nashir has a number of other features like horizontal and vertical rulers in the GUI, dynamic font settings for the Urdu and English fonts, indent and paragraph settings, page settings, etc.
Nashir Screen Shots
The following screen shot shows the "Text wrapping around object" feature of Nashir:
Spellchecker Support in Nashir
The full version of Nashir is now equipped with the spellchecker facility.
The spellchecker program works as a word-level error detection and correction (single or multiple corrections) tool. The following considerations were taken into account while designing the spellchecker module:
• Percentage of invalid words that pass through it
• Average suggestion rate
• Intended suggestion is given
• Domain the spellchecker is serving
• Dictionary structure
Error Detection/Correction Approaches
1. Non-word error detection
2. Isolated word error detection
3. Context based word correction
Non-Word Error Detection
Typographic errors (typing mistakes)
Cognitive errors (invalid entry by the user)
Phonetic errors
Real-word errors
1. Detection: block invalid words from passing through.
2. Correction: suggest alternate valid words that can pass through.
Error Detection Techniques
Lookup Dictionaries
• Structure: hash, tries, etc.
• Dictionary size
• Inflection, euphony, assimilation (morphological analysis)
• Domain
Bi-gram / Tri-gram approach (based on conditional probability)
A two-dimensional matrix holds the conditional probability of letters occurring next to each other. Bi-grams are searched, and if a bi-gram is not in the dictionary, the word is termed invalid.
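The bi-gram check described above can be sketched in a few lines. The letter set and the set of valid pairs below are toy placeholders, not real Urdu data:

```python
# A toy "matrix" of letter pairs that may legally occur next to each other;
# a real spellchecker derives this from a large corpus or dictionary.
VALID_BIGRAMS = {("b", "a"), ("a", "t"), ("t", "a"), ("a", "b")}

def is_suspect(word):
    """Flag a word if any adjacent letter pair never occurs in the matrix."""
    return any(pair not in VALID_BIGRAMS for pair in zip(word, word[1:]))
```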
6. The Urdu SDK Controls
The GIST Urdu SDK is a software development
tool for Urdu Language on MS Windows.
This is a set of software components that enables
software developers to facilitate the use of Urdu
scripts with MS Windows applications. Unlike other
solutions, the significant feature of GIST Urdu SDK
is that it enables the MS Windows application to
process PASCII data directly.
The GIST SDK uses ActiveX Technology from
Microsoft and provides a seamless, transparent and
self contained Indian language layer for data entry,
storage, retrieval and printing in Indian scripts for
your MS Windows applications.
These controls are bilingual and also support the English language. The package contains an editing control that supports complex scripts like Nastaliq. Following is a listing of the controls that are part of the Urdu SDK controls package version 1.0, which have been developed:
• UEdit
• UListbox
• UCombobox
• UStatic
• UButton
• UCheckbox
• URadiobutton
With the above set of controls, one can create all
sorts of applications that support Urdu language.
The UrduEdit OCX is a multi-line text editor control that supports Perso-Arabic languages. This version (1.0) of the control supports URDU as the primary language. The control supports Naskh as the default script, but can also be used to depict the Nastaliq script. The control has a number of features providing most of the functionality needed in a text-editing control:
1. Read & Write in Naskh / Nastaliq script
2. Bilingual text support i.e. English and Naskh
3. Copy paste text glyphs into external applications
4. Use for database applications
5. Transliterate (import) Hindi (ISCII) documents
into Urdu
6. Use as simple Editor with fixed font and size
7. Use Rich Edit text facilities (e.g. multiple colors,
font sizes, multiple fonts)
8. A number of Naskh fonts provided along with
the control
Sample Urdu Active-X control in a form
Applications using GIST URDU SDK
Following are some examples of applications where the UrduEdit OCX can be useful:
Database Applications
The UrduEdit Control can be connected to any type of database, and store text as data. You can store (or load) text from a database from (or into) the control. Typical applications that can be created are mail merge, report generators, and all types of business applications where documents are created from information stored in a database.
Desktop Applications
With the UrduEdit Control you can easily create applications that need text fields for Perso-Arabic languages. The control can be used as a single-line or multi-line control. You can also use the Rich Edit feature of the control to display text in different fonts, colors and sizes.
Web Applications
Create user controls with Visual Basic and use the UrduEdit Control in web pages.
Sample Database application form using Urdu
Active-X Control
7. Pocket Translator
The Multilingual Pocket Translator was developed keeping in mind foreigners travelling to India; it is even useful for persons travelling within India (inter-state). Similar products are available worldwide, but not for Indian languages.
1. English to Urdu & vice versa translation.
2. Fixed messages for 5 different categories such as Social, Travel, Emergency, Shop & Restaurant, both for Urdu & English; 500 messages in each category.
3. Messages can be browsed in Urdu & English.
4. Speech output for fixed messages only.
5. Single toggle key for selection of the various categories.
6. Arrow keys for scrolling through the messages/dictionary.
7. Script toggle key for changing the script.
8. 4 lines for English message display.
9. 2 lines for Urdu message display.
10. Talk key for speech output.
11. English - Urdu dictionary.
12. Displays the nearest matching word in English if the typed word is not available in the dictionary.
13. Translation of the same to Urdu by pressing the Script key.
Hardware Details
1. Developed on a Motorola 68EC000 (PQFP package) processor running at 10 MHz.
2. 2 MB of EPROM for program, fonts, fixed messages, dictionary & speech output.
3. 16 KB of SRAM for temporary variables.
4. 64-key matrix QWERTY-type keyboard.
5. 100 x 32 dot-matrix LCD display panel.
6. 4-bit ADPCM decoder for speech output.
Block Diagram of Pocket Translator
8. Multiprompter
An ideal system for news reading and documentary production, with support for Naskh & Nastaliq scripts.
Read news or your dialogues without tears: you look into the camera and get to see the text rolling at a speed that suits your reading pace, in Indian, Arabic or European languages. It can have multiple mechanical attachments with optical glass for mounting on camera tripods. On-line skipping of stories is supported.
Salient Features
Provides control of the scrolling speed, from a minimum up to 9 (maximum), with 0 for pause.
1. Supports a spellchecker.
2. Supports Indian and European languages.
3. Upgradeable to a News Room Automation System.
4. Works under the Windows environment.
5. Available in two models: Desktop, and Portable (laptop-based).
Following is a screen shot of the GUI (Graphical User Interface) of the Teleprompter application running on the Windows platform. It is used for the creation of news and documentaries, which are then prompted on special equipment having a monitor, mirrored glass & camera.
9. LIPS Advanced Creation Station & LIPS for DVD
LIPS is a revolutionary, multilingual system meant for electronic as well as fused subtitling. It consists of highly productive software and cost-effective hardware to create subtitles in Indian as well as foreign languages along with the time codes.
LIPS for DVD allows you to create DVDs with multilingual subtitles in various Indian as well as foreign languages. It converts the LIPS subtitling files into a DVD-compatible format. The system supports the Daikin and Sonic formats, and generates the header indicating the background and foreground colors, edge color, the in-time and out-time, as well as the subtitling image file.
Salient Features
1. Supports all Indian languages, inclusive of URDU, Sindhi & Kashmiri.
2. Naskh & Nastaliq support provided.
3. Supports TIF and BMP formats.
4. Supports different fonts and font sizes.
5. Edge color for subtitles can be specified.
6. Allows positioning of subtitles on the video.
7. Provision for increment and decrement of in and out times.
8. Compatible with PAL as well as NTSC formats.
10. Perso-Arabic Web Site Development
Under the Perso-Arabic resource centre, one of the major deliverables was to create a web site in PA scripts, especially in the Nastaliq script. The following method was adopted for the design & development of the website named http://
Designed a special ActiveX control (UEditWeb) for the web that could display Nastaliq.
This control supports HTML-like tags to set various attributes.
Converted USCII data into UEditWeb format. The UEditWeb data was stored in files which were put in a database.
Server-side scripts were written to generate the Urdu content for display.
Ten books will be added to make the site content rich.
The following screen shots show the main page & the page containing information related to the resource centre website.
Courtesy : Shri. M. D. Kulkarni
Chief Investigator - Resource Centre &
the Resource Centre Team, GIST
Centre for Development of Advanced Computing
(C-DAC), Pune University Campus,
Ganeshkhind, Pune 411 007
Tel : 00-91-20-5694000/01/02, 5694092
Fax : 91-20-5694059
Resource Centre For
Indian Language Technology Solutions – Sanskrit, Japanese, Chinese
Jawaharlal Nehru University, New Delhi
Achievements
Jawaharlal Nehru University, New Delhi-110067
Tel. : 00-91-11-26704772
Website : http://rciltsjnu
RCILTS-Sanskrit, Japanese, Chinese
Jawaharlal Nehru University, New Delhi
RCILTS, JNU is the resource centre for the Sanskrit language under DIT, Government of India. At JNU, work started in three languages, viz. Sanskrit, Japanese and Chinese. It took a lot of time and effort to assemble a cohesive team for the development of language technology.
In the beginning, a lot of problems were faced regarding development tools, software professionals, as well as language professionals. Due to the socio-economic nature of the IT industry, attrition among IT professionals was very high, so the work suffered a lot. Even today it is very difficult to get good linguists (computational linguists) who could carry forward the task of future development.
In the first stage it was decided that we should not reinvent the wheel: we would try to develop language technology and resources not being addressed at the other RCILTS. India has a rich cultural heritage and time-proven scientific knowledge, largely in the form of Sanskrit literature. So it was decided that we would develop a web-based Sanskrit Language Learning System (primarily for Sanskrit as a second language). It would be of great use to those scholars who look to our heritage knowledge for designing and developing knowledge-based systems.
Keeping the above goal in mind, at RCILTS, JNU we have designed various modules that teach the Sanskrit language in an asynchronous fashion. In developing these language resources we have kept in mind the various standardization aspects, especially Unicode: all the web content available on the RCILTS JNU site is Unicode compliant.
In the early phases of development we used many tools developed at C-DAC, so much of the content was ISFOC/ISCII based and all processing was done on ISCII data. There was no tool with which we could create Unicode-compliant content. At a much later stage of development, when Unicode tools and techniques had been developed and a few of them bought, we could convert all the content to Unicode. During this conversion process we developed the Web Content Unicode Converter, which works with ISFOC/ISCII data.
The RCILTS, JNU website is available at http://rciltsjnu
The various modules developed at RCILTS JNU are presently integrated for learning the language. With the development of new modules, this language learning system shall be used for various kinds of language processing (text and speech). The software modules and language resources available are:
Learning materials
• Sanskrit Lessons
• Sanskrit Exercises
• Sanskrit Script Learning System
• Sanskrit – English Lexicon
• Sanskrit – English Lexicon of Nyaya terms
• English – Sanskrit Lexicon
Sanskrit Grammar
• Panini’s Ashtadhyayi
• Laghu Siddhanta Kaumudi
• Dhatu Roop to Dhatu
• Dhatu to Dhatu Roop
• Pratipadik Roop
• Shabd Roop
• Upsarg
• Sandhi
• HTML Content Converter
• ISCII to UNICODE Database Converter and vice versa
• Devnagari Unicode Keyboard Driver for Browsers
Sanskrit Grammar is primarily based on Panini’s Ashtadhyayi and the Siddhanta Kaumudi. We have already developed modules for Dhatu Roop, Shabd Roop, and Sandhi. After learning all these modules a learner knows the basics of the Sanskrit language.
We plan to develop a Sandhi Vichhed System. With the Sandhi Vichhed system as a tool it shall be much easier for scholars to read and understand the age-old Sanskrit manuscripts which are treasures of time-proven scientific knowledge. With a proper understanding of these manuscripts it would be easy for them to interpret the scientific knowledge in its original context and apply it in the present context for the development of knowledge-based software systems. Knowledge-based systems would aid the processes of administration, business, management, and education.
Since JNU has a rich language school, in the beginning work was also carried out in the areas of Japanese and Chinese. After some time, work on the Chinese language was discontinued due to a paucity of Chinese language professionals, but work on the Japanese language is still on. We have developed many language learning resources for Japanese too.
1. Sanskrit Lessons and Exercises
Sanskrit Lessons are useful for learning the Sanskrit language. There are 21 lessons and the number will increase. The lessons are designed so that a person who does not know Sanskrit at all but knows either English or Hindi can go through them and learn. The Sanskrit contents of the lessons are described in both English and Hindi.
The basics of the Sanskrit language are covered elaborately, and then the grammar of the language is explained. Nearly all topics of grammar are covered with many examples and explanations. Special care is taken over the pronunciation of the Sanskrit alphabet and words; the pronunciation is based on standard phonetic notations. One can go to the ‘Sanskrit script learning module’ from a lesson to learn the writing process and the pronunciation of the letters.
Some extra facilities are provided to make learning and remembering easy. All the Sanskrit words used throughout the lessons are available to the user with their roots and meanings. These words are also connected to the Lexicon and accessible from any page of any lesson. The user can also see the new words on the particular page being read, with the facilities explained above. The navigation buttons below each page are self-explanatory.
The exercises are based on the lessons; for each lesson more than two exercises are available. These are objective-type questions, with four options given for each question. As an option is chosen, the user is promptly told whether it is correct or not. After attempting the questions, the correct answers are given in a table below, so the user can see the answer given as well as the actual answer for each question just by choosing an alternative. The exercises are obviously helpful for practising what a user learns in that lesson.
2. Devanagari Script learning and Character
Recognition System
Scripts are the basics of a language, and writing these scripts is very important for learning it. Using this module the user learns the Devanagari alphabet and how to write and pronounce the letters. Writing strokes for a character are shown in stroke order at four different speeds. The module covers all characters, conjuncts, matras and numerals, and there is audio support for listening to the pronunciation of each letter. Recognizing a character after hearing its pronunciation is also covered. The module is also hyper-linked with the Sanskrit Lessons, so the user can access it from any Sanskrit lesson.
Example: the Sanskrit vowel “Ri” is written in four strokes at four different speeds. Besides the basic letters there are a number of conjunct characters in the Sanskrit language; their formation and the writing process in the proper stroke order are given. Phonetic notations are given in the top-left corner so that a user can pronounce each letter easily.
Numerals are also important to learn, given their frequent use in the language. The writing process for Sanskrit numerals is shown with their stroke order; if there is more than one style of writing a numeral, the alternatives are provided below it.
Pronunciation is important for speaking a language. Attention is given to the right and clear pronunciation of the letters, and audio support is provided. There are also some exercises for users in recognizing letters by their pronunciation.
3. Sanskrit-English Lexicon
In this module a user can get the meaning, grammatical category and usages of a Sanskrit word. It will also display the phonetic notation of the word; the grammatical category and usage are displayed if available.
The E-R diagram of the Sanskrit-English Lexicon is given below. All the lexicons follow the same E-R diagram.
The facility for giving input is provided in both ways. Users can click the beginning letter of a word; all words which start with that letter are then displayed in the right-side window. On clicking a particular word, the meaning with phonetic notation and the grammatical category are displayed in the left window.
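The browse-by-first-letter lookup described above can be sketched as a small index in Python. The words, meanings and field names below are illustrative placeholders, not the actual RCILTS lexicon data:

```python
# Sketch of the browse-by-letter lexicon lookup described above.
# Entries and field names are illustrative, not the real RCILTS data.
lexicon = {
    "agni": {"meaning": "fire", "category": "noun", "phonetic": "a-gni"},
    "gacchati": {"meaning": "goes", "category": "verb", "phonetic": "gac-cha-ti"},
    "guru": {"meaning": "teacher", "category": "noun", "phonetic": "gu-ru"},
}

def words_starting_with(letter):
    """Right-side window: all words beginning with the chosen letter."""
    return sorted(w for w in lexicon if w.startswith(letter))

def entry(word):
    """Left-side window: meaning, phonetic notation and grammatical category."""
    return lexicon[word]

print(words_starting_with("g"))   # ['gacchati', 'guru']
print(entry("guru")["meaning"])   # teacher
```

The same two-step pattern (letter index, then full entry) serves all the lexicons, since they share one E-R diagram.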
5. Panini Ashtadhyayi
In this module a user can get the Panini Sutra, the Sutra with Anuvrtti, and its Sanskrit and English explanations after giving an Ashtadhyayi Sutra number or a Siddhanta Kaumudi Sutra number. The user will also get information on how many times the Ashtadhyayi Sutra occurs in the Siddhanta Kaumudi.
The user can select an Ashtadhyayi Sutra or can enter the Siddhanta Kaumudi number (if the user gives both an Ashtadhyayi Sutra and a Siddhanta Kaumudi number, the system will accept the Ashtadhyayi Sutra).
Example: for the inputted Ashtadhyayi Sutra number 2.2.23, the Sutra occurs in two places (Kaumudi serial numbers 529 and 829). If the user clicks either Kaumudi Sutra number, all the information for that Sutra is displayed. If the user clicks the Kaumudi Prakaran, all the sutras of that Prakaran are displayed.
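The sutra-number lookup just described amounts to an index from Ashtadhyayi sutra numbers to their Kaumudi occurrences. A minimal sketch, loaded only with the 2.2.23 example from the text:

```python
# Illustrative index from an Ashtadhyayi Sutra number to the places where
# it occurs in the Siddhanta Kaumudi. Only the example from the text
# (2.2.23 -> Kaumudi serial numbers 529 and 829) is loaded here.
ashtadhyayi_to_kaumudi = {
    "2.2.23": [529, 829],
}

def occurrences(sutra_number):
    """Report how many times, and where, the sutra occurs in the Kaumudi."""
    places = ashtadhyayi_to_kaumudi.get(sutra_number, [])
    print(f"Sutra {sutra_number} occurs {len(places)} time(s): {places}")
    return places

occurrences("2.2.23")  # Sutra 2.2.23 occurs 2 time(s): [529, 829]
```

In the real module each Kaumudi serial number would further link to the sutra text, Anuvrtti and explanations.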
6. Dhatu to DhatuRoop
Using this module a user can get all the information for a dhatu, i.e. its meaning, gan name, pad name and the roops after selecting the lakar. For giving input, it is necessary to click on the beginning letter of a word in the left window; all words which start with that letter are then displayed in the right-side window.
The user can also write the input word with the help of the phonetic keyboard in the text box. On pressing the submit button, the meaning with phonetic notation and the grammatical category are displayed for that inputted word.
4. Dictionary of Nyaya Terms
This dictionary of Nyaya terms is part of an ambitious project of preparing an encyclopedia of Indian logic. The main purpose is to resolve confusion and bring uniformity to the translation of Nyaya terms into English. The stress has been on clarity of concepts rather than on literal translation. The root and meanings, with phonetic notation, of each Sanskrit Nyaya term are provided here.
For giving input, it is necessary to click on the beginning letter of the word in the left window; all words which start with that letter are then displayed in the right-side window. On clicking a particular word, the root and meanings with phonetic notation are displayed in the left window.
Example: for the inputted dhatu ‘ni’, the information for the dhatu ‘ni’ is displayed. The names of the 10 lakars are displayed; the user can select any lakar and get the roops of that lakar. As this dhatu is an Ubhayapada dhatu, the user also has to select the pad on the next page.
7. Dhatu Roop to Dhatu
This module is the reverse of the Dhatu to DhatuRoop module: in it a user can get the dhatu and its attributes from a dhatu roop.
Example: for the given dhatu roop “Bhavati”, the user can get all the information related to that dhatu roop, such as the Dhatu name, Dhatu Arth, Dhatu Gana, Dhatu Pada, Dhatu Lakar, Purush and Vachan.
8. Sandhi
Sandhi means the coalescence of two words coming into immediate contact with each other. Using this module the user can get information about Sandhi rules and processes; the Sutra number in the Ashtadhyayi and its description are displayed. The user can learn the three types of Sandhi (Svara Sandhi, Vyanjan Sandhi and Hal Sandhi) through this module. Data is in Unicode, and Sandhi exceptions and options are also covered.
The flowchart for this module is given below. The following procedures were written to make the module user-friendly and more descriptive.
Input procedure : this procedure takes the input from the user in the form of Unicode data. A phonetic keyboard is used for this purpose, so that everybody can use the module easily.
CheckSandhi procedure : this procedure finds the Sandhi that applies to the given input.
Panini Ashtadhyayi Sutra Number : this procedure connects with the Ashtadhyayi module; clicking on a Sutra number displays the sutras used.
The module takes two words as input; the first word cannot be null but the second word can be. A user inputs the two words and submits the form to get the result:
a. Fill in the first word in the textbox captioned ‘Input First Word’.
b. Fill in the second word in the textbox captioned ‘Input Second Word’.
c. Click the Submit button.
Example: for the inputted first word “Dyata” and second word “ari”, the module shows all the explanations related to the Sandhi, such as the definition of the Sutra which applies in the process and the Sutra itself. It also displays the Sutra number in the Ashtadhyayi; this hyperlink leads to more details of the sutra.
First input word + second input word = resultant word
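As a minimal sketch of the CheckSandhi step, one Svara Sandhi rule (savarna-dirgha, Ashtadhyayi 6.1.101: a + a coalesce into long aa) can be applied to two input words. The sketch works on romanized strings for illustration; the real module operates on Unicode Devanagari with the full rule set:

```python
# Sketch of applying one vowel (Svara) Sandhi rule to two input words.
# Romanized text is used for illustration only; the module itself works
# on Unicode Devanagari and covers the full Ashtadhyayi rule set.
def svara_sandhi(first: str, second: str) -> str:
    if not first:
        raise ValueError("first word cannot be null")  # input rule above
    if not second:                   # second word may be empty
        return first
    # savarna-dirgha: a + a -> aa, e.g. deva + alaya -> devaalaya
    if first.endswith("a") and second.startswith("a"):
        return first[:-1] + "aa" + second[1:]
    return first + second            # no rule matched: plain concatenation

print(svara_sandhi("deva", "alaya"))  # devaalaya
```

Each matched rule would also carry its Ashtadhyayi sutra number, which is what the module hyperlinks for the user.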
9. Japanese Lessons
The Japanese Lessons are very useful as an introduction to the Japanese language. They are mainly for Indians who know Hindi well, as the contents of the lessons are described in Hindi. A few screen shots are given below.
The Japanese language contains a rich set of alphabets, and it is important for the learner to be familiar with them. Keeping this in mind, most of the fourteen lessons describe the basics of the Japanese language with simple examples and explanations. The grammar of the Japanese language is discussed not elaborately but lightly, to give an idea of verbs, tenses, nouns, adjectives etc. Care is taken not to make things complicated.
Kanji is another thing that adds a different value to the Japanese language, which has a rich set of Kanji. Kanji are explained with their stroke orders.
Many facilities are presented to make the learning process easy and attractive. The user can see all the words of all the lessons from any page, and the new words of a section are also provided to the user. From these word lists a user can go to the Japanese-Hindi Lexicon for the details of a word. From the alphabet-learning lessons a user can go to the Japanese script learning module to see how the alphabets are written and pronounced; this script learning covers all the alphabets, including Kanji.
10. Japanese Script learning and Character
Recognition System
Scripts are the basics of a language, and writing these scripts is very important for learning it. Using this module the user learns the Japanese alphabets and how to write and pronounce them. Writing strokes for a character are shown in stroke order at four different speeds. The module covers all characters, including Kanji. Recognizing a character after hearing its pronunciation is also covered. The module is also hyper-linked with the Japanese Lessons, so the user can access it from any Japanese lesson.
There are different types of alphabets in the Japanese language, and all of them are covered here. The writing process is made user-friendly: there are four options for the speed of writing an alphabet, and there is audio support for listening to its pronunciation. Emphasis is given to the stroke order according to which the alphabets are correctly written.
Besides the basic alphabets there are a number of Kanji in the Japanese language; their formation and the writing process in the proper stroke order are given.
Pronunciation is important for speaking a language. Attention is given to the right and clear pronunciation of the alphabets, and audio support is provided.
11. Japanese-Hindi Lexicon
In this module a user can get the attributes of a Japanese word in Hindi. It is very helpful for a user who knows Hindi and wants to learn the Japanese language. The facility for giving input is provided in both ways. Users can click the beginning letter of a word; all words which start with that letter are displayed in the right-side window. On clicking a particular word, the meaning with phonetic notation and the grammatical category are displayed in the left window.
12. HTML Content Converter
This converter converts ISCII/ISFOC-based HTML content to Unicode-based HTML content. It preserves the structure of the HTML content and, to a large extent, its presentation as well. It also sets the content-encoding directive to UTF-8, so that the resultant output of the converter is complete in all respects.
High-level design of the converter:
1. The converter reads the input file.
2. It looks for the presence of ISFOC strings inside the content.
3. Whenever it finds an ISFOC string, it converts it to ISCII.
4. Thereafter it converts the ISCII string to Unicode.
5. It writes the resultant Unicode data to the output.
6. It also updates the content encoding inside the output data.
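The core of steps 3-5 is a byte-to-code-point mapping. A sketch, with only three table entries drawn from the ISCII-91 layout as we understand it (the full table covers the whole character set, and the font-specific ISFOC-to-ISCII stage is omitted here):

```python
# Sketch of steps 3-5: mapping ISCII bytes to Unicode Devanagari.
# Only three code points are shown for illustration; the real table
# covers the full ISCII-91 set, and ISFOC handling is font-specific.
ISCII_TO_UNICODE = {
    0xA4: "\u0905",  # letter A
    0xB3: "\u0915",  # letter KA
    0xDA: "\u093E",  # vowel sign AA
}

def iscii_to_unicode(data: bytes) -> str:
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                             # ASCII passes through
        else:
            out.append(ISCII_TO_UNICODE.get(b, "\ufffd"))  # unknown byte
    return "".join(out)

# KA + AA matra -> Devanagari "kaa"
print(iscii_to_unicode(bytes([0xB3, 0xDA])))
```

Because ASCII bytes pass through unchanged, HTML tags survive the conversion, which is how the converter preserves document structure.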
This is the flow chart of the HTML Content Converter.
Other Converters
1. ISCII Text to Unicode Text Converter
2. Unicode Text to ISCII Text Converter
3. ISCII Database to Unicode Database Converter
4. Unicode Database to ISCII Database Converter
5. Devanagri_Unicode (Phonetic) Keyboard
6. Devanagri_Unicode (Inscript) Keyboard
7. Japanese_Unicode Keyboard drivers (under development)
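A phonetic keyboard driver of the kind listed above essentially maps typed Latin key sequences to Devanagari code points. The bindings below are illustrative, not the shipped layout; matching the longest sequence first is one common design choice:

```python
# Sketch of a phonetic (transliteration) keyboard mapping; the key
# bindings shown are illustrative, not the actual driver's layout.
PHONETIC_MAP = {
    "a": "\u0905",   # letter A
    "k": "\u0915",   # letter KA
    "kh": "\u0916",  # letter KHA
}

def type_phonetic(keys: str) -> str:
    """Greedily match the longest key sequence first (e.g. 'kh' before 'k')."""
    out, i = [], 0
    while i < len(keys):
        two = keys[i:i + 2]
        if two in PHONETIC_MAP:
            out.append(PHONETIC_MAP[two]); i += 2
        elif keys[i] in PHONETIC_MAP:
            out.append(PHONETIC_MAP[keys[i]]); i += 1
        else:
            out.append(keys[i]); i += 1     # unmapped keys pass through
    return "".join(out)

print(type_phonetic("kha"))  # KHA followed by A
```

An Inscript driver differs only in the table: each key maps directly to one Devanagari character instead of a Latin-sounding sequence.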
13. Current/Future tasks
1. The utilities developed, being developed and to be developed at RCILTS, JNU will be available on the web.
2. RCILTS, JNU will participate in the development of at least two Devanagari Unicode fonts (1 TrueType and 1 OpenType) locally, by procuring or by outsourcing.
3. The centre will collect and make a repository of related work being done at national and international centres, such as C-DAC, IIT Kanpur, IISc Bangalore, JETRO etc. The centre will establish active contacts with these national and international bodies for the acquisition of public-domain material on the subject.
4. The format of the lexicons will be advertised and discussed with other RCILTS.
5. The centre will collect and place on the web some more texts, such as the Hitopdesh etc.
6. The centre must enhance existing systems such as the lexicons, and develop systems such as the Sandhi and Sandhi Vichhed modules, leading to the development of a translation aid system for Sanskrit to other Indian languages and vice versa. The centre must acquire Hindi/Sanskrit morphological analyzers and then improve upon them.
7. The centre will also enhance OCR performance for Sanskrit.
8. The centre will also enhance spell checkers, basic word processors and text editors for Sanskrit.
9. The centre will procure a message server.
10. The tasks on OCR, fonts, spell checker, text editor, word processor and message server will also be taken up. However, some of these activities may be taken up right away depending on the funds currently available with the resource centre.
The following work being done at the centre will be fully available to users on the Internet by September 30, 2003.
Sanskrit Modules
1. Ashtadhyayi of Panini with the Siddhanta Kaumudi of Nagesh Bhatta (English Tika) and Prabhakari Tika (Hindi). In addition, Katyayan’s Vartik on the Ashtadhyayi will also be included.
2. Dhatu Ratnakar with all forms of all Dhatus.
3. Pratyahar Module.
4. Ten Lessons on Sanskrit Sambhashan.
5. Samgya word forms (Pratipadic) with 1,000 words.
6. Sarvnam Module with all Pronouns.
7. Sanskrit-English Lexicon (word meanings only) with 30,000 words.
8. English-Sanskrit Lexicon (word meanings only) with 30,000 words.
9. 30 Sanskrit Lessons with Exercises.
10. Sandhi Prakaran with exceptions and options.
11. Lists of other Sanskrit words, such as adjectives, conjunctions etc.
12. Devanagari Script Learning Module with Matras and Conjuncts.
Japanese Modules
1. 21 lessons with exercises. These lessons shall constitute a course which is more than equivalent to a one-year diploma course in Japanese.
2. A simple Hindi to Japanese translation exercise module.
3. Japanese-Hindi Lexicon (word meanings only) with 6,000 words.
4. Hindi-Japanese Lexicon (word meanings only) with 6,000 words.
5. Hiragana Script Learning Module.
6. Katakana Script Learning Module.
7. Kanji Script Learning Module.
In addition to adding more lessons, the centre plans to augment the lexicons with audio-visual support and usage of words for meanings, and to add phrasal grammatical categories of words. A Sandhi Vichhed Module will be developed, which will aid the development of simple translation systems (Sanskrit to other Indian languages and vice versa). Simple Hindi to Japanese and Japanese to Hindi translators are also planned. The centre also plans to develop content of the Hitopdesh kind for both Sanskrit and Japanese.
There is a non-availability of good expertise in the field of the Chinese language. The centre is therefore of the opinion that it should stop its efforts in developing systems for the Chinese language and concentrate its energy on Sanskrit and Japanese.
Courtesy: Prof. G.V. Singh
Jawaharlal Nehru University
School of Computer and Systems Sciences
New Mehrauli Road
New Delhi – 110 067
(RCILTS for Sanskrit, Chinese & Japanese)
Tel: 00-91-11-26107676,26101885
6. TDIL Associates Achieve Honours…Congratulations !
Congrats Dr. Vikas!
On being awarded VIGYAN BHUSHAN by
Hon’ble Prime Minister in a ceremony
organized by UP Hindi Sansthan at Lucknow
on 21 May, 2003.
About Dr. Om Vikas (
[National Coordinator of TDIL Mission]
B.Tech, M.Tech & Ph.D from IIT Kanpur. Fellow of IETE and member of IEEE. Recipient of the Fellowship of the Russian Academy of Informatization of Education and of the Vigyan Bhushan, Atmaram, Vishisht Padak and Indira Gandhi Rajbhasha awards.
After his M.Tech he served in TCS. He joined NIC in 1977 as Senior Systems Analyst and rose to the position of Senior Director in 1998. He served on deputation as visiting professor at IIT Kanpur and NCERT, and as Counsellor for Science & Technology in the Indian Embassy at Tokyo, Japan. Under a UNDP fellowship he studied large database designs in the USA, Canada & Europe (1980).
His research interests include computer architecture, database design and language informatics. He has been on the programme committees of various international conferences.
He has significantly contributed towards promoting Language Informatics, Computer Manpower Development, International Cooperation and Technical Hindi.
Sensitivity to society is evident from his initiatives: publishing the ALOK newsletter in Hindi during his B.Tech; founding the People’s Science Society (Lok Vigyan Parishad) in 1986; and launching TDIL (Technology Development for Indian Languages) in the 1990s and coordinating it in mission mode. Founder editor of
Currently Senior Director / Scientist ‘G’ and Head, TDIL (Technology Development for Indian Languages) Mission, Min. of Communications & IT, New Delhi - 110003.
Congrats Prof. Dhande !
On becoming Director of the Indian Institute of Technology, Kanpur
About Prof. Sanjay Dhande (sgd@iit
[Chief Investigator of RC for ILTS]
Ph.D in Mechanical Engineering, IIT Kanpur; currently he is Director, Indian Institute of Technology, Kanpur, and Professor in the Mechanical Engineering and Computer Science Departments, IITK.
He has also held the posts of Asst. Professor (1979-85), Professor (1985 to date), Visiting Professor at Virginia Tech, Virginia, USA (1992), Head, Mechanical Engg. (1993-95), Dean, Research and Development (1999-2001), and Director (2001 to date).
He has published numerous papers in national and international journals. He has also written books: “Kinematics and Geometry of Planar and Spatial Cam Mechanisms”, published by Wiley Eastern Ltd.; “Computer Aided Design and Manufacture”, published by the Committee on Science & Technology in Developing Countries (COSTED), Singapore; and “Computer Aided Engineering Graphics and Design”, which is under publication by Wiley Eastern Publishers Ltd.
Congrats Prof. Sinha
for successfully launching the on-line Machine Translation System (http://
About Prof. R.M.K Sinha (
[Chief investigator of MT Projects]
Ph.D (CS), IIT Kanpur; M.Tech in Industrial Electronics (Electronics & Communication Engineering), Indian Institute of Technology, Kharagpur, 1969; M.Sc.Tech in Electronics and Communication, University of Allahabad, 1967.
He has taught at many institutions/universities in India and has also been visiting professor/faculty at institutions/universities in the USA, Canada and Bangkok. Currently he is Professor at the Dept. of CS and the Dept. of EE, IIT Kanpur. He has published very extensively, especially in the area of Indian language technologies, in various national and international journals.
He received the “Man of the Year Award” from the American Biographical Institute, USA. He is a Fellow of IETE, India, a Senior Member of IEEE, USA, and serves many other organizations in various capacities.
The development of Indian language technologies is his main area of interest, which includes AI, NLP, MT, SST, OCR, document processing etc. He is credited with the development of the INSCRIPT keyboarding and coding schemes (ISCII), the Integrated Devanagari Computer Terminal, the GIST Terminal, transliteration among Indian languages, spell-checker design, OCR, and the Anglabharati machine-aided translation system for English to Hindi and English to Urdu for Windows and Solaris/Linux OS. HindiAngla, the Hindi to English MAT, is in advanced stages of development. He is also working on developing a MAT system for all Indian languages, a speech-to-speech translation system, and a lexical knowledge-base development named
Congrats Prof. Lehal !
On becoming Professor at Punjabi University, Patiala
[Chief Investigator of RC-ILTS (Punjabi)]
About Prof. Gurpreet Singh Lehal
Ph.D (CS), Punjabi University; M.E (CS), TIET Patiala; M.Sc. Mathematics (Honours), Punjab University.
He has attended, presented and published many technical papers in the area of Indian language technology. He has been working for more than five years on different projects related to the computerization of Punjabi, and guiding many M.Tech. and Ph.D scholars on different topics related to the technological development of Punjabi.
As Chief Investigator of the project “Resource Centre for Indian Language Technology Solutions - Punjabi (April 2000 - June 2003)”, his contributions include the Gurmukhi OCR, Punjabi Word Processor, Gurmukhi to Shahmukhi transliteration system (in collaboration with C-DAC, Pune), Punjabi spell checker, Punjabi sorting program, Punjabi font converter, on-line Punjabi-English, English-Punjabi and Hindi-Punjabi dictionaries, and a web site for on-line teaching of Punjabi.
The Gurmukhi OCR and Punjabi word processor are ready for commercialization, and there is already a big demand for these products in India as well as abroad in the UK, USA and Canada. An MOU for transfer of technology of the Punjabi spell checker is being finalised with M/S Modular Systems, Pune.
7. Resource Centres Technology Index
I. Indian Institute of Technology, Kanpur (Hindi, Nepali)
1. Machine Translation
2. Speech-to-Speech Translation
3. Lexical Knowledge-Base Development
4. Optical Character Recognition
5. Transliteration
6. Spell Checker Design
7. Knowledge Resources
7.1 Gitasupersite
7.2 Brahmasutra
7.3 Complete Works of Adi Sankara
7.4 Ramcharitmanas
7.5 Upanishads
7.6 Kavi Sammelan
7.7 Bimari-Jankari
7.8 Paramhans Ram Mangal Das Ji
7.9 Nepali Texts
8. Technical Issues
9. Details of Texts/ Commentaries for each of the
10. The Team Members
II. M.S. University of Baroda, Vadodara (Gujarati)
1. Knowledge Resources
1.1 Gujarati WordNet
2. Knowledge Tools
2.1 Portal
2.2 Multilingual Text Editor
2.3 Code Converter
2.4 Gujarati Spell Checker
3. Translation Support Systems
3.1 Machine Translation
4. Human Machine Interface Systems
4.1 OCR for Gujarati
4.2 Text To Speech (TTS)
5. Localization
5.1 Language Technology Human Resource
6. Standardization
6.1 Unicode
7. Products which can be launched and the services which the Resource Centre can provide to the State Government and the Industry in the State
7.1 Technical Skills
7.2 Products Which can be Launched
7.3 IT Services
8. IT Services - Multimedia CDs
8.1 Learn Gujarati Through English
8.2 Adolescent Health
8.3 Bibliography of books and journals
8.4 Classic Knowledge-base
9. The Team Members
III. Indian Institute of Technology, Mumbai
(Marathi, Konkani)
1. Lexicon and Ontology
1.1 Lexicon
1.2 Ontology
2. The Hindi Wordnet
3. The Marathi Wordnet
4. Automatic Generation of Concept Dictionary
& Word Sense Disambiguation
5. Hindi Analysis and Generation
6. Marathi Analysis
7. Speech Synthesis for Marathi Language
8. Project Tukaram
9. Automatic Language Identification of
Documents using Devanagari Script
10. Object Oriented Parallel & Distributed Web
11. Designing Devanagari Fonts (Three Types )
12. Low Level Auto Corrector
13. Font Converters
14. Marathi Spell Checker
15. IT Localisation
16. Publications
17. The Team Members
IV. Thapar Institute of Engineering & Technology,
Patiala (Punjabi)
1. Products Developed
1.1 Spell Checker
1.2 Font Converter
1.3 Sorting Utility
1.4 Bilingual Punjabi/ English Word processor
1.5 Gurmukhi OCR
1.6 Gurmukhi to Shahmukhi Transliteration
2. Contents Uploaded on Internet
2.1 Punjabi Classic Literature
2.2 Bilingual Dictionaries and Glossary
• Punjabi English On-line Dictionary
• English Punjabi On-line Dictionary
• Hindi Punjabi On-line Dictionary
• Glossary of English-Punjabi Administrative Terms
2.3 On Line Teaching of Punjabi
2.4 On Line Font Conversion Utility
• Punjabi Spell Checker
• Punjabi Fonts
3. Interaction with Punjab State Govt.
4. Publications
5. The Team Members
V. Indian Statistical Institute, Kolkata (Bangla)
1. Core Activities
1.1 Website Development and the Language
Design Guide
1.2 Training Programmes
2. Services
2.1 Corpus Development
• Printed Bangla Document Images &
Ground Truth for OCR and Related
• Development of Bangla Text Corpus
in Electronic Form (including a Bangla
Dictionary, and Several Bangla Classics)
• Electronic Corpus of Speech Data
2.2 Font Generation and Associated Tools
• Public Domain Bangla Font
• Converter-Font File to ISCII File and
• Bangla Text Editor
• Bangla Spell Checker
3. Products
3.1 OCR system for Oriya
3.2 Adaptation of Bangla OCR to Assamese
3.3 Information Retrieval System for Bangla
3.4 Script Identification and Separation from
Indian Multi-Script Documents
4. Research & Development
4.1 Automatic Processing of Hand-Printed Table-Form Documents
4.2 Research and Development of Neural Network Based Tools for Printed Document (in eastern regional scripts)
5. The Team Members
VI. Utkal University (Oriya)
1. Intelligent Document Processing (OCR for Oriya)
2. Natural Language Processing (Oriya)
2.1 Oriya Machine Translation System
2.2 Oriya Word Processor (OWP)
2.3 Oriya Morphological Analyser (OMA)
2.4 Oriya Spell Checker (OSC)
2.5 Oriya Grammar Checker (OGC)
2.6 Oriya Semantic Analyzer (OSA)
2.7 E-Dictionary (Oriya-English)
2.8 OriNet (WordNet for Oriya)
2.9 SanskritNet (Word Net in Sanskrit)
3. Speech Processing System for Indian Languages
3.1 Text to Speech (TTS)
3.2 Speech To Text (STT)
4. The Team Members
Orissa Computer Application Centre (Oriya)
1. Products Developed at OCAC
1.1 Oriya Spell Checker
1.2 Thesaurus in Oriya
1.3 Bilingual Electronic Lexicon
1.4 Corpus in Oriya
1.5 Bilingual Chat Server
1.6 Net Education
1.7 XML Document Creation and
1.8 Oriya OCR
1.9 Oriya E-Mail
1.10 Oriya Word Processor with Spell Checker
under Linux
1.11 Computer Based Training (in Oriya)
1.12 Oriya Language Based E-Governance
2. ToT done for various projects so far.
3. Training Programmes run for the officials for
the state Govt.
4. Core Activities
4.1 Web Hosting of Oriya Language Classics
4.2 Hosting of Web Sites of Govt. Colleges in Orissa
5. Products Proposed to be Developed
6. The Team Members
VII. Indian Institute of Technology, Guwahati
(Assamese & Manipuri)
1. Knowledge Resources
1.1 Corpora
• Assamese Corpora
• Manipuri Corpora
1.2 Dictionaries
1.3 Design Guides
1.4 Phonetic Guides
2. Knowledge Tools
2.1 The RCILTS, IIT Guwahati Website
2.2 Fonts
2.3 Spell Checker
2.4 Assamese Language Support for Microsoft
2.5 Morphological Analyzers
3. Translation Support System
4. Human Machine Interface Systems
4.1 Optical Character Recognition for
Assamese and Manipuri
4.2 Speech Recognition System
4.3 Interface for e-Dictionary
5. Language Technology Human Resource
6. Standardization
7. Publications
8. Manika Newsletter
9. The Team Members
VIII. Indian Institute of Science, Bangalore (Kannada)
1. Web Sites and Support to Instruction
1.1 Kannudi
1.2 LT-IISc
1.3 Bodhana Bharathi : Multimedia Educational
CDs for 7th, 8th and 10th
1.4 Bilingual Instructional Aid for Learning
German through Hindi
1.5 Information Base in Hindi Pertaining to
German History
2. Knowledge Bases
2.1 Sudarshana : A Web Knowledge Base on
Darshana Shastras
2.2 Indian Logic Systems
2.3 Indian Aesthetics
3 Technologies and Language Resources
3.1 Brahmi : Kannada Indic Input Method,
Word processor
3.2 OpenType Fonts : Sampige, Mallige,
3.3 Kannada Wordnet
3.4 OCR for Tamil
3.5 OCR of Printed Text Document in
4. Research
4.1 Automatic Classification of Languages
Using Speech Signals
4.2 Algorithms for Kannada Speech Synthesis
5. Publications
IX. University of Hyderabad (Telugu)
1. RC for ILTS
2. Products
2.2 Tel-Spell : Spell Checker
2.3 AKSHARA : Advanced Multi-Lingual
Text Processor
2.4 E-mail in Indian Languages
2.5 WILIO : Interactive Web Page in Indian
2.6 Telugu Corpus
2.7 Dictionaries, Thesauri & other Lexical
2.8 Morphology
2.9 Stemmer
2.10 Part of Speech Tagging
2.11 VIDTA : Comprehensive Toolkit for
Web-Based Education
2.12 Grammars & Syntactic Parsers
2.13 Machine Aided Translation
2.14 Tools
2.14.1 Font Decoding
2.14.2 Web Crawler for Search Engine
2.14.3 PSA : A Meta Search Engine
2.14.4 Corpus Analysis Tools
2.14.5 Website Development Tools
2.14.6 Character to Font Mapping Tools
2.14.7 Dictionary to Thesaurus Tools
2.14.8 Dictionary Indexing Tools
2.14.9 Text Processing Tools
2.14.10 Finite State Technologies Toolkit
3. Service & Knowledge Bases
3.1 On-line Literature
3.2 History-Society-Culture Portal
3.3 On-Line Searchable Directory
3.4 Character Encoding Standards, Roman
Transliteration Schemes, Tools
3.5 Research Portal
3.6 VAANI : A Text to Speech System for
3.7 Manpower Development
4. Epilogue
4.1 Strengths and Opportunities
4.2 Outreach
5. Publications
6. The Team Members
X. Centre for Development of Advanced
Computing – Thiruvananthapuram (Malayalam)
1. Human Machine Interface Systems
1.1 NAYANA – Optical Character
Recognition System for Malayalam
1.2 – Malayalam Text to
Speech System (TTS)
2. Knowledge Tools
2.1 – Malayalam Spell Checker
2.2 – Malayalam Font
Package and Script Manager
2.3 Text Editor
2.4 Code Converters
2.5 – Malayalam Web
Based Search Engine
2.6 Malayalam Portal
3. Services
3.1 – Malayalam E-mail Server
3.2 Malayalam E-Com Application
4. Knowledge Resources
4.1 Trilingual Dictionary
5. Language Tutors
5.1 Ezhuthachan – The Malayalam Tutor
5.2 English Tutor
6. Other Activities
6.1 Providing Technology Solutions
6.2 Interaction with State Government
6.3 Training Activities
6.4 COWMAC (Consortium of Industries
working in the area of MAlayalam Computing)
7. Publications
8. Expertise gained
9. Future Plans
10. The Team Members
XI. School of Computer Science & Eng., Anna
University Chennai (Tamil)
1. Knowledge Resources
1.1 Online Dictionary
1.2 Corpus Collection Tools
1.3 Contents Authoring Tools
1.4 Tamil Picture Dictionary
1.5 Flash Tutorial
1.6 e-Handbook of Tamil
1.7 Karpanaikkatchi – Scenes from Sangam
1.8 District Information
1.9 Educational Resources
2. Knowledge Tools
2.1 Language Processing Tools
• Morphological Analyser
• Morphological Generator
2.2 Text Editor
2.3 Spell Checker
2.4 Document Visualization Tool
2.5 Utilities
• Code Conversion Utility
• Tamil Typing Utility
3 Translation Support System
3.1 Tamil Parser
3.2 Universal Networking Language (UNL) for
3.3 Heuristic Rule Based Automatic Tagger
4. Human Machine Interface System
4.1 Text-to-Speech (Ethiroli)
4.2 Poonguzhali (A Chatterbot)
5. Localization
5.1 Tamil Office Suite (Aluval Pezhai)
• Tamil Word Processor (Palagai)
• Presentation Tools for Tamil (Arangam)
• Tamil Database (Thaenkoodu)
• Tamil Spreadsheets (Chathurangam)
5.2 Tamil Search Engine (Bavani)
6. Language Technology Human Resources
6.1 Creating awareness among the general
6.2 Language Technology Training
6.3 Co-ordination with Linguistic & Subject
6.4 Co-ordination with Govt. & Industry
7. Standardization
8. Research Activities
8.1 Text Summarization
8.2 Latent Semantic Indexing
8.3 Knowledge Representation System based
on Nyaya Shastra
9. Publications
10. The Team Members
XII. Perso-Arabic Resources Centre, CDAC, GIST,
Pune (Urdu, Sindhi, Kashmiri)
1. The PASCII Standard
1.1 Characteristics of Proposed Standard
1.2 Standardization of Urdu Keyboard
1.3 Standardization of Fonts
2. Script Core Technology & Rule Engine
2.1 Glyph Property Editor
2.2 Rule Engine
2.3 Testing Version of Urdu Editor
2.4 Font Glyph Editor
2.5 The Font Editor Utility
3. Devanagari/ Gurmukhi to Urdu Transliteration
3.1 Following is a Sample of Hindi Text
Transliterated into Urdu (UTRANs)
3.2 Following is a Sample of Punjabi Text
Transliterated into Urdu (UTRANs)
4. Dictionary Development Tools
5. Nashir – Word processor
6. The Urdu SDK Controls
7. Pocket Translator
8. Multiprompter
9. LIPS Advance Creation Station & LIPS for
10. Perso-Arabic Web Site Development
XIII. Jawaharlal Nehru University, Delhi (Sanskrit,
Japanese, Chinese)
1. Sanskrit Lessons and Exercises
2. Devanagari Script Learning and Character
Recognition System
3. Sanskrit-English Lexicon
4. Dictionary of Nyaya Terms
5. Panini Astadhyayi
6. Dhatu to DhatuRoop
7. DhatuRoop to Dhatu
8. Sandhi
9. Japanese Lessons
10. Japanese Script Learning and Character
Recognition System
11. Japanese-Hindi Lexicon
12. HTML Content Converter
12.1 Other Code Converters
13. Current/ Future Tasks
◗ Indian Institute of Technology
Department of Computer Science & Engineering
Kanpur - 208 016
(RCILTS for Hindi & Nepali)
Prof R M K Sinha
Tel : 00-91-512-2597174,2598254
E-Mail :
Website : http:/ / users/ langtech
◗ Indian Statistical Institute
Computer Vision and Pattern Recognition Unit
203, Barrackpore Trunk Road, Kolkata –700035
(RCILTS for Bengali)
Prof. B.B. Chaudhuri
Tel : 00-91-33-25778086 Extn 2852, 25781832, 25311928
E-Mail :
Website : http:/ / ~rcbangla/
◗ Indian Institute of Technology
Department of Computer Science & Engineering
Mumbai- 400 076
(RCILTS for Marathi & Konkani)
Prof. Pushpak Bhattacharyya
Tel : 00-91-22-25767718,25722545 Extn 5479, 25721955
E-Mail :
Website :
◗ University of Hyderabad
Dept. of CIS
Hyderabad -500046
(RCILTS for Telugu)
Prof. K. Narayan Murthy
Tel : 00-91-40-23100500, 23100518 Extn 4017, 23010374
E-Mail :
Website : http:/ /
◗ Jawaharlal Nehru University
School of Computer and Systems Sciences
New Mehrauli Road, New Delhi – 110 067
(RCILTS for Sanskrit, Chinese & Japanese)
Prof. G.V. Singh
Tel: 00-91-11-26107676, 26101885
E-Mail :
Website : http:/ /
◗ Thapar Institute of Engineering & Technology
Department of Computer Science & Engineering
Patiala 147 001
(RCILTS for Gurmukhi)
Prof. R.K. Sharma
Tel: 00-91-175-2393137/ 393374, 2283502
E-Mail :
Website : http:/ /
◗ ER & DCI, Vellayambalam
Thiruvananthapuram -695 033
(RCILTS for Malayalam)
Prof. Ravindra Kumar
Tel: 00-91-471-2723333, 2725897, 2726718
E-Mail :
Website : http:/ /
index.jsp/ index.jsp
◗ Anna University
School of Computer Science & Engineering
Chennai - 600 025
(RCILTS for Tamil)
Dr. T.V.Geetha/ Ms. Ranjani Parthasarthi
Tel : 00-91-44-22351723, 22350397
Extn 3342/3347, 24422620, 2423557
E-Mail :
Website : http:/ / rctamil/ html/ eindex.htm
◗ M.S.University of Baroda
Department of Gujarati, Faculty of Arts, Baroda-390 002
(RCILTS for Gujarati)
Shri Sitanshu Y. Mehta
Tel : 00-91-265-2792959
E-Mail :
Website : http:/ / rciltg/
◗ Orissa Computer Application Centre
OCAC Building, Plot No. 1/ 7-D,
Acharya Vihar Square, RP-PO, Bhubaneswar – 751 013
(RCILTS for Oriya)
Shri S.K. Tripathi
Tel : 00-91-674-2582484, 2585851, 2554230(R), 2582490
E-Mail :
Website : http:/ /
◗ Utkal University
Department of Computer Science & Application
Vani Vihar, Bhubaneswar – 751 004
(RCILTS for Oriya)
Prof. (Ms) Sanghamitra Mohanty
Tel : 00-91-674-2585518, 254086
E-Mail :
Website : http:/ /
◗ Indian Institute of Science
Centre for Electronics Design and Technology (CEDT)
Bangalore – 560 012
(RCILTS for Kannada)
Prof. N.J. Rao
Tel. : 00-91-80-3466022, 3942378, 3410764
E-mail :
◗ Indian Institute of Technology
Department of Computer Science & Engineering
Panbazar, North Guwahati, Guwahati -781 031, Assam
(RCILTS for Assamese & Manipuri)
Prof. Gautam Barua
Tel : 00-91-361-2690401, 2690325-28 Extn 2001, 2452088
E-Mail :,
Website : http:/ / rcilts
◗ Centre for Development of Advanced Computing
Pune University Campus, Ganesh Khind Road
Pune - 411 007
(RCILTS for Urdu, Sindhi & Kashmiri)
Shri M.D. Kulkarni
Tel : 00-91-20-25694000, 25694002-09
E-Mail :
Website : http:/ /
List of Resource Centers
◗ Indian Institute of Information Technology
Madhya Pradesh
(IT localization solutions for Madhya Pradesh)
Prof. D.P. Agrawal
Tel : 00-91-751-2449701
E-Mail :
◗ Birla Institute of Technology
Mesra, Ranchi
(IT localization solutions for Jharkhand)
Dr. P.K. Mahanti
Tel : 00-91-651-2275333
E-Mail :
◗ Banasthali Vidyapith
Banasthali, Rajasthan
(IT localization solutions for Rajasthan)
Dr. Aditya Shastri
Tel : 00-91-1438-228647/ 28787
E-Mail :
◗ Banaras Hindu University
Institute of Technology
Uttar Pradesh
(IT localization solutions for Uttar Pradesh)
Dr. K. K. Shukla
Tel : 00-91-542-2307055/ 56
E-Mail :
◗ New Government Polytechnic
Patliputra Colony, Patna
(IT localization solutions for Bihar)
Dr. R.S. Singh
Tel : 00-91-612-2262866/ 700
E-Mail :
◗ School of IT
G.G. University
(IT localization solutions for Chhattisgarh)
Dr. Anurag Shrivastva
Tel : 00-91-7752-272541
E-Mail :
◗ Indian Institute of Technology
Roorkee, Uttaranchal
(IT localization solutions for Uttaranchal)
Dr. R. C. Joshi
Tel : 00-91-1332-285650
E-Mail :
◗ Indian Institute of Technology
Kanpur, Uttar Pradesh
(Hindi to English Machine-aided Translation System
based on Anubharati Approach)
Prof. R.M.K. Sinha
Tel : 00-91-512-2597174
E-Mail :
◗ Indira Gandhi National Centre for the Arts
New Delhi
(Development of Digital Library for Regional Heritage)
Prof. N.R Setty
Tel : 00-91-11-23385277
E-Mail :
◗ Institution of Electronics and Telecommunication
New Delhi
Wg Cdr (Retd) Dr. M.L. Bala
(IT Learning material in Hindi)
Tel : 00-91-11-4631810
E-Mail :
◗ Centre for Development of Advanced Computing
(Core Technology Development for Hindi)
Shri Mahesh D. Kulkarni
Tel : 00-91-20-25694000,25694002-09
E-Mail :
Other Centers
◗ Centre for Development of Advanced Computing
“Anusandhan Bhawan”
C-56/ 1, Institutional Area, Sector-62
(Development of Parallel Text Corpora for 12 Indian
Languages & Annotated Speech Corpora for Hindi,
Marathi &Punjabi)
Shri. V.N. Shukla
Tel : 00-91-95120-2402551-6
E-Mail :
◗ Centre for Development of Advanced Computing
Plot E2/1, Block GP, Sector V, Salt Lake
Kolkata – 700091
West Bengal
(Development of Annotated Speech Corpora for
Bengali, Assamese & Manipuri)
Dr. A.B. Saha
Tel : 00-91-33-23579846, 23575989
E-Mail :
List of CoIL-Net Centers
1. Calendar of Events-Year 2002 1L-1R
2. Reports
2.1 TDIL Vision 2010 2L-3R
2.2 LT-Business Meet & TOT 2001 4L-16L
2.3 Intellectual Property Rights (IPR) 16R-16R
2.4 UNESCO-Symposium on Language 17L-19R
in Cyber Space
2.5 SCALLA 2001-Sharing Capability 20L-20R
in Localisation & Human Language
2.6 UNESCO - Workshop on Medium 21L-22L
Term Strategy for Communications
& Information
2.7 The Asia Pacific Development 22R-23L
Information Programme (APDIP)
2.8 Workshop on Corpus-based 23R-23R
Natural Language Processing
2.9 1st Workshop on Indian Language OCR 24L-24R
2.10 1st International Conference on 25L-25R
Global WordNet
3. Standardization
3.1 Revision of Unicode Standard-3.0 26L-37R
for Devanagari Script
3.2 Design Guides (Sanskrit, Hindi, 38L-75R
Marathi, Konkani, Sindhi, Nepali)
3.3 Indian Standard Font Code (INSFOC) 76L-77R
3.4 Indian Standard Lexware Format 78L-86R
4. 4.1 Reader’s Feedback 87L-87R
4.2 Frequently Asked Questions 88L-91R
Contents January 2002 (Magh) & Contents Jan. 2001 (Magh)
1. Message of Hon’ble Minister Sh. Pramod Mahajan 1-L
2. TDIL Programme 1-R
3. Overcoming the Language Barrier 2-L
4. Achievements 2-L
5. Resource Centres for Language Technology Solutions 3-R
6. Potential Products & Services 4-L
7. TDIL Website 4-L
8. Implementation Strategy 4-L
9. International Programmes in 4-R
Multilingual Computing
10. MNC Products Supporting Indian Languages 4-R
11. Tentative List of Indian Language Products 5-L
12. Portals Supporting Indian Languages 7-L
13. Other Efforts 7-L
14. Major Events of Year 2000 7-R
15. Indian Language Technology Vision 2010 9-L
Contents April 2002
1. tdil@Elitex 2002 1L-2L
1.1 TDIL Programme 2R-3L
2. Readers Feedback from Abroad 3R-3R
3. Calendar of Events-Year 2002 4L-4L
4. Standardization 4R-4R
4.1 Gujarati Code Chart 6L-24R
4.1.1 Gujarati Code Chart Details
4.1.2 Gujarati Script Details
4.1.3 Typical Colloquial Sentences
4.2 Malayalam Code Chart 26L-44R
4.2.1 Malayalam Code Chart Details
4.2.2 Malayalam Script Details
4.2.3 Typical Colloquial Sentences
4.3 Oriya Code Chart 46L-62R
4.3.1 Oriya Code Chart Details
4.3.2 Oriya Script Details
4.3.3 Typical Colloquial Sentences
4.4 Gurmukhi Code Chart 64L-82R
4.4.1 Gurmukhi Code Chart Details
4.4.2 Gurmukhi Script Details
4.4.3 Typical Colloquial Sentences
4.5 Telugu Code Chart 84L-101R
4.5.1 Telugu Code Chart Details
4.5.2 Telugu Script Details
4.5.3 Typical Colloquial Sentences
5. Quick Reference to Previous Issues 102L-102R
Special Issue on Language Technology Business Meet
Patrons Message 1-1
Programme Schedule 1-1
I Machine Aided Translation (MAT) 2-3
II Operating System (OS) 3-3
III Human Machine Interface System (HUMIS) 4-16
IV Tools 16-29
V e-Content 30-34
VI Other Milestones 37-38
Quick Reference Guide 39-39
Contents Sept. 2001
Contents May 2001 (Jyeshtha)
1. Calendar of Events 1-L
2. TDIL Website 2-L
3. TDIL Meet 2001 4-L
4. UNESCO Expert Group on Multilingualism in Cyberspace 9-L
5. Lexical Resources for Natural Language Processing 10-L
6. Symposium on Translation Support System 10-L
7. Universal Networking Language 10-R
8. Indo-UK workshop on Language Engg. for South 11-R
Asian Languages
9. New Software Testing Facility 12-R
10. Now Domain Name in Regional Languages 12-R
11. Book Shelf 13-L
12. Resource Center(s) for Indian Language 13-L
Technology Solutions – 1st Year (2000-2001) Progress
13. Impediments in IT Localization & Penetration 15-L
14. Feedback on UNICODE Standard 3.0 15-R
15. Indian Language IT Market 21-L
16. MS Office XP with Indian Language Support 21-L
17. Call for Technologies 21-R
Quick Reference to Previous Issues
Ministry of Communications & Information Technology
Department of Information Technology
Electronics Niketan, 6, CGO Complex, New Delhi-110003
Telefax : 011-2436 3076 E-mail : Website : http:/ /
Contents January 2003 (Magh)
Contents July 2002 (Ashadh)
1. Calendar of Events-Year 2002 1L-1R
2. TDIL Vision 2L-2R
3. Reader's Feedback 3L-3L
4. Universal Digital Library 3R-15R
5. Indo European IEMCT Conference 16L-17R
6. Indian Language Spell Checker Design Workshop 18L-24R
7. Indian Standard Font Code 25L-30R
(Devanagari, Gujarati, Punjabi, Malayalam)
8. INSROT Revision 31L-32R
(Indian Script to Roman Transliteration)
9. Unicode Standardization
9.1 Bangla Code Chart 34L-50R
9.1.1 Bangla Code Chart Details
9.1.2 Bangla Script Details
9.1.3 Typical Colloquial Sentences
9.2 Assamese Language Details 52L-60R
9.2.1 Typical Colloquial Sentences
9.3 Manipuri Language Details 62L-67R
9.3.1 Typical Colloquial Sentences
10. Quick Reference to Previous Issues 68L-68R
1. Calendar of Events-Year 2002-03 1L-1L
2. Reader’s Feedback 1R-2R
3. TDIL Vision 3L-3R
4. Technology Watch 4L-4L
5. MAT Evaluation & Benchmarking 5L-16R
6. OCR Evaluation & Benchmarking 17L-21R
7. Transliteration Tables 22L-26L
(Punjabi to Urdu & Hindi to Urdu)
8. Unicode Standardization
8.1 Kannada Code Chart 28L-40R
8.1.1 Kannada Code Chart Details
8.1.2 Kannada Script Details
8.1.3 Typical Colloquial Sentences
8.2 Tamil Code Chart 42L-56R
8.2.1 Tamil Code Chart Details
8.2.2 Tamil Script Details
8.2.3 Typical Colloquial Sentences
8.3 Perso-Arabic Standard for 58L-73R
Information Interchange
8.3.1 Urdu Design Guide
8.3.2 Typical Colloquial
Sentences in Urdu
8.3.3 Sindhi Design Guide 74L-79R
8.3.4 Typical Colloquial
Sentences in Sindhi
8.3.5 Kashmiri Design Guide 80L-84R
8.3.6 Typical Colloquial
Sentences in Kashmiri
8.4 Vedic Code Set 86L-103R
9. Quick Reference to Previous Issues 104L-104R
Contents October 2002 (Kartik)
1. Calendar of Events-Year 2003 1L-1R
2. Reader’s Feedback 2L-2R
3. TDIL Vision 3L-3R
4. Conference Reports 4L-12R
5. Language Technology Papers - Abstracts
5.1 Human Machine Interface 14L-38R
System (HUMIS)
a) Integrated I/O Environment
b) Optical Character Recognition
c) Speech Recognition
d) Speech Synthesis
e) Speech Processing
f) Typesetting
5.2 Knowledge Resources (KR) 42L-44R
a) Corpora
b) Dictionary
c) Lexical Resources
5.3 Knowledge Tools (KT) 46L-58R
a) Lexical Tools
b) Utilities
5.4 Language Engineering (LE) 60L-64R
a) Grammar
b) Language & Script Analysis
5.5 Localisation (L) 66L-66R
a) System Software
5.6 Translation Support Systems (TSS) 68L-72R
a) Machine Translation
b) Universal Networking Language
c) Wordnet
5.7 Standardisation 74L-74R
a) Existing Standards
b) Draft Standards
Index 76L-89R
a) Paper - abstract
b) Addresses
Quick Reference to Previous Issues 90L-90R