Mining The Treasure of Palm Leaf Manuscripts Through Information Retrieval Techniques

The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/2059-5816.htm
DLP
35,3/4 Mining the treasure of palm leaf
manuscripts through information
retrieval techniques
146 Bhupendra Singh and Neelu Jyoti Ahuja
School of Computer Science, University of Petroleum and Energy Studies,
Received 12 July 2019 Dehradun, India
Revised 4 October 2019
Accepted 21 October 2019
Abstract
Purpose – This paper aims to popularize information retrieval from palm leaf manuscripts among
computer scientists to make available the guidance of the age-old heritage in shaping the future.
Design/methodology/approach – With computer technology penetrating every aspect of life,
information retrieval algorithms can be exploited to help build a system which can dig into the ocean of
knowledge from these manuscripts.
Findings – The knowledge in them covers all aspects of life. Be it religious beliefs, literature, science,
mathematics, or any other. However, due to discontinuation of practice of copying their content on fresh
leaves, they now possess a fragile life which needs to be preserved at the earliest. The modern means of
digitization can help in their preservation.
Research limitations – The Government of India and other organizations are doing commendable job of
preserving and safeguarding country’s heritage and age-old knowledge system through the movement of
digitization. In the years to come, the agonizing problem of manuscripts degradation will be eradicated
completely. However, next when it will come to mining the knowledge treasure out of these manuscripts, we
would be confronted with another helpless situation.
Practical implications – The digitization process would capture the manuscripts from present physical
palm leaf to digital image form by clicking high-quality pictures. All the text in a palm leaf will be available in
the form of images, but on these images, a simple search for any word would not be possible.
Originality/value – Working towards mining the treasure of knowledge from the palm leaf manuscripts,
hordes of challenges have been outlined. Over and above the problem of preventing decay to palm leaf
manuscripts is the challenge of deciphering text, image analysis, information retrieval and search. Search is
further associated with issues of meaningful and useful extraction through semantic analysis. This paper
advocates the dire need for systematic research to be undertaken in this field opening up avenues for past
knowledge to guide future prospects in several domains.
Keywords Palm leaf manuscript, Information retrieval, Ancient Indian science, Algorithm,
Computer intelligence, Historical documents
Paper type General review
Introduction
India is known as one of the home of world’s ancient knowledge. Many of the modern day
discoveries can be linked to the foundation which our ancestral scholars have laid in past.
The transmission of knowledge from ancient historical time was initially through oral mode
Digital Library Perspectives
and much later the writing system was developed. History of writing in India can be dated
Vol. 35 No. 3/4, 2019
pp. 146-156
back to 3rd millennium BCE in the form of Indus Script by Harappa people (non-deciphered
© Emerald Publishing Limited till today), and Ashokan edict in the form of Brahmi script. However, they all are on stone as
2059-5816
DOI 10.1108/DLP-07-2019-0026 carvings.
The paper for writing which we use today is credited to China; however, the Chinese kept Mining the
this knowledge of producing paper as close-guarded secret for many centuries. Later, it treasure of
reached Arabs from a captured Chinese prisoner. Before this invention, ancient sages and palm leaf
scholars had been using leaf of Palm Tree for writing since fifth century BCE, reason being
its predicted life of 200-300 years. One of the oldest surviving palm manuscripts, a Sanskrit
manuscripts
text Shaivism, dates back to ninth century CE. It was discovered in Nepal and currently
preserved in the library of Cambridge University. 147
One can look into predicting the possible future only after knowing the past. Indeed,
today all the inventions and discoveries are happening based on our past-acquired
knowledge. Information which is vital to our survival is embedded in our genes, but the
societal and personnel significant work has to be remembered and passed on to next
generation either vocally or in the form of writings and pictures. The Hindu holy texts, the
four Vedas were first passed over orally, preserved with precision, up to the end of first
millennium BCE. The ancient scholars (Rishis) would listen, remember and chant the hymns
to their disciples who would repeat and the same cycle continued.
With the advent of writing system in India around third century BCE in the form of
Brahmi script, it was possible to capture more and more significant events and observations.
In the later centuries, Palm leaf manuscripts became the most common form of writing in
India and also in South Asian countries like Nepal, Sri Lanka, and Indonesia. Since ancient
times, these manuscripts have been the source of knowledge for religion, understanding
universe (its nature, laws, and language) and help us progress our life to be better, as a
guiding Guru.
Palm leaf preparation for writing

The palm leaves are collected from petioles of palm tree and then cut into sizes as required.
They are then boiled in water and later on after drying, placed between two wood planks of
the same size as the leaves. After two days the leaves are taken out, polished and then dried
again for two more days. Finally, the leave is ready. However, the writing on the leaf is not
as simple as writing on paper with pen. The writing on the palm leaves is carried out by the
practiced professional called Lipikaras, by scribing with the help of metal Stylus (Figure 1).
History stored on palm leaf manuscripts

The amount of palm leaf manuscripts available is of the order of millions. Some insight of
the significance of Palm Leaf Manuscripts is provided by Sah (2002). Here we are
highlighting only a few areas of work.
Figure 1.
Palm leaf manuscript
DLP Literature. Besides religion and war, in literature, one of the greatest Sanskrit poets ever,
35,3/4 Kavi Kalidasa, of fourth to fifth century, in his poem Meghdootam, with his vivid
imagination, has inspired poets of coming centuries. In his another masterpiece, a play
Abhijñanasakuntalam, with his imaginative power, he brings the character of Shakuntala to
life. The latter is the mother of Bharata, who gave the original name to a country, today
known as India. As Pandit Jawaharlal Nehru notes, Kalidasa’s work excludes life’s misery
148 and colors life with the beauty of nature. Unfortunately, only 5 per cent of his work survived
upto this century.
Kamasutra is well known for its explicitly erotic acts and teachings. Numerous
manuscripts with titles Smaradipika, Pancasayaka, Ratimanjari can still be found in Orissa
State Museum under the same discipline.
Mathematics. Dr K V Sarma, a well-known manuscriptlogist, rightly notes that, it would
seem that most of the work had been done on epics, literature, philosophy and religion,
which may overshadow the impressive quality of record of works in science, despite the fact
that a great deal of work has been lost or destroyed over many centuries. Undoubtedly, there
is a loss of important work as ancient authors Crake and Sarnagadhara refer to much earlier
texts which are no longer available. Below we have briefly described few great ancient
mathematical texts and scholars.
Yuktidipika (Lamp of Astronomical Rationale): A sixteenth century commentary by
Sankara on Tantrasangraha of Nilakantha Somayaji (born 1443 CE) supplies detail
rationales of the theory of numbers, division, square and square root, summations of natural
numbers, squares, cubes, summations of summations, rules relating to triangle, circle and
many others.
Yuktibhasa: A major treatise on mathematics and astronomy by Jyesthadeva (1503CE) in
the Kerala School of Mathematics include work on infinite series expression of a function,
power series, taylor series, trigonometric series of sine, cosine, tangent and other functions.
It is believed to be the foundation of calculus and predates those of James Gregory and
Newton by many centuries (Divakaran, 2010).
Brahmasphutasiddhanta: The work of Brahmagupta (628 C) contains the rules for
manipulating both negative and positive numbers and a method for computing square
root. It contained the first clear description of the quadratic formula which we use
today.
Surya Sidhanta: A sixth century CE text gives estimates of the diameters of a number of
planets, for example Mercury, Saturn with an error of less than 1 per cent.
Chandahsastra: was written by Pingala, a second and third century BCE scholar who
worked extensively in combinatorics and also developed a binary system for describing
prosody. He used binary number in the form of long and short syllables.
Aryabhata: A fifth century scholar who believed in geocentric model of the solar system.
However, he presented many ideas which laid the foundation of modern astronomy and
mathematics. He asserted that moon shines by reflecting light of Sun and correctly
explained the causes of Solar and Lunar eclipse, and stated that earth rotates about its axis.
He calculated value of Pi (p ), and the length of the year as 365 days 6 h 12 min 30 s only 200 s
longer to modern accepted values. Aryabhata’s using of sine (jya), cosine(kojya) and inverse
sine(olkram jya) influenced the birth of trigonometry.
Bhaskara II: His work Siddhanta Siromani is divided into four parts Lilavati (Arithmetic),
Bijaganita (Algebra), Grahaganita(mathematics of planets) and Goladhyaya (Sphere) which
are themselves considered as four different individual works. His work on calculus predates
Newton and Leibniz by over 500 years.
Medicine. The geography of India has vast wealth of flora and fauna, that may be the Mining the
reason for its enormous knowledge of medicine in the form of Ayurveda. The foundation of treasure of
Ayurveda is based on two ancient texts that have still survived.
Charak samhita. The second century text containing eight books and one hundred
palm leaf
twenty chapters. Mentioning medicinal uses of 177 substances of animal origin, 341 manuscripts
medicinal plants, and 64 animals. It also includes section on the importance of diet, hygiene,
causes and prevention of diseases and others.
Sushruta samhita. A second century BCE text on medicine and surgery. There is only a
149
small subset of manuscripts that are printed, and still a critical edition of it is yet to be
prepared. The text talks about the efforts for prevention of diseases. It also discusses
surgical procedures for plastic, rhinoplastic, lithotomic and obstetrical procedures.
In a famous “rhinoplasty” operation of Pune in March 1793, an Indian surgeon used skin
from the foreheads of an injured soldier to repair his noses. Two senior British surgeons
witnessed this operation and prepared its detailed descriptions with diagrams. The publication
in Europe in 1816 of their account later gave birth to modern plastic surgery (pib.nic.in, 2019).
In a paper by Bhattacharya (2014), he has listed some great manuscripts from Orissa
State Museum in the area of Ayurveda. For example, Cikitasamanjuri belonging to seventh
to ninth century CE while Vesaja Ratnavali deals with surgery. There are lots of other
manuscripts as well which talk of Syrups, avian Medicine, Pediatric Medicine, Chronic
diseases, blood and septum based pathology etc. In one rear treatise on cosmetology,
Kalahandi Naresa, treatment for alopecia is described, a diseases incurable in modern
science.
In Ayurvedic literature, bulk still remains in the form of unpublished manuscripts. A
twelfth century alchemist, Vagbhata, in his Rasaratnasamuccaya(ch4) talks about the
medicinal aspects of gems and the curative properties of medicines using their ashes.
Collection of Palm Leaf Manuscripts (PLM) in India

Orissa state museum
A vast collection of around 50,000 manuscripts from a range of topics, one such is the rare
collection of an Ayurveda manuscript named Cikitsamanjuri belongs to seventh to ninth
century CE and other Vesaja Ratnavali which deals with surgery. The Museum has also
developed an online system from where one can purchase digital images of listed Manuscripts.
Government Oriental Manuscripts Library and Research Center, Chennai

The center started in year 1869 CE and has enormous collections of ancient knowledge with
a total of 50,180 palm leaf manuscripts. For example, A manuscript of an author who lived
before kalidasa period, around sixth century, A manuscript on Architecture, Amsumad-
bheda, another manuscript Kapphanabhyudaya dealing with the life of king Kapphana
written around ninth century CE. The collection also contains some non-deciphered
manuscripts like Gangavamsanucarita, which is a poem giving the history of princes, who
ruled the Kalinga kingdom (Cuttack). The institute also publishes the unpublished
manuscripts annually.
Oriental Research Institute and Manuscripts Library, Chennai

It is an academic department which comes under the University of Kerala. The institute has
around 50,000 collection of PLM covering almost every field like Philosophy, Nyaya, Kavya,
Tantra, Jyotisha, Shilpa, Vedanta, Astronomy, Astrology, Medicine, Mantra Vastuvidya
(Traditional architecture) and many others.
DLP The Asiatic Society, Bengal
35,3/4 The Asiatic Society was established in the year of 1794. The collections of manuscripts
cover from the period seventh century to the nineteenth century. Very few of rarest
collections of the museum are Kiranavali, Horoscope of a Muslim of the Mughal Court
(1640CE.), a text on Buddhist Nyaya, A Deed of Mortgage (1639), Vajrayana text (11th c.)
and others. The manuscript of Kubjikamatam is of the seventh century CE and the
150 manuscript of Rigveda Padapatha, copied in 1362 CE, seems to be the oldest manuscript of
Rigveda.
Degradation of PLM over the time

The palm leaf manuscripts have a natural life of 250-300 years and then historically, the
scholars used to copy their information on fresh leaves and this process continued. In the
past 200 years, this chain has been broken and now we are losing these historical heritages
very fast due to degradation. The threats to manuscripts are numerous, some being natural
like dampness, fungus, discoloration, seepage of ink, smearing along cracks, some being the
damage because of ants, rats, cockroaches, termites, others are human’s mishandling and
sometimes natural disasters like floods and earthquakes.
The ASA archives in Nepal houses more than 6,700 manuscripts from the medical,
tantric, astrology, Buddhism, Vedic, Purana and many others. Though they have digitized
almost all of their collection but the damage already done is not irreversible. Out of the 400
rolls of tamsuk (land grant documents), 151 have been chewed away by rats. Many
manuscripts underwent vertical cracks, tears, folds because of the rolls being pressed down
over the years (Naoko et al., 2005).
Traditionally the preservation of the manuscripts has been done through indigenous
methods as per Sahoo et al. (2004) like:
Dried powdered leaves of Aswagandha in small packets kept with the manuscripts
and covered in clothes to repel insect attack.
Coatings of lemon-grass oil to strengthen the leaves of manuscripts and also destroy
the growth of microorganisms.
Powdered Ajwain is also used as an insect killer and fungicide.
However, today’s technology requires the use of Digitization for their life long preservation,
of the precious knowledge present on them.
Treasures still in dark

As per Prof David Pingree, there is estimate of about three million manuscripts still existing.
Regrettably, only few of them have been subjected to modern analysis and gradually the
whole ensemble is decaying. In addition, since there are only few scholars trained to read
and understand these texts, it is of utmost risk that most of these manuscripts may
disappear before anyone is able to analyze and describe their content correctly.
According to UNESCO, more than a hundred thousand palm-leaf manuscripts are still
unpublished on various aspects of traditional Indian knowledge in Tamil itself. One
example is tantra theology manuscript, Paramesvaratantra of year 828 CE in the library of
University of Cambridge which is still unpublished.
Shepherds in Gilgit in Kashmir valley in 1931, discovered a large number of manuscripts
opening a new chapter on Buddhist Kashmir. In India, these manuscripts are among the
oldest surviving manuscripts.
Below is the list of findings of Dr. K V Sarma and his team during the extensive survey of Mining the
manuscripts in 1995-1997 present in Kerala and Tamil Nadu, in both public and private treasure of
collections where 400 repositories with 150,000 manuscripts were extensively surveyed by K
V Sarma and his team (Sarma et al., 2002) (Tables I).
palm leaf
Sadly, only 7 per cent of these texts are in print which is regarded as the sole contribution manuscripts
of India to science through ancient manuscripts.
151
Digitization efforts
UNESCO under the program “Memory of the world” is engaged in the mission of facilitating
preservation, assistance in providing universal access to heritage documents and increasing
the awareness worldwide about the existence and significance of heritage documents.
National Mission for Manuscripts (NMM)

The National Mission for Manuscripts (NMM) established in February 2003, by the Ministry
of Tourism and Culture, Government of India, is a unique project committed to preserving
the vast manuscript wealth distributed throughout of India. It is estimated that there are ten
million manuscripts, in India. NMM is working towards fulfilling its motto, “conserving the
past for the future”. The NMM is developing a database of all Indian manuscripts named
Kriti Sampada which will be available through internet.
As per Ministry of state for Culture and Tourism, till 2017 2.8 lakh manuscripts have
been digitized as per PIB India (Digitization of Manuscripts, 2017).
With its commissioning since 2003 and given the volume of manuscripts to be digitized.
It is in question whether all the manuscripts would survive given their present brittle state.
There is one more challenge that there are only few scholars who can decipher them. PMM
has organized workshops on Manuscriptology and Paleography to tackle this challenge.
The NMM has provided 23 states with trained conservators who have discovered
manuscripts in the state of Mizoram, not known before 2010.
NMM has also submitted a proposal for setting up of Digital Manuscripts Library to
store Digital Manuscripts for accessibility purpose. It also has a publication program where
30 unpublished manuscripts have been published.
Information retrieval system for palm leaf manuscripts repository

The Government of India and other organizations are doing commendable job of preserving
and safeguarding country’s heritage and age-old knowledge system through the movement
of digitization. In the years to come, the agonizing problem of manuscripts degradation will
be eradicated completely. However, next when it will come to mining the knowledge
treasure out of these manuscripts, we would be confronted with another helpless situation.
Section Title No. of manuscripts
I Astronomy 2,919
II Astrology 6,794
III Medicine 1,286
IV Veterinary Science 146
V Chemistry 166
VI Physics 326 Table I.
VII Botany 8 Manuscripts
VIII Architecture 599 collection in Kerala
Total 12,244 and Tamil Nadu
DLP The digitization process would capture the manuscripts from present physical palm leaf to
35,3/4 digital image form by clicking high quality pictures. All the text in a palm leaf will be
available in the form of images, but on these images a simple search for any word would not
be possible. Consider for example, if you are reading a PLM on astrology and want to search
for your zodiac prediction, without textual format you will have to do manual search
through all the leaf images involving stressful exercise of your eye.
152 Thus, the manuscripts need to be available in textual form and not only in image form.
The next problem which we will encounter is to categorize millions of manuscripts into
title/headings of Astrology, Astronomy, Medicine, Mathematics, Yoga, religion and so on.
Doing this categorization manually would not only require massive human resource and
time, but also the errors would be untraceable. We thus need, an automated system for
categorization, and sub categorization.
The system should also be intelligent enough to create category/title name itself. For
example, there may be a particular disease cure in significant number of manuscripts, thus
that particular disease should pop out as a sub-category by itself under medicine category.
The system should also be capable of adding any new manuscript into its structure later on
and arrange and rearrange itself as and when required.
The categorization will help scholars and others to search for their topic from within the
different categories and sub-categories. The great king Ashoka is well known to almost all
those who have gone through Indian Primary Education. Most of the story of Ashoka has
been derived from a Buddha manuscript titled Ashokavadana. Thus if someone is interested
in knowing about Ashoka, he would have to go through the hierarchical structure of:
History ! Chronology ! Kingdoms ! Kings ! Ashoka
However, here we simply want all the manuscripts which deal with the word Ashoka in
their content. Thus, we would need a search engine which can retrieve all the manuscripts
which deal with Ashoka.
The search engine should be intelligent in nature. It should not only pop up manuscripts
on the basis of words (specifically called keywords) provided by the user but also look for
similar words or synonyms. Consider for example the case that historically king Ashoka was
known as Priyadasi as well. Thus, the search should also include synonym names to its
results. In another dimension of the problem, synonyms inclusion may fail as well. Consider
again, there may be many manuscripts detailing about the battle of Kalinga, administration
of Ashoka, life of people under Ashoka. All these manuscripts are closely related to the same
primary subject “Ashoka” but none may have any reference to “Ashoka” word or its
synonyms. These issues need to be dealt with the search for common linking words between
different manuscripts. For example, the common link can be “name of the city” some
manuscripts are describing and, the same “name of the city”, having Ashoka or its synonym.
These are the topics which come under the realm of semantic analysis.
An Intelligent Search Engine with semantic analysis is need of the hour.
Information Retrieval: All the above problems fall more or less under the purview of
information retrieval. According to the definition of Raghvan (Christopher et al., 2008),
“Information retrieval is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually stored
on computers).”
It works as:
processing all the documents and then finding all the tokens/words from them;
arranging the tokens alphabetically having information about the frequency of their
occurrence within the corpus and link to all the documents where they are present;
adding rules for associating different tokens; and Mining the
adding semantic rules for query processing which can look for possible synonyms treasure of
palm leaf
After processing the query from the user it provides results with the links to different
documents. The documents are presented to the user on the basis of their relevance, with the manuscripts
most relevant document on top of the search results.
Challenges to the proposed information retrieval (IR) system:
Though the above discussed problems may be addressed in the proposed system but the
153
amount or level of intelligence of the system will remain in question. For example, if
someone searches with the query:
Query: “Ashoka”s battle after Kalinga’.
How the system would respond? Should it pop out all the manuscripts describing Ashoka’s
battle and the particular battle of Kalinga? How it comes up with a date which is not given in
the query and also, may not be given in any particular text.
The challenge of the system would enter a new dimension, if someone requires the
manuscripts with the queries as:
Query1: Was Ashoka cruel?
Query2: How much highest amount of money Ashoka ever had?
Answer of the first, requires boolean response in the form of yes or no, which no matter how
much a group of historians/scholars could debate, never a trivial yes/no would reach. But an
intelligent machine could argue with an empirical number, based on all the events where
people were killed, their number, of the direct consequences of Ashoka’s order and the
number of events where Ashoka’s decision resulted in savings of life, their number. Well
again the above framework of answering is another debate in itself.
We want to leave query 2 to the reader’s themselves for suggestions and see how a
machine could be taught to stand equal to human’s reasoning ability.
Prominent IR techniques based proposed solutions

Document categorization with clustering algorithm
The solution to the task of auto-categorization of huge volumes of palm leaf manuscripts can be
achieved with the help of clustering techniques. Clustering algorithm involves iteration several
steps to group similar data points into clusters. All the data points in a single cluster exhibit
similar properties and contrasting properties to other clusters. This technique can be utilized to
cluster documents having similarity into different categories. Similarity or closeness can be
defined as the documents having several words or their synonyms. In the work in (Christopher
et al., 2008), demonstrates the categorization of medical documents on the concept that they
contain a standard list of medical keywords. Thus in their work, they build a database of the
medical terms and created categories accordingly. Any document is assigned to a particular
category based on the frequency of keywords contained in the document for a particular
category. In other similar work (Jindal and Taneja, 2015), an ontology which is defined as a
common set of terms which describes or represent a domain is used. Along with, term weighting
approach is also used in which the term is assigned weights as per its frequency of occurrence in
the document. This approach helps increase the accuracy of the clustering algorithm. When
applied to the corpus of web pages having categories like football, basketball, cricket and
hockey, the technique yields 95 per cent recall and 93 per cent precision. There is a hierarchical
based clustering algorithm as well, which groups clusters among clusters in a top-down or
DLP bottom-up fashion. The algorithm starts with a single cluster having all the documents, it then
35,3/4 recursively splits clusters into sub-clusters. Similarly, the most similar clusters are merged.
Text summarization
Since the horde of manuscripts are unreadable by anyone and also the information may not
be relevant today. It makes sense to a obtain text summary on the PLM collections. Once the
reader finds the information relevant the full-length manuscripts would be accessed. Today
154 there are several techniques available for automatic text summarization, which tries to
produce a concise summary of the original document while preserving overall meaning and
key information content. These techniques are grouped into two categories extractive and
abstractive. Extractive techniques first find the most relevant sentences in the document-
based upon some sentence weighing method and then present the “k” most relevant
sentences. The weight of the sentence can be calculated as how closely or similarly the words
of the title or heading are appearing in the sentence. Or based on the location i.e. the sentences
appearing at the beginning of the document or the paragraphs are assigned higher weight.
The abstractive techniques, on the other hand, analyze the relationship among sentences in
the original text. The summary then produced by the abstractive methods is a reformulated
version of the original text. A number of summarization techniques have been covered and
discussed in work (Qazi and Goudar, 2018). In one example work (Allahyari et al., 2017), the
abstractive method works to summarize the Tamil language based documents.
Automated question answering system
This is the advanced version of the information extraction system as it takes the query from
the user in natural language and provides answers in the natural language itself. The process
starts with finding the type or category of the question which can be either Named Entity
using the words like “When”, “how much”, “how many”, “date/time”, “place” or the second
type of questions which uses the words like “why” or “how”. The first types of questions are
evaluated much with several techniques and methods proposed as can be found in (Kallimani
and Srinivasa, 2011), while the second type of questions are hard to answer and thus very
little attempts are made to answer these types of questions. After finding the type of the
question, the next step is usually an external ontology resource, for matching possible
answers. Aqualog (Bouziane et al., 2015) is an example of an ontology-based QA system
which allows the user to ask query in natural language and also to choose ontology.
Sentiment analysis
Sentiment analysis is the task of determining the opinion of the writer or a speaker on a topic.
The given text is analyzed for finding the polarity of the writer or speaker. The polarity can be
positive, negative or neutral. For example if someone has written the feedback about a product
or movie, the sentiment analysis task tries to find out whether the writer is positive/praising the
product or movie, or whether the writer is negative/criticizing the product or movie or the writer
is neutral that is don’t have any positive/praising or negative/criticizing opinion about the
product or movie. This analysis can help find out bias in documents. For example, a cure for a
particularly common disease like cold may be prescribed in several PLMs but manuscripts with
the supported claim are supposed to contain least bias or no self-admiration by the author.
The task is usually carried out with the help of subjective lexicon which is the collection
of several words with an associated score indicating the positiveness, negativeness or
neutrality. The system then finds out subjective lexicons from the given text and sums up
all the individual scores to get a final score. Thus predicting the sentiment based on the final
score. One example application of sentiment analysis can be found in the work by
(Lopez et al., 2007), where the authors have studied online product reviews using the data from
online seller website and classified the reviews to different categories up to some success.
Metadata Mining the
To effectively access or retrieve digital documents by an IR system, metadata plays an treasure of
important role. It is a kind of structured information about the document like title, date of
creation which facilitates finding, accessing and retrieving digital documents. Metadata
palm leaf
finds its usefulness in a digital library, digital museum, digital archives, and others, manuscripts
especially for historical document collections. The NMM, India has developed its metadata
for palm leaf manuscripts and also made it available for public access.
Several works have suggested the framework for the metadata preparation, especially for
155
PLM. One such work (Fang and Zhan, 2015) proposes a metadata framework for the
management of digital and original PLM collections in Asia for increasing efficiency to use,
access and manage. The work also presents the current use of PLMs metadata schema in
different working projects, 16 in total, all around the world. They found that 13 elements were
present in almost all the 16 projects which includes “title”, “language”, “storage location” and
13 others. “Title” is found to appeal in all the 16 projects, thus advocating it as the useful and
necessary element for PLM metadata. In another work (Chamnongsri, 2018), metadata for
PLMs in Thailand is proposed. In the study of current metadata schemes, no single scheme is
found to meet the requirement of every digital PLM collection. In the work (Challa and Mehta,
2019), the researcher tries to evaluate the automatic metadata schema generation, especially
for Indian PLMs. As the automatic process gives more efficient, less costly and consistent
results in comparison to the manual, human-oriented processes. In addition to common
elements in PLM metadata, 3 new elements are proposed for the Indian PLM namely the total
damage on the boundary of the original manuscripts, total damage inside the main body of
the manuscript and the total number of words in the digital image of the original manuscript.
A total of 24 essential metadata schema properties are designed in this research. The results
obtained suggest decent success rate, as metadata element “title” is recalled 98 per cent time
with 97.19 per cent accuracy while “author” and “subject” elements with 94.38 per cent recall,
94.4 per cent precision and 96.41 per cent recall, 96.13 per cent precision respectively has been
achieved. When compared to manual metadata extraction process 18.58 per cent improved
recall rate is achieved along with 13.24 per cent improved precision. These results strongly
suggest that an automated system can be developed to generate metadata schema for the
PLM collections. Few known metadata schemes used for digital documents are MARC, EAD,
Dublin Core, MODS, etc. (Huang et al., 2007).
Conclusion
Working towards, mining the treasure of knowledge from the palm leaf manuscripts, hordes
of challenges have been outlined. Over and above the problem of preventing decay to palm
leaf manuscripts, is the challenge of deciphering text, image analysis, information retrieval
and search. Search is further associated with issues of meaningful and useful extraction
through semantic analysis. This paper advocates the dire need for systematic research to be
undertaken in this field opening up avenues for past knowledge to guide future prospects in
several domains. A brief introduction about the current available techniques for retrieving
the information out of the digital collection of the PLMs is also presented at the end.
References
Allahyari, M. et al. (2017), “Text summarization techniques: a brief survey”, arXiv preprint
arXiv:1707.02268.
Bhattacharya, D. (2014), “Historical notes: select palm leaf manuscripts on health care”, Indian Journal
of History of Science, pp. 293-297.
DLP Bouziane, A., Bouchiha, D., Doumi, N. and Malki, M. (2015), “Question answering systems: survey and
trends”, Procedia Computer Science, Vol. 73, pp. 366-375.
35,3/4
Challa, N.P. and Mehta, R.V.K. (2019), “Evaluation of automatic metadata schema for Indian palm leaf
manuscripts”, International Journal of Innovative Technology and Exploring Engineering, Vol. 8 No. 5.
Chamnongsri, N. (2018), “Metadata standards for palm leaf manuscripts in Asia”, Research Conference
on Metadata and Semantics Research, Metadata Development for Palm Leaf Manuscripts in
Thailand, Springer, Cham.
156
Christopher, D.M., Raghvan, P. and Schutze, H. (2008), Introduction to Information Retrieval’,
Cambridge University Press, 7 July.
Digitization of Manuscripts (2017), “Press information bureau of India”, 20-March-2017.
Divakaran, P.P. (2010), “Calculus in India: the historical and mathematical context”, Current Science, pp. 8-14.
Fang, X. and Zhan, J. (2015), “Sentiment analysis using product review data”, Journal of Big Data, Vol. 2
No. 1, p. 5.
Huang, W., Ruizhe, L. and Tan, C.L. (2007), “Extraction of vectorized graphical information from
scientific chart images”, Ninth International Conference on Document Analysis and Recognition
(ICDAR 2007), IEEE, Vol. 1.
Jindal, R. and Taneja, S. (2015), “A lexical approach for text categorization of medical documents”,
Procedia Computer Science, Vol. 46, pp. 314-320.
Kallimani, J.S. and Srinivasa, K.G. (2011), “Information extraction by an abstractive text summarization
for an Indian regional language”, 7th International Conference on Natural Language Processing
and Knowledge Engineering, IEEE.
Lopez, V., Uren, V., Motta, E. and Pasin, M. (2007), “AquaLog: an ontology-driven question answering
system for organizational semantic intranets”, Journal of Web Semantics, Vol. 5 No. 2, pp. 72-105.
Naoko, T., Yoriko, C. and Reiko, M. (2005), in Nepal, 14 November, Asian Art.
pib.nic.in (2019), “India’s contribution to plastic surgery”, Press Information Bureau, Government of
India, available at: http://pib.nic.in/feature/fe0299/f0602991.html
Qazi, A. and Goudar, R.H. (2018), “An ontology-based term weighting technique for web document
categorization”, Procedia Computer Science, Vol. 133, pp. 75-81.
Sah, A. (2002), “Palm leaf manuscripts of the world: material, technology and conservation”, Studies in
Conservation, Vol. 47 Supplement-1, pp. 15-24, doi: 10.1179/sic.2002.47.Supplement-1.15.
Sahoo, J. and Mohanty, B. (2004), “Indigenous methods of preserving manuscripts: and overview”, The
Orissa Historical Research Journal, pp. 28-32.
Sarma, K.V. and Kutumba Sastry, V. (2002), “Science texts in Sanskrit in the manuscripts repositories
of Kerala and Tamil Nadu, Rashtriya Sanskrit Sansthan”.
Further reading
Deshmukh, P.C., Pillay, K.J., Raju, T.S., Dutta, S. and Banerjee, T. (2017), “GTR component of planetary
precession”, Resonance, Vol. 22 No. 6, pp. 577-596.
Corresponding author
Bhupendra Singh can be contacted at: bhupendrabisht59@gmail.com
For instructions on how to order reprints of this article, please visit our website:
www.emeraldgrouppublishing.com/licensing/reprints.htm
Or contact us for further details: permissions@emeraldinsight.com

Mining The Treasure of Palm Leaf Manuscripts Through Information Retrieval Techniques

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Mining The Treasure of Palm Leaf Manuscripts Through Information Retrieval Techniques

Uploaded by

Copyright:

Available Formats

The current issue and full text archive of this journal is available on Emerald Insight at:

Palm leaf preparation for writing

History stored on palm leaf manuscripts

Collection of Palm Leaf Manuscripts (PLM) in India

Government Oriental Manuscripts Library and Research Center, Chennai

Oriental Research Institute and Manuscripts Library, Chennai

Degradation of PLM over the time

Treasures still in dark

National Mission for Manuscripts (NMM)

Information retrieval system for palm leaf manuscripts repository

Section Title No. of manuscripts

Prominent IR techniques based proposed solutions

You might also like