
International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 6- Dec 2013

A Survey of Recent Keywords and Topic Extraction Systems for Indian Languages
Vishal Gupta
Assistant Professor, UIET, Panjab University,
Sector-25, Chandigarh, India

Abstract: Keywords are the thematic words of a document and represent its topic. Search engines and document databases commonly use keywords to locate information and to determine whether two pieces of text are related to each other. Keyword extraction, also referred to as term mining, term extraction, term recognition, or glossary extraction, is a subtask of information retrieval. Its overall goal is to retrieve relevant terms from a corpus automatically. Automatic term extraction techniques mainly apply linguistic processing (automatic phrase chunking and part-of-speech tagging) to retrieve suitable keywords. The retrieved keywords are very helpful for modelling a knowledge domain and for supporting the construction of an ontology for that domain. Topic identification is the task of identifying topics that were previously unknown. It involves grouping the story documents that express the same idea about an event, and it also relates to associating story documents with topics that are already known. Moreover, keyword extraction is a useful step for semantic similarity, machine translation, knowledge management, and related tasks. This paper reviews recent keyword and topic extraction techniques for Indian languages.

Keywords: Keywords extraction, topic extraction, Indian languages, term extraction.

I. INTRODUCTION

Keywords are the thematic words of a document and represent its topic. Search engines and document databases commonly use keywords to locate information and to determine whether two pieces of text are related to each other. Keyword extraction, also referred to as term mining, term extraction, term recognition, or glossary extraction, is a subtask of information retrieval. Its overall goal is to retrieve relevant terms from a corpus automatically. Automatic term extraction techniques mainly apply linguistic processing (automatic phrase chunking and part-of-speech tagging) to retrieve suitable keywords. The retrieved keywords are very helpful for modelling a knowledge domain and for supporting the construction of an ontology for that domain. Topic identification [11] is the task of identifying topics that were previously unknown. It involves grouping the story documents that express the same idea about an event, and it also relates to associating story documents with topics that are already known. Moreover, keyword extraction is a useful step for semantic similarity, machine translation, knowledge management, and related tasks.

Topic identification [15] deals with identifying the words of a text that can represent its concept or topic. In recent years, this task has been studied by computational linguists in connection with areas such as anaphora resolution, coreference, and discourse. Identifying the relevant terms and concepts of a document is also critical in information extraction for finding important text documents, but such terms do not necessarily correspond to the topic or theme. Identifying relevant words involves assigning a numerical weight to each word of the document. Words with higher scores are considered relevant and important, and these terms can be taken to represent the whole document. This paper reviews recent keyword and topic extraction techniques for Indian languages.

II. KEYWORDS AND TOPIC EXTRACTION TECHNIQUES FOR INDIAN LANGUAGES

Preeti and Brahmaleen Kaur Sidhu (2013) [1] proposed a Punjabi keyphrase extraction system in which Punjabi text is input in Unicode format. The text is scanned to filter out special tokens such as \\, ||, (, ), [, ], *, {, }, !, ^, + and -. Several modifications are made: punctuation marks, brackets, and numbers are replaced by blank space. A word segmentation phase then identifies and separates the individual terms in the input document so that each term is represented as a separate token. The output of the word segmentation phase is the input to a part-of-speech (POS) tagger, which assigns each word a word class such as noun, pronoun, or adjective. After POS tagging, the part-of-speech tags are added to the database. The system then identifies phrases from the database using the subject-object-verb rule. The resulting list of candidate phrases is the input to the final step, key phrase extraction: the frequency of every phrase is calculated, and the most frequently occurring phrases are selected as Punjabi key phrases. The average number of key phrases extracted by the
system is 11, 9.6 and 6.6 for Punjabi stories, articles and news documents, respectively.

Gupta and Lehal (2011) [2] proposed automatic keyword extraction for Punjabi. The system has several steps: eliminating Punjabi stop words, identifying Punjabi nouns and stemming them with an automatic rule-based Punjabi noun stemmer, computing term frequency and inverse sentence frequency (TF-ISF), selecting Punjabi noun keywords with high TF-ISF values, and applying the title or news-headline sentence feature for Punjabi text documents. Punjabi nouns with high TF-ISF values are considered Punjabi keywords. Finally, the Punjabi keywords are obtained as the union of the keywords in the title and the keywords retrieved in the earlier step, i.e. the noun keywords with high TF-ISF values. The F-measure, recall and precision of this Punjabi keyword extraction system are 85.2%, 90.6% and 80.4%, respectively.

Kaur and Gupta (2011) [3] proposed another Punjabi keyword extraction system. It uses a hybrid approach, that is, it combines several different features to build the keyword extraction system. The system uses gazetteer lists generated from a Punjabi dictionary with a part-of-speech tagger. Keywords are thus retrieved from Punjabi text based on cue words, title words, and high-frequency nouns. The system gives very good results for each type of feature independently, and the final results are better when the features are combined. The F-measure, recall and precision are 93.03, 90.19 and 98.28, respectively.

Sarkar (2011) [4] proposed a keyphrase extraction technique for Bengali. The system comprises several phases: extraction of n-grams, identification of candidate keyphrases, and scoring of these candidates. Because Bengali is highly inflectional, a lightweight Bengali stemmer was developed for stemming the candidates. The system was tested thoroughly on a set of Bengali documents taken from an online Bengali corpus that anyone can download from the TDIL website.

Saraswathi et al. (2010) [5] proposed a keyword extraction system for Tamil and English as part of a bilingual information retrieval system. The aim of the approach is to return results for an input question in the same language as the query. They built a domain ontological tree in which each node holds entries in both languages. Part-of-speech tagging is used to find keywords in the input question: the question typed by the user is given as input to the tagger, which tags the sentence and splits it into its tagged parts. The nouns and verbs identified in the tagger output are treated as keywords for the search. Depending on the topic, these keywords are translated into the target language using the ontological tree, and the text documents are then retrieved on the basis of the keywords.

Jayashree R. et al. (2011) [6] proposed keyword extraction for Kannada document summarization. The system extracts keywords from pre-categorized Kannada text documents, collected from various online resources, by combining GSS (Galavotti, Sebastiani, Simi) coefficients with inverse document frequency and term frequency, and then uses the extracted keywords for summarization.

Sarkar (2011) [7] proposed automatic keyphrase extraction from Bengali documents. A keyphrase is a sequence of words that highlights the concepts of a text document; keyphrases help users to quickly grasp, manage, share and access the information in text documents. The paper presents a preliminary approach for extracting keyphrases from Bengali documents using two essential features: term frequency-inverse document frequency and the first occurrence of the term in the input text. The preliminary model works as follows: extract n-grams from the input document, identify candidate keyphrases, and finally score the candidates to select the required keyphrases. It was tested on a large number of Bengali documents taken from an online Bengali corpus.

Balabantaray et al. (2012) [8] proposed a keyword-extraction-based Odia text summarization system. The system accepts input text files with the .txt extension. First, a tokenizer splits the input text into individual words. A filter then removes stop words, and an Odia stemmer stems every remaining word. Each word is then assigned a weight, computed as the ratio of the word's frequency to the total number of words in the document. Next, sentences are scored according to the weights of their words: the final weight of a sentence is the sum of the weights of its words divided by the number of words in that sentence.

Das and Bandyopadhyay (2010) [9] developed a keyword-based Bengali opinion summarization system that identifies sentiment information in each text document, aggregates it, and presents it as a summary of the text. It uses a topic-sentiment model for sentiment detection and aggregation. The model is designed as discourse-level theme detection. Topic-sentiment aggregation is then performed by clustering the themes with the k-means approach and by a document-level relational graph representation. Finally, the document-level graph is used to select the summary sentences with suitable PageRank algorithms of the kind applied in information extraction. The technique was evaluated with an F-measure, recall and precision of 69.65%, 67.32% and 72.15%, respectively.

Das and Bandyopadhyay (2010) [10] proposed an approach for identifying topic keywords in annotated Bengali blog sentences. They built an unsupervised, syntactic system based on the argument structure of each sentence with respect to its verb. If the argument structure acquired for a Bengali blog sentence with respect to its verb matches any frame
syntax extracted for the same verb, with the same meaning, in the English VerbNet, then the topic and holder roles associated with the English VerbNet frames are mapped to the corresponding terms in the Bengali sentence. They used simple rule-based techniques to eliminate errors and to improve the performance of the syntactic system. The approach outperforms the baseline system, with F-measures of 66.03% and 61.98% compared with baseline F-measures of 53.85% and 50.02% in the case of multiple and single emotion holders and topics, respectively, on 500 reference test sentences.

Das and Ching (2005) [12] implemented a speaker-dependent Bengali keyword spotter for unconstrained English speech. They used two approaches, both of which apply whole-word hidden Markov models (HMMs) for the keywords; the Bengali keywords were trained as isolated words. The first approach uses whole-word filler models. The second approach uses trained English phoneme models together with a grammar network of all phones to model the filler part. The whole-word approach achieved an optimal performance of 94.22%, and the second approach achieved a hit rate of 95.83%.

J. Allan et al. (2003) [13] described their effort to develop a topic detection and tracking approach for Hindi news stories. The University of Massachusetts reported results for three topic detection and tracking tasks in the surprise language evaluation by DARPA. The approach was based on the vector space technique of information extraction. The paper describes the process used to generate the relevance judgements with which the system was evaluated. The results show that the effectiveness of topic tracking is comparable to that of topic detection methods for other languages. The results for clustering and new event detection indicate that the parameter settings for those tasks, as currently used, are language sensitive.

Kaur and Gupta (2011) [14] proposed topic tracking for the Punjabi language. The system was evaluated with two approaches: a named entity recognition (NER) based approach and a keyword extraction approach. The method determines whether any two Punjabi news articles discuss the same topic. Features are extracted from the text using both approaches, and the NER and keyword features of the initial news document are compared with the corresponding features of the target news document. The percentage of matching features, which indicates whether the two documents track the same topic, is then computed. The system was developed on the VB.NET platform, and several gazetteer lists were stored as tables in a database. It takes as input the news articles to be compared. These input documents were obtained from Punjabi websites such as likhari.org, jagbani.com, ajitweekly.com, punjabispectrum.com, europevichpunjabi.com, quamiekta.com, sahitkar.com, onlineindian.com, europesamachar.com and parvasi.com. Four experiments were carried out to implement topic tracking for Punjabi. The first experiment tested the NER module. The second experiment tested the keyword extraction module. The third experiment evaluated the topic tracking system using the NER technique alone and the keyword extraction technique alone, after which topic tracking was implemented by combining both techniques. In the last experiment, a number of similarity measures were analysed to determine which one gives the best results for topic tracking.

Dutta et al. (2005) [16] discussed a hybrid information retrieval model, based on keyword identification, for retrieving geographical information from unrestricted Hindi text. The relationship between the extracted geographical entities and the adjacent text is depicted graphically in order to relate information about those entities. The technique is a hybrid of linguistic and statistical methods and recognizes both single-word and multi-word geographical names. It is applied to Hindi text and can easily be adapted to other languages. The authors conducted numerical experiments to measure the accuracy of the technique.

Kothwal and Varma (2013) [17] proposed cross-lingual text reuse detection based on keyphrase extraction. This approach addressed the problem posed in the FIRE CLITR 2011 task of detecting plagiarized Hindi documents that were reused from English source documents. They proposed three approaches using classification and keyphrase extraction techniques, and the winning approach attained an F-measure of 0.792.

III. CONCLUSIONS

This paper presents a survey of recent keyword and topic extraction techniques for Indian languages. We can conclude from this survey that very few linguistic resources are available for Indian languages, and that developing these resources requires a lot of research and development. Keyword extraction systems and topic extraction systems for Indian languages are at an early stage of research. Although a sufficient amount of linguistic resources is available for Hindi, such resources are still lacking for other Indian languages.

REFERENCES

[1] Preeti and B. K. Sidhu, “Keyphrase Extraction From Punjabi Corpus”, International Journal of Engineering Research and Application, vol. 3, pp. 491-494, 2013.
[2] V. Gupta and G. S. Lehal, “Automatic Keywords Extraction for Punjabi Language”, International Journal of Computer Science Issues, vol. 8, pp. 327-331, 2011.
[3] K. Kaur and V. Gupta, “Keyword Extraction for Punjabi Language”, Indian Journal of Computer Science and Engineering, vol. 2, pp. 364-370, 2011.
[4] K. Sarkar, “An N-Gram Based Method for Bengali Keyphrase Extraction”, In Proceedings of International Conference ICISIL-2011, Springer, Patiala, India, pp. 36-41, 2011.
[5] S. Saraswathi, M. A. Siddhiqaa, K. Kalaimagal and M. Kalaiyarasi, “BiLingual Information Retrieval System for English and Tamil”, Journal of Computing, vol. 2, pp. 85-89, 2010.
[6] Jayashree R., Srikanta Murthy K. and Sunny K., “Document Summarization in Kannada using Keyword Extraction”, In Proceedings of AIAA 2011, CS & IT 03, pp. 121-127, 2011.
[7] K. Sarkar, “Automatic Key phrase Extraction from Bengali Documents: A Preliminary Study”, In Proceedings of IEEE Second International Conference on EAIT’11, pp. 125-128, 2011.
[8] R. C. Balabantaray, B. Sahoo, D. K. Sahoo and M. Swain, “Odia Text Summarization using Stemmer”, International Journal of Applied Information System, vol. 1, pp. 21-24, 2012.
[9] A. Das and S. Bandyopadhyay, “Topic-Based Bengali Opinion Summarization”, Coling 2010: Poster Volume, pp. 232-240, Beijing, 2010.
[10] D. Das and S. Bandyopadhyay, “Identifying Emotion Holder and Topic from Bengali Emotional Sentences”, Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, India, 2010.
[11] http://www.itl.nist.gov/iaui/894.01/tdt98/doc/tdtslides/sld001.htm, 1998.
[12] S. Das and P. C. Ching, “Speaker Dependent Bengali Keyword spotting in unconstrained English Speech”, A Project report, Indian Institute of Technology Guwahati, India, 2005.
[13] J. Allan, V. Lavrenko and M. E. Connell, “A month to topic detection and tracking in Hindi”, ACM Transactions on Asian Language Information Processing (TALIP), vol. 2, pp. 85-100, 2003.
[14] K. Kaur and V. Gupta, “Topic Tracking for Punjabi Language”, Computer Science & Engineering: An International Journal (CSEIJ), vol. 1, pp. 37-49, 2011.
[15] T. Nomoto and Y. Matsumoto, “Exploring the text structure for Topic Identification”, In Proceedings of the 4th Workshop on Very Large Corpora, pp. 101-112, 1996.
[16] R. Kothwal and V. Varma, “Cross Lingual Text Reuse Detection Based on Keyphrase Extraction and Similarity Measures”, Springer's Multilingual Information Access in South Asian Languages, Lecture Notes in Computer Science, vol. 7536, pp. 71-78, 2013.
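
As a concrete reading of the sentence scoring described in Section II for the Odia summarization system [8], the following minimal Python sketch assigns each word a weight equal to its frequency divided by the total number of words in the document, and scores each sentence as the sum of its word weights divided by its number of words. It is an illustration under stated assumptions, not the authors' implementation: whitespace tokenization stands in for the Odia tokenizer, stemming is omitted, and the function names and the stop_words parameter are hypothetical.

```python
from collections import Counter


def word_weights(tokens):
    """Weight of a word = its frequency / total number of tokens in the document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


def sentence_scores(sentences, stop_words=frozenset()):
    """Score each sentence as the sum of its word weights divided by its word count.

    Assumptions (not from the surveyed paper): whitespace tokenization,
    lowercasing, and no stemming.
    """
    # Tokenize every sentence and drop stop words.
    tokenized = [
        [w for w in sentence.lower().split() if w not in stop_words]
        for sentence in sentences
    ]
    # Word weights are computed over the whole document.
    weights = word_weights([w for tokens in tokenized for w in tokens])
    # A sentence with no remaining words gets a score of 0.0.
    return [
        sum(weights[w] for w in tokens) / len(tokens) if tokens else 0.0
        for tokens in tokenized
    ]


if __name__ == "__main__":
    doc = ["rain floods the city", "the city council met", "rain stopped"]
    scores = sentence_scores(doc, stop_words={"the"})
    # Sentences with higher scores are candidates for the summary.
    for sentence, score in sorted(zip(doc, scores), key=lambda p: -p[1]):
        print(f"{score:.3f}  {sentence}")
```

Dividing by the sentence length, as the description in [8] suggests, keeps long sentences from being favoured merely because they contain more words; a real system would add the stemming step before counting so that inflected forms share one weight.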

ISSN: 2231-5381 http://www.ijettjournal.org Page 343
