
Performance Enhancement and Customization of Information Storage and Retrieval System

Synopsis for the Degree of Doctor of Philosophy in the Department of Engineering, submitted to

MEWAR UNIVERSITY, GANGRAR, CHITTORGARH (RAJASTHAN)
Research Supervisor: Dr. Suresh Jain
Research Scholar: Dharmendra Sharma

Agenda
Introduction
Literature survey
Objective of research work
Proposed Methodology

Introduction
Information can be organized in structured, semi-structured, and unstructured form
Information storage and retrieval systems: Google, LexisNexis, Dialog
Language dependency
  Multiple spellings: color / colour
  Word ambiguity: bat (cricket bat or the animal)
  Context: "I eat what I see" and "I see what I eat"

Introduction
Web search analysis for AltaVista
  Informational: 48%
  Transactional: 30%
  Navigational: 22%

Balance of Recall and Precision

Literature survey
Historical development of ISR systems
Main problems of information storage and retrieval
  Document or query indexing and translation
  Information storage and retrieval model
  System evaluation

Document indexing
H.C. Yang and C.H. Lee proposed three different strategies
  Dictionary based
  Thesaurus based
  Corpus based

Hull and Grefenstette: dictionary-based context evaluation without word sense disambiguation

Word Sense Disambiguation

There are two ways to solve WSD
  Supervised learning
  Unsupervised learning

P. Bhattacharyya evaluates the context
  Context of a word in the sentence
  Context evaluated using WordNet
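The idea of choosing a sense from the surrounding words can be illustrated with a minimal sketch. It does not use WordNet; instead it assumes a tiny hand-written sense inventory (`SENSES`, with made-up sense labels and glosses) and a simplified Lesk-style overlap count, so it only shows the principle and is not the method of the cited work.

```python
# Hypothetical sense inventory: sense labels and glosses are illustrative only.
SENSES = {
    "bat": {
        "bat_sports": "club used to hit the ball in cricket or baseball",
        "bat_animal": "nocturnal flying mammal that lives in caves",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "the", "in", "of", "to", "or", "that", "with", "his", "he", "at", "out"}


def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the sentence context."""
    context = set(sentence.lower().split()) - STOPWORDS - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses[word].items():
        gloss_words = set(gloss.split()) - STOPWORDS
        overlap = len(context & gloss_words)
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense_id, overlap
    return best_sense


sports = disambiguate("bat", "he hit the ball with his bat in the cricket match")
animal = disambiguate("bat", "the bat flew out of the caves at night")
```

With a real lexical resource, the glosses and sense labels would come from that resource rather than from a hard-coded dictionary.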

ISR Models
The document set is converted into a suitable representation
There are three types of representation or model
  Boolean model
  Probabilistic model
  Algebraic model

Boolean model
Standard Boolean model

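Based on classical set theory
Original documents: $O = \{O_1, O_2, O_3, \dots, O_n\}$
Set of terms: $T = \{t_1, t_2, t_3, t_4, \dots, t_n\}$
Document set: $D = \{d_1, d_2, d_3, \dots, d_i, \dots, d_n\}$, where each $d_i$ is a subset of $T$ (an element of the power set of $T$)
Example: $d_1 = \{t_1, t_2\}$, $d_2 = \{t_2, t_3\}$, $d_3 = \{t_2, t_4\}$
Query: $Q = t_2 \wedge t_3$
Documents retrieved for $t_2$: $\{d_1, d_2, d_3\}$; for $t_3$: $\{d_2\}$
Compute $\{d_1, d_2, d_3\} \cap \{d_2\}$
The result is $\{d_2\}$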
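The worked example on this slide can be reproduced with plain set operations. The sketch below assumes documents are stored as Python sets of term labels; the function names are illustrative.

```python
# Toy collection from the slide: each document is a set of index terms.
documents = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t2", "t4"},
}


def docs_containing(term, docs):
    """Return the ids of all documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


def boolean_and(query_terms, docs):
    """Conjunctive query: intersect the result sets of all query terms."""
    result = set(docs)
    for term in query_terms:
        result &= docs_containing(term, docs)
    return result


t2_docs = docs_containing("t2", documents)      # documents matching t2
t3_docs = docs_containing("t3", documents)      # documents matching t3
answer = boolean_and(["t2", "t3"], documents)   # intersection of both sets
```

Because the match is exact, a document either belongs to the answer set or does not; there is no score to rank by.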

Standard Boolean model(Cont.)


Pros
Clean formalism
Easy to implement

Cons
Exact matching may retrieve too few or too many documents
All terms have equal weight
Results are hard to rank
More like data retrieval than information retrieval

Boolean model
Extended Boolean model
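Documents are represented as vectors
The ith term corresponds to the ith dimension
Term weights are calculated as tf-idf
$V_{d_j} = \{w_{1j}, w_{2j}, \dots, w_{nj}\}$, where $w_{ij}$ is the weight of term $i$ in document $d_j$
Let the weights of terms $k_1$ and $k_2$ in a document be $w_1$ and $w_2$. Then:
$Q(k_1 \text{ or } k_2) = \sqrt{\frac{w_1^2 + w_2^2}{2}}$
$Q(k_1 \text{ and } k_2) = 1 - \sqrt{\frac{(1 - w_1)^2 + (1 - w_2)^2}{2}}$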
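A minimal sketch of the two-term similarity functions, assuming the p-norm form with p = 2 from the extended Boolean model of Salton, Fox and Wu and term weights already normalized to the range [0, 1]. The function names are illustrative.

```python
import math


def sim_or(w1, w2):
    """OR query: distance of (w1, w2) from the point (0, 0), scaled to [0, 1]."""
    return math.sqrt((w1 ** 2 + w2 ** 2) / 2)


def sim_and(w1, w2):
    """AND query: one minus the scaled distance of (w1, w2) from the point (1, 1)."""
    return 1 - math.sqrt(((1 - w1) ** 2 + (1 - w2) ** 2) / 2)


# A document that contains only one of the two terms still gets a partial score.
partial_or = sim_or(1.0, 0.0)
partial_and = sim_and(1.0, 0.0)
```

Unlike the standard Boolean model, both functions return graded values, so the documents can be ranked.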

Boolean model
Fuzzy retrieval
Membership is defined by degree
  Mixed Min and Max (MMM) model
  Paice model

Neither model provides a way to evaluate query weights
Query weights are handled by the P-norm algorithm
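As a rough illustration of the Mixed Min and Max idea, the sketch below scores a document as a weighted mix of the largest and smallest term membership degrees. The coefficient value 0.7 is an arbitrary assumption for the example, and the function names are illustrative.

```python
def mmm_or(weights, c_or1=0.7):
    """OR query: mostly the maximum membership, softened by the minimum.

    c_or1 is an assumed coefficient; the remaining weight 1 - c_or1 goes to the minimum.
    """
    return c_or1 * max(weights) + (1 - c_or1) * min(weights)


def mmm_and(weights, c_and1=0.7):
    """AND query: mostly the minimum membership, softened by the maximum.

    c_and1 is an assumed coefficient; the remaining weight 1 - c_and1 goes to the maximum.
    """
    return c_and1 * min(weights) + (1 - c_and1) * max(weights)


memberships = [0.2, 0.8]          # degrees of membership of one document in two term sets
or_score = mmm_or(memberships)
and_score = mmm_and(memberships)
```

Setting either coefficient to 1 gives the pure fuzzy-set max (OR) and min (AND).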

Probabilistic model
Binary Independence model
Uncertain Inference
Language model

Probabilistic model
Binary Independence model
Introduced by Yu and Salton
Documents are treated as binary vectors: only the presence or absence of a term in a document is recorded, as 1 or 0
Terms are assumed to be independently distributed in relevant and irrelevant documents
Each document is represented as an ordered set of Boolean variables
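A minimal sketch of how these assumptions lead to a ranking score: under term independence, a document's score is the sum of log odds-ratio weights of the query terms it contains. The probabilities `p` (term present given relevant) and `u` (term present given non-relevant) are hypothetical values chosen for the example; in practice they have to be estimated.

```python
import math


def bim_term_weight(p, u):
    """Log odds ratio for one term.

    p: assumed probability that the term occurs in a relevant document
    u: assumed probability that the term occurs in a non-relevant document
    """
    return math.log((p * (1 - u)) / (u * (1 - p)))


def bim_score(query_terms, doc_terms, p, u):
    """Sum the weights of query terms present in the (binary) document."""
    return sum(bim_term_weight(p[t], u[t]) for t in query_terms if t in doc_terms)


# Hypothetical probability estimates for two query terms.
p = {"grid": 0.8, "cluster": 0.6}
u = {"grid": 0.2, "cluster": 0.3}
query = ["grid", "cluster"]

score_both = bim_score(query, {"grid", "cluster", "data"}, p, u)
score_one = bim_score(query, {"grid", "data"}, p, u)
score_none = bim_score(query, {"data"}, p, u)
```

Only presence or absence matters here, so a document mentioning a term once scores the same as one mentioning it many times.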

Probabilistic model
Uncertain inference
Proposed by van Rijsbergen
The measure of uncertainty of document d with respect to query q is the probability of the logical implication P(d -> q)
The system's task is to infer, for a given document, whether the query assertions are true
A knowledge base of facts and rules is used

Probabilistic model
Language Model
A language model is associated with each document
Documents are ranked by the probability that the document's language model would generate the terms of the query
Unigram model
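A minimal sketch of unigram query-likelihood ranking. The slide does not mention smoothing; the sketch assumes linear interpolation with a collection model (mixing weight `lam = 0.5`, an arbitrary choice) so that a document missing one query term does not get probability zero.

```python
from collections import Counter


def query_likelihood(query, doc, collection, lam=0.5):
    """Probability that the document's unigram model generates the query.

    query, doc, collection: lists of tokens
    lam: assumed interpolation weight between document and collection model
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 1.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score


doc1 = ["information", "retrieval", "system"]
doc2 = ["grid", "computing", "system"]
collection = doc1 + doc2
query = ["information", "retrieval"]

score1 = query_likelihood(query, doc1, collection)
score2 = query_likelihood(query, doc2, collection)
```

Here `doc1` ranks above `doc2` because its own model assigns non-zero probability to both query terms.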

Algebraic model
Vector space model
Latent semantic model

Algebraic model
Vector space model
Documents and queries are represented by terms
Term weights are assigned by the tf-idf scheme
The ith term corresponds to the ith dimension of the vector
Similarity is calculated as the correlation between the query vector and the document vector
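A minimal sketch of the vector space model, assuming raw term frequency multiplied by `log(N / df)` as the tf-idf weight and cosine similarity as the correlation measure. Both are common choices, but they are assumptions here, since the slide does not fix a particular variant.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """idf(term) = log(N / df), where df counts documents containing the term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def tf_idf_vector(tokens, idf):
    """Sparse vector {term: tf * idf}; terms unknown to the collection are dropped."""
    counts = Counter(tokens)
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["cricket", "bat", "ball"],
    ["bat", "cave", "animal"],
    ["grid", "computing"],
]
idf = inverse_document_frequency(docs)
query_vec = tf_idf_vector(["cricket", "bat"], idf)
scores = [cosine(query_vec, tf_idf_vector(doc, idf)) for doc in docs]
```

The first document shares both query terms and scores highest, the second shares only the more common term "bat", and the third shares nothing.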

Algebraic model
Latent semantic indexing
Mathematical technique (singular value decomposition)
Words used in the same contexts tend to have similar meanings
Identifies patterns and relationships between terms and concepts

System Evaluation
Precision
Recall
Fall-out
F-measure
Average precision
Mean average precision
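The measures listed above can be sketched with their standard set-based definitions. The sketch assumes that `retrieved` is a ranked list of document ids drawn from the collection, `relevant` is the set of relevant ids, and the F-measure is the balanced harmonic mean; the function names are illustrative.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved = set(retrieved)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0


def fall_out(retrieved, relevant, collection_size):
    """Fraction of non-relevant documents in the collection that were retrieved."""
    non_relevant = collection_size - len(relevant)
    false_hits = len(set(retrieved) - relevant)
    return false_hits / non_relevant if non_relevant else 0.0


def f_measure(p, r):
    """Balanced harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


ranked = ["d1", "d2", "d3", "d4"]     # system output, best first
relevant = {"d1", "d3", "d5"}         # ground truth for the query
p = precision(ranked, relevant)
r = recall(ranked, relevant)
fo = fall_out(ranked, relevant, collection_size=10)
f1 = f_measure(p, r)
ap = average_precision(ranked, relevant)
```

Precision and recall pull in opposite directions as more documents are retrieved, which is why the F-measure and average precision are reported alongside them.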

Impact of research on other domains


Research on information storage and retrieval systems draws on achievements and techniques in several related areas.
Information access: document indexing, retrieval, filtering, clustering, presentation and summarization of information, cross-language information retrieval.
Machine translation: comparable and parallel text alignment, language generation.
Computational linguistics: morphological analysis, syntactic parsing, techniques for disambiguation, document segmentation, corpus analysis, term recognition and term expansion.

Objectives of proposed work


i. Investigation of various techniques useful for information storage and retrieval, and their comparison.
ii. Application of grid computing and domain knowledge as supervised learning to eliminate word sense ambiguity from the query.
iii. Analysis of models for storing the data and improvement of document representation by using the following methods:
  Clustering techniques
  Concept mapping
iv. Evaluation and customization of information storage and retrieval models according to users' needs.
v. Experimental evaluation of the developed algorithms.

Proposed Methodology
Grid computing for semantic database
Domain creation by knowledge base
Clustering algorithms
Concept mapping for mapping between clusters

Methodology
Grid computing
Language limitations: color / colour, bat
"I eat what I see" / "I see what I eat"
Combining different domains to achieve a common goal
Difficulty in corpus analysis
Domain creation on the basis of a knowledge base
Parallel processing for different semantic networks

Methodology
Clustering
Research gap
Grouping of data into meaningful categories
Document descriptors and descriptor extraction
The following algorithms will be used:
  Hierarchical algorithms: merging (agglomerative) and dividing (divisive) the documents
  Ontology-supported clustering
  Graph-based clustering

Concept map
Mapping between clusters
How ideas are created from words
Decision-making system
Words and concepts are related to each other and to the whole idea

Methodology
The following criteria will be used to evaluate performance:
Turnaround time
Response time
Precision

Methodology
Recall
Fall-out
F-measure

Bibliography
[1] Wissam Tawileh, Chair of Business Informatics, "Exploring web search behavior of Arab internet users", IEEE International Conference on Innovation in Information Technology, Dresden, Germany, 2011.
[2] D. B. Cleveland and A. D. Cleveland, Introduction to Indexing and Abstracting, Englewood, CO: Libraries Unlimited, Inc., 1990.
[3] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation, 28, 11-21, 1972.
[4] G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, 24, 513-523, 1988.
[5] H. C. Yang and C. H. Lee, "Multi Lingual Information Retrieval", International Conference on Intelligent System Design and Application, Kaohsiung, Taiwan, 2008.
[6] D. A. Hull and G. Grefenstette, "Querying across languages: a dictionary-based approach to multilingual information retrieval", International Conference on Research and Development in Information Retrieval, 1996.
[7] Singhal, Amit, "Modern Information Retrieval: A Brief Overview", Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2001.
[8] Maron, Melvin E., "An Historical Note on the Origins of Probabilistic Indexing", Information Processing and Management, 2008.
[9] Pushpak Bhattacharyya, M. Sinha, M. K. Reddy, P. Pande, L. Kashyap, "Hindi Word Sense Disambiguation", Indian Institute of Technology, Bombay, India, 2011.
[10] Lashkari, Mahdavi, Ghomi, "A Boolean Model in Information Retrieval for Search Engines", 2009.

Bibliography
[11] Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (standard Boolean model).
[12] Turpin, Andrew; Scholer, Falk (2006), "User performance versus precision measures for simple search tasks", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, USA, August 06-11, 2006), New York, NY: ACM, pp. 11-18. doi:10.1145/1148170.1148176. ISBN 1595933697.
[13] Salton, Gerard; Edward A. Fox; Harry Wu (1983), "Extended Boolean information retrieval", Communications of the ACM, Volume 26, Issue 11.
[14] Lee, W. C.; E. A. Fox (1988), "Experimental Comparison of Schemes for Interpreting Boolean Queries".
[15] Kang, Bo-Yeong; Dae-Won Kim; Hae-Jung Kim (2005), "Fuzzy Information Retrieval Indexed by Concept Identification", Springer Berlin / Heidelberg. Zadrozny, Sławomir; Nowacka, Katarzyna (2009), "Fuzzy information retrieval model revisited", Elsevier North-Holland, Inc., doi:10.1016/j.fss.2009.02.012.
[16] Fox, E. A.; S. Sharat (1986), "A Comparison of Two Methods for Soft Boolean Interpretation in Information Retrieval", Technical Report TR-86-1, Virginia Tech, Department of Computer Science.
[17] Ding, C., "A Similarity-based Probability Model for Latent Semantic Indexing", Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 59-65.
[18] Deerwester, S., et al., "Improving Information Retrieval with Latent Semantic Indexing", Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40.
[19] Bartell, B., Cottrell, G., and Belew, R., "Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling", Proceedings, ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 161-167.
[20] Dumais, S., and Nielsen, J., "Automating the Assignment of Submitted Manuscripts to Reviewers", Proceedings of the Fifteenth Annual International Conference on Research and Development in Information Retrieval, 1992, pp. 233-244.

Bibliography
[21] Berry, M. W., and Browne, M., Understanding Search Engines: Mathematical Modeling and Text Retrieval, Society for Industrial and Applied Mathematics, Philadelphia, 2005.
[22] Sparc, Robertson, "Using Latent Semantic Analysis to Identify Similarities in Source Code to Support Program Understanding", Proceedings of 12th IEEE International Conference on Tools with Artificial Intelligence, Vancouver, British Columbia, November 13-15, 2000, pp. 46-53.
[23] G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing", Communications of the ACM, vol. 18, no. 11, pp. 613-620. (Article in which a vector space model was presented.)
[24] David Dubin, "The Most Influential Paper Gerard Salton Never Wrote", 2004. (Explains the history of the Vector Space Model and the non-existence of a frequently cited publication.)
[25] Yarowsky, D., and Florian, R., "Taking the Load off the Conference Chairs: Towards a Digital Paper-routing Assistant", Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very-Large Corpora, 1999, pp. 220-230.
[26] Soboroff, I., et al., "Visualizing Document Authorship Using N-grams and Latent Semantic Indexing", Workshop on New Paradigms in Information Visualization and Manipulation, 1997, pp. 43-48.
[27] Yuanhua Lv and ChengXiang Zhai, "Positional Language Models for Information Retrieval", Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009.
[28] Buyya, Rajkumar, "Grid Computing: Making the Global Cyberinfrastructure for eScience a Reality", CSI Communications (Mumbai, India: Computer Society of India (CSI)) 29 (1). Francesco Lelli, Eric Frizziero, Michele Gulmini, Gaetano Maron, Salvatore Orlando, Andrea Petrucci and Silvano Squizzato, "The many faces of the integration of instruments and the grid", International Journal of Web and Grid Services, 2007, Vol. 3, No. 3, pp. 239-266.
[29] Goodrum, Abby A., "Image Information Retrieval: An Overview of Current Research", Informing Science, 2000.
[30] J. M. Ponte and W. B. Croft, "A Language Modeling Approach to Information Retrieval", Research and Development in Information Retrieval, pp. 275-281, 1998.

Bibliography

[31] Beel, Jöran; Gipp, Bela; Stiller, Jan-Olaf, "Information Retrieval On Mind Maps - What Could It Be Good For?", Proceedings of the 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2009.
[32] Benedict, Shajulin; Vasudevan, "A Niched Pareto GA approach for scheduling scientific workflows in wireless Grids", Journal of Computing and Information Technology 16: 101, 2008.
[33] R. Korra, P. Sujatha, Sidige, Chetana, N. Kumar, "Performance Evaluation of Multilingual Information Retrieval System over Information Retrieval System", IEEE International Conference on Recent Trends in Information Technology, ICRTIT 2011, MIT, Anna University, Chennai, June 3-5, 2011.
[34] Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A., "On Exploring Complex Relationships of Correlation Clusters", 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), p. 7, 2007.
[35] Auffarth, B., "Clustering by a Genetic Algorithm with Biased Mutation Operator", WCCI CEC, IEEE, July 18-23, 2010.
[36] Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A., "Mining Hierarchies of Correlation Clusters", Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119-128, 2006.
[37] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2:283-304, 1998.
[38] Joseph D. Novak and Alberto J. Cañas, "The Theory Underlying Concept Maps and How To Construct and Use Them", Institute for Human and Machine Cognition. Accessed 24 Nov 2008.
[39] Moon, B. M., Hoffman, R. R., Novak, J. D., and Cañas, A. J., Applied Concept Mapping: Capturing, Analyzing and Organizing Knowledge, CRC Press: New York, 2011.
[40] Anderson, J. R., and Lebiere, C. (1998), The Atomic Components of Thought, Mahwah, NJ: Erlbaum.

Thank You
