
School of Information Studies
Information Storage and Retrieval 783

INSTRUCTOR: Dr. Jin Zhang, Associate Professor
SCHOOL: School of Information Studies, UWM
CLASSROOM: Bolton 521
TIME: Thursday, 2:00 pm to 4:40 pm
SEMESTER: Fall 2009
OFFICE: Bolton 532
OFFICE HOURS: Wed, 1:00 pm to 2:00 pm
PHONE: 229-2712
EMAIL: jzhang@uwm.edu

1 Course Materials

Textbook:
[1] Information Storage and Retrieval by R. R. Korfhage, published by John Wiley & Sons in 1997. ISBN 0-471-14338-3.
[2] (Optional) Introduction to Information Retrieval by Manning, C.D., Raghavan, P. and Schütze, H., 2008, Cambridge University Press, ISBN-13: 9780521865715.
[3] (Optional) Information Storage and Retrieval Systems: Theory and Implementation, Second Edition. Gerald J. Kowalski, Mark T. Maybury, September 2000, Kluwer Academic Publishers, ISBN 0-7923-7924-1.

Other books:
[1] Text Information Retrieval Systems by Charles T. Meadow, published by Academic Press, Inc. in 2007, ISBN 0-12-369412-4. (Third Edition)
[2] Information Retrieval by C. J. van Rijsbergen, B.Sc., Dip. NAAC, Ph.D., M.B.C.S., F.I.E.E., C. Eng., F.R.S.E. The book is available at http://www.dcs.gla.ac.uk/Keith/Preface.html
[3] (Optional) Visualization for Information Retrieval by Zhang, J., 2008, Springer, ISBN: 978-3-540-75147-2

2 Course Description

This course focuses on the theory and concepts of information retrieval systems. It introduces the basic principles of information storage, processing, and retrieval from the standpoint of information retrieval system analysis and design. Knowledge of, experience with, or a background in information systems is preferred.

Pre-requisites: L&I Sci 571; or cons instr.

3 Course Credit

Graduate, 3 credits

4 Course Objectives

Generally speaking, information retrieval includes two different levels. The first is the effective use of information in an existing information retrieval system; it is external. The second addresses how information is processed within an information retrieval system; it is internal. This course focuses on the second level.

The topics in this course include query structure and its characteristics, the representation of documents and other objects within an information system, internal matching mechanisms, document analysis, users' perspective, reference points, retrieval effectiveness measures, alternative retrieval techniques, output presentation, data file structures, visualization for information, Internet search engines, and a discussion of current research trends in the field.

The aim of this course is to prepare students as information retrieval system analysts and designers. The objectives are:
- To outline basic terminology and components in information storage and retrieval systems
- To compare and contrast information retrieval models and internal mechanisms such as the Boolean, Probability, and Vector Space Models
- To outline the structure of queries and documents
- To articulate fundamental functions used in information retrieval such as automatic indexing, abstracting, and clustering
- To critically evaluate information retrieval system effectiveness and improvement techniques
- To understand the unique features of Internet-based information retrieval
- To describe current trends in information retrieval such as information visualization

5 Course Grading
Grade   Points     Description
A       96-100     superior work
A-      91-95
B+      88-90
B       84-86      satisfactory, but undistinguished work
B-      80-83
C+      77-79
C       74-76      work is below standard
C-      70-73
D+      67-69
D       64-66      unsatisfactory work
D-      60-63
F       below 60

Assignments include, but are not limited to, automatic indexing, presentations such as a relevance measure review, a comparison of different types of information retrieval systems, the application of expert systems in information retrieval, and information system use. A significant portion of your grade is determined by your individual assignments. It is extremely important that you understand the grading policies and earn high marks on your assignments.

Assignment                    Grade
Weekly Assignments (9)        45% (5% each)
Participation & Discussion    10%
Project                       45%

6 Attendance & Class Participation

Attendance is mandatory and class participation is expected. You will be graded on your participation and contributions to class discussions.

7 Course Schedule

Week 1 Introduction
Content: What is information retrieval, significance of information retrieval and storage, definition of information retrieval system, objectives of information retrieval system, function overview, relationships between digital library and IRS, abstraction, algorithm, data structure, measure of information systems, logical organization, physical organization, components of information retrieval systems, comparisons among different information systems, research topics in IR.

Reading:
Chapter 1, Introduction to Information Retrieval by Manning
Lecture notes
N. J. Belkin and W. B. Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38, 1992.

Week 2 Vocabulary control and data compression
Content: TREC, query, differences between documents and queries, types of documents, types of data structure, document surrogates, vocabulary control, structure of a thesaurus, structural representation, fine data structure, bit and byte, data compression, Huffman code technique.

Reading:
Chapter 2, Information Storage and Retrieval by Korfhage
Chapter 2, Introduction to Information Retrieval by Manning
Lecture notes

Week 3 Database file structures
Content: Information organization structures, sequential file, structure of a sequential file, inverted file, structure of an index file, tree structure, N-gram data structure, hash approach, signature file structure.

Reading:
Chapter 2, Information Storage and Retrieval Systems by Korfhage
Lecture notes

Week 4 Boolean retrieval systems
Content: Matching criteria, Boolean logic, limitations of Boolean logic, processing query expressions: reverse Polish expression, rules for operations.

Reading:
Chapters 3 and 4, Information Storage and Retrieval by Korfhage
Chapter 1, Introduction to Information Retrieval by Manning
Lecture notes

Week 5 Vector retrieval system
Content: Vector model, document-term matrix, methods for assigning weights to terms, query in the vector model, spatial representation of a document in the vector model, similarity between a query and a document (approach I), similarity between a query and a document (approach II), some considerations for the vector model.

Reading:
Chapters 3 and 4, Information Storage and Retrieval by Korfhage
Chapter 14, Introduction to Information Retrieval by Manning
Lecture notes

Week 6 Probability Retrieval System
Content: Basic concepts of probability, probability theory, statistical independence, Bayes' theorem, representation of documents in the probability model, discrimination function, probability search, assumptions of the probability model.

Reading:
Chapters 3 and 4, Information Storage and Retrieval by Korfhage
Chapter 11, Introduction to Information Retrieval by Manning
Lecture notes
S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SIGIR '94, pages 232-241, 1994.
Crestani, Fabio, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. Is this document relevant? ... probably: A survey of probabilistic models in information retrieval. ACM Computing Surveys 30(4):528-552.
Fuhr, Norbert. 1992. Probabilistic models in information retrieval. Computer Journal 35(3):243-255.

Week 7 Automatic indexing and abstracting
Content: Indexing, automatic indexing, purpose of indexing, why use automatic indexing, stop list approach, raw term frequency approach, normalized term frequency approach, inverse term frequency approach, and other considerations.

Reading:
Chapter 4, Introduction to Information Retrieval by Manning
Zhang J, and Nguyen T (2005). A new term significance weighting approach. Journal of Intelligent Information Systems, 24(1), 61-85.
Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), pp.503-520.
Salton, G., Allan, J., & Singhal, A.K. (1996). Automatic text decomposition and structuring. Information Processing and Management, 32(2), pp.127-138.
Cleverdon, Cyril W. 1991. The significance of the Cranfield tests on index languages. In Proc. SIGIR, pp. 3-12. ACM Press.

Week 8 Similarity measure algorithms
Content: Data fusion, term association, general similarity measures, similarity measures in the vector retrieval model, comparisons of the two kinds of similarity approaches, extended user profile, current awareness systems, retrospective search systems, reference point, modifying the query by the user profile.

Reading:
Chapters 4 and 5, Information Storage and Retrieval by Korfhage
Lecture notes
Zhang J, and Rasmussen E (2001). Developing a new similarity measure from two different perspectives. Information Processing & Management, 37(2), 279-294.
Zhang J, and Rasmussen E (2002). An experimental study on the iso-content-based angle similarity measure. Information Processing & Management, 38(3), 325-342.
A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity in document retrieval systems. Journal of the American Society for Information Science, 37:3-11, 1986.

Bartell, Brian T., Garrison W. Cottrell, and Richard K. Belew. 1998. Optimizing similarity using multi-query relevance feedback. JASIS 49(8):742-761.
Moffat, Alistair, and Justin Zobel. 1998. Exploring the similarity space. SIGIR Forum 32(1).

Week 9 Automatic clustering approaches
Content: Definition of automatic clustering, criteria of clustering, differences between clustering and classification, significance of a clustering approach in IR, categorization of clustering algorithms, non-hierarchical clustering algorithm, the K-means clustering algorithm, K-means in SPSS, hierarchical clustering algorithm, hierarchical clustering in SPSS.

Reading:
Chapters 16 and 17, Introduction to Information Retrieval by Manning
Lecture notes
Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes and R. Baeza-Yates (Eds.), Information retrieval: data structures & algorithms (pp.419-442). Englewood Cliffs, NJ: Prentice Hall.
Zhao, Y. and Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document databases. In Proceedings of the eleventh international conference on Information and knowledge management, pp.515-524. November 04-09, 2002, McLean, Virginia, USA.
Hamerly, Greg, and Charles Elkan. 2003. Learning the k in k-means. In Proc. NIPS. URL: books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf.
Jain, Anil, M. Narasimha Murty, and Patrick Flynn. 1999. Data clustering: A review. ACM Computing Surveys 31(3):264-323.
Murtagh, Fionn. 1983. A survey of recent advances in hierarchical clustering algorithms. Computer Journal 26(4):354-359.

Week 10 Theory: Information Visualization
Content: Visualization, visualization for information retrieval, analysis of traditional information retrieval systems, navigation problems on the WWW, why use visualization for information retrieval, core of visualization for information retrieval, functionality of visualization, Boolean-based information retrieval system, non-Boolean-based information retrieval system, visualization of web-based information, considerations from cognitive engineering, history of visualization, technical environment for visualization, potential research topics.

Reading:
Chapter 1, Visualization for Information Retrieval by Zhang
Lecture notes
Card, S.K., Mackinlay, J.D., and Shneiderman, B. (1999). Readings in information visualization: using vision to think. San Francisco: Morgan Kaufmann, pp. 1-34.
Hearst, M.A. (1999). User interfaces and visualization. In R. Baeza-Yates and B. Ribeiro-Neto, editors, Modern Information Retrieval, chapter 10, pp. 257-323. Addison Wesley, Harlow.
Keim, D. A. (2001). Visual exploration of large data sets. Communications of the ACM, 44(8), pp.38-44.

Week 11 Systems: Information Visualization
Content: Visualization systems, VIBE, DARE, Visual thesaurus, Inxight, Reveal things, TileBars, SQWID, JAIR INFORMATION SPACE, WebMap, Excentric Labeling, Tree map, LifeLines, Web Brain, NiF Elastic Catalog, Dynamic Diagrams, Health InfoPark.

Reading:
Chapter 8, Visualization for Information Retrieval by Zhang
Lecture notes
Benford S, Snowdon D, Greenhalgh C, Ingram R, and Knox I (1995). VR-VIBE: A Virtual Environment for Co-operative Information Retrieval. Proceedings of Eurographics '95, August 30th-September 1st, 1995, Maastricht, pp.349-360.
Chen C (1999). Visualising semantic spaces and author co-citation networks in digital libraries. Information Processing and Management, 35(3), 401-420.
Hearst MA (1995). TileBars: visualization of term distribution information in full text information access. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems '95, May 7-11, 1995, Denver, Colorado, pp. 59-66.
Korfhage RR, and Olsen KA (1994). The role of visualization in document analysis. Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval '94, April 11-13th, 1994, Las Vegas, Nevada, pp.199-207.
Nuchprayoon, A. and Korfhage, R.R. (1997). GUIDO: visualizing document retrieval. Proceedings of the IEEE Information Visualization Symposium '97, September 23-26, 1997, Isle Capri, Italy, pp.184-188.
Zhang J, and Korfhage RR (1999). DARE: Distance and Angle Retrieval Environment: A Tale of the Two Measures. Journal of the American Society for Information Science, 50(9), 779-787.

Week 12 Internet Information Retrieval
Content: Challenges in the Web, language distribution, centralized architecture, crawlers, jargon, crawling the Web, breadth-first approach, depth-first approach, crawling approach, web page ranking, meta-search, considerations for meta-search engines, trends.

Reading:
Lecture notes
Chapters 9 and 10, Introduction to Information Retrieval by Manning
Brin, Sergey, and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In Proc. WWW, pp. 107-117.
Broder, Andrei. 2002. A taxonomy of web search. SIGIR Forum 36(2):3-10.
Gerrand, Peter. 2007. Estimating linguistic diversity on the internet: A taxonomy to avoid pitfalls and paradoxes. Journal of Computer-Mediated Communication 12(4). URL: jcmc.indiana.edu/vol12/issue4/gerrand.html.
Glover, Eric J., Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, and Gary W. Flake. 2002. Using web structure for classifying and describing web pages. In Proc. WWW, pp. 562-569. ACM Press.

Week 13 Evaluation issues
Content: Seven criteria for evaluation of information retrieval, average recall and average precision, harmonic mean, evaluation of a search engine, relevance issue, Kappa measure, quality versus quantity, possible factors which influence the outcome of a search, Cranfield experimental study.

Reading:
Chapter 8, Information Storage and Retrieval by Korfhage
Chapter 8, Introduction to Information Retrieval by Manning
Lecture notes
Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: Nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(13), 1915-1933.
Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 58(13), 2126-2144.
Harter, Stephen P. 1998. Variations in relevance assessments and the measurement of retrieval effectiveness. JASIS 47:37-49.

Week 14 Student presentation

Note:
* If you are a student with special needs, please feel free to discuss them with the instructor.
* The schedule may be changed.
Summary:

Week 1,  Sept 3     Introduction
Week 2,  Sept 10    Vocabulary control and data compression
Week 3,  Sept 17    Database file structures
Week 4,  Sept 24    Boolean retrieval systems
Week 5,  Oct 1      Vector retrieval system
Week 6,  Oct 8      Probability Retrieval System
Week 7,  Oct 15     Automatic indexing and abstracting
Week 8,  Oct 22     Similarity measure algorithms
Week 9,  Oct 29     Automatic clustering approaches
Week 10, Nov 5      Theory: Information Visualization
Week 11, Nov 12     Systems: Information Visualization
Week 12, Nov 19     Internet Information Retrieval
Week 13, Nov 26     No class (Thanksgiving recess)
Week 14, Dec 3      Evaluation issues
Week 15, Dec 10     Project presentation

Term paper topic list

Students will develop a 15-20 page paper on one of the topics listed below. Papers will characterize current issues associated with the topic, discuss the state of the art of the topic, evaluate sample systems, and outline future directions for the area. Papers must integrate a minimum of 15 relevant sources. Papers should use the American Psychological Association (APA) style (http://apastyle.apa.org/).

[1] Music information retrieval
[2] Image information organization and retrieval
[3] Automatic indexing theory and practice
[4] Evaluation of search engines
[5] On Boolean-based information retrieval system
[6] Comparison between Boolean-based and Vector-based information systems
[7] Visualization of information: theoretical aspect
[8] Visualization of information: system aspect
[9] Automatic indexing/abstracting theory and practice
[10] Other information retrieval models
[11] Evaluation of an information visualization system

<SOIS and University Policies to be inserted>
